[ 
https://issues.apache.org/jira/browse/IGNITE-11555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Goncharuk reassigned IGNITE-11555:
-----------------------------------------

    Assignee: Alexey Goncharuk

> Unable to await partitions release latch caused by coordinator failover
> -----------------------------------------------------------------------
>
>                 Key: IGNITE-11555
>                 URL: https://issues.apache.org/jira/browse/IGNITE-11555
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Alexey Goncharuk
>            Assignee: Alexey Goncharuk
>            Priority: Critical
>             Fix For: 2.8
>
>
> Currently, exchange latches (both server and client) are deleted as soon as 
> the latch completes. This leads to a hang in the following scenario:
> 1) A grid with several nodes starts exchange latch sync
> 2) All nodes send acks to the coordinator
> 3) The coordinator processes the acks and sends final acks to some of the nodes
> 4) These nodes process the final acks, then complete and delete their client 
> latches
> 5) The coordinator fails
> 6) Nodes which did not receive final acks re-send their acks to the new 
> coordinator
> 7) Since the new coordinator has already completed and deleted its client 
> latch, it does not process the re-sent acks correctly and only adds them to 
> the pending messages.
> The root cause of this issue appears to be latch deletion on final ack. We 
> can safely delete the latch only when all nodes are guaranteed to have 
> processed the messages. Luckily, since the latch is tied to the exchange 
> process, we can safely delete the latch when the corresponding exchange 
> completes.
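
The scenario above can be modelled with a small, self-contained sketch. It is not Ignite's latch manager code: the class LatchSketch, its methods, the AckResult values, and the deleteOnFinalAck switch are all hypothetical names introduced only to illustrate why a node that deleted its latch on final ack cannot answer a re-sent ack once it becomes coordinator, and why deferring deletion until the exchange completes avoids that.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of latch bookkeeping on a single node; not Ignite's real classes. */
class LatchSketch {
    /** Possible outcomes when a node acting as coordinator receives an ack. */
    enum AckResult { COUNTED, FINAL_ACK_SENT, PENDING }

    /** Latch id -> completed flag. An absent id means the node has no record of the latch. */
    private final Map<String, Boolean> latches = new HashMap<>();

    /** true models the reported behaviour, false models the proposed one. */
    private final boolean deleteOnFinalAck;

    LatchSketch(boolean deleteOnFinalAck) {
        this.deleteOnFinalAck = deleteOnFinalAck;
    }

    /** Step 1: the node joins the latch sync for an exchange. */
    void createLatch(String id) {
        latches.put(id, false);
    }

    /** Step 4: the node receives the final ack from the coordinator. */
    void onFinalAck(String id) {
        if (deleteOnFinalAck)
            latches.remove(id);      // reported behaviour: the latch is forgotten
        else
            latches.put(id, true);   // proposed behaviour: keep it, marked completed
    }

    /** Steps 6-7: the node, now the new coordinator, receives a re-sent ack. */
    AckResult onResentAck(String id) {
        Boolean completed = latches.get(id);

        if (completed == null)
            return AckResult.PENDING; // unknown latch: the ack is only parked, the sender hangs

        return completed ? AckResult.FINAL_ACK_SENT : AckResult.COUNTED;
    }

    /** The exchange has completed, so no node can still need this latch. */
    void onExchangeDone(String id) {
        latches.remove(id);
    }

    boolean hasLatch(String id) {
        return latches.containsKey(id);
    }

    public static void main(String[] args) {
        for (boolean deleteEarly : new boolean[] {true, false}) {
            LatchSketch node = new LatchSketch(deleteEarly);

            node.createLatch("exch-1");
            node.onFinalAck("exch-1");

            System.out.println("deleteOnFinalAck=" + deleteEarly + " -> " + node.onResentAck("exch-1"));
        }
    }
}
```

In this model the only change between the two behaviours is when the map entry is removed: on final ack, or in onExchangeDone. A real fix would also have to handle acks that arrive before the latch is created, which this sketch leaves out.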



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
