[jira] [Commented] (IGNITE-11555) Unable to await partitions release latch caused by coordinator failover

Alexey Goncharuk (JIRA) Tue, 19 Mar 2019 01:25:08 -0700


    [ 
https://issues.apache.org/jira/browse/IGNITE-11555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795820#comment-16795820
 ]


Alexey Goncharuk commented on IGNITE-11555:
-------------------------------------------

[~Jokser] can you take a look at the changes?

> Unable to await partitions release latch caused by coordinator failover
> -----------------------------------------------------------------------
>
>                 Key: IGNITE-11555
>                 URL: https://issues.apache.org/jira/browse/IGNITE-11555
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Alexey Goncharuk
>            Assignee: Alexey Goncharuk
>            Priority: Critical
>             Fix For: 2.8
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, exchange latches (both server and client) are deleted as soon as 
> the latch completes. This leads to a hang in the following scenario:
> 1) A grid with several nodes starts exchange latch sync
> 2) All nodes send acks to the coordinator
> 3) The coordinator processes the acks and sends final acks to some of the nodes
> 4) These nodes process the final acks, then complete and delete their client latches
> 5) The coordinator fails
> 6) Nodes which did not receive final acks re-send their acks to the new coordinator
> 7) Since the new coordinator has already completed and deleted its client latch, 
> it does not process the re-sent ack correctly and only adds it to the pending 
> messages.
> The root cause of this issue appears to be latch deletion on final ack. We 
> can safely delete the latch only when all nodes are guaranteed to have 
> processed the messages. Luckily, since the latch is tied to the exchange 
> process, we can safely delete the latch when the corresponding exchange completes.
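
The scenario in the description can be illustrated with a toy model. This is only a sketch under stated assumptions: LatchRegistry, onFinalAck, onResentAck, onExchangeDone and the latch id "exchange-1" are hypothetical names invented for this example and are not Ignite's actual classes or API. It models a single node that has already received its final ack (step 4) and then becomes the new coordinator (steps 5-7), under both deletion policies.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of exchange latch bookkeeping. Hypothetical; not Ignite code. */
public class LatchRetentionSketch {
    /** Hypothetical per-node latch registry. */
    static final class LatchRegistry {
        private final boolean deleteOnFinalAck;
        /** Latch id -> completed flag. */
        private final Map<String, Boolean> latches = new HashMap<>();
        /** Acks that could not be matched to a known latch. */
        private final List<String> pendingAcks = new ArrayList<>();

        LatchRegistry(boolean deleteOnFinalAck) {
            this.deleteOnFinalAck = deleteOnFinalAck;
        }

        void createLatch(String id) {
            latches.put(id, false);
        }

        /** Step 4: this node receives the final ack from the old coordinator. */
        void onFinalAck(String id) {
            if (deleteOnFinalAck)
                latches.remove(id);    // current behaviour: latch is forgotten
            else
                latches.put(id, true); // proposed behaviour: keep it, marked completed
        }

        /**
         * Steps 6-7: this node is now the coordinator and receives a re-sent ack.
         * Returns true if it can answer with a final ack.
         */
        boolean onResentAck(String id, String fromNode) {
            Boolean completed = latches.get(id);
            if (completed == null) {
                pendingAcks.add(fromNode); // unknown latch: the ack is only parked
                return false;
            }
            // In this simplified model only a completed latch can be answered.
            return completed;
        }

        /** The exchange has completed, so the latch can be dropped safely. */
        void onExchangeDone(String id) {
            latches.remove(id);
        }

        int pendingCount() {
            return pendingAcks.size();
        }
    }

    /** Runs steps 4-7 on the node that becomes the new coordinator. */
    static boolean simulate(boolean deleteOnFinalAck) {
        LatchRegistry newCrd = new LatchRegistry(deleteOnFinalAck);
        newCrd.createLatch("exchange-1");
        newCrd.onFinalAck("exchange-1");
        boolean answered = newCrd.onResentAck("exchange-1", "node-3");
        newCrd.onExchangeDone("exchange-1");
        return answered;
    }

    public static void main(String[] args) {
        System.out.println("delete on final ack: answered = " + simulate(true));
        System.out.println("delete on exchange completion: answered = " + simulate(false));
    }
}
```

In this model, simulate(true) returns false because the re-sent ack finds no latch and is parked, which corresponds to the hang; simulate(false) returns true because the completed latch is still present until the exchange finishes.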



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
