[ 
https://issues.apache.org/jira/browse/IGNITE-10374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16703302#comment-16703302
 ] 

Alexey Goncharuk commented on IGNITE-10374:
-------------------------------------------

[~sergey-chugunov], I think with this fix we introduced another race:
1) Let's assume that rebalancing has finished, we triggered a checkpoint, the 
checkpoint has finished as well, and we are about to invoke {{ownMoving()}}.
2) The BLT changes and more partitions in MOVING state are created on the local node.
3) The listener calls {{ownMoving()}} and marks all partitions, including the 
new ones, as OWNING.
4) The node restarts and we get desynced partitions.

I think a correct fix would be to acquire a group topology read lock inside the 
checkpoint listener, compare {{lastAffChangeVer}} with the version on which the 
rebalance finished, and own partitions only when {{lastAffChangeVer}} is not 
greater than the rebalance finish version.
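
A minimal sketch of that check, to make the ordering concrete. This is a standalone model, not Ignite code: the class {{OwnMovingSketch}}, its methods, and the use of a plain {{long}} as the affinity version are all assumptions made for illustration; only the names {{lastAffChangeVer}} and the MOVING/OWNING states come from the discussion above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical stand-in for a cache group topology; not Ignite's real classes. */
class OwnMovingSketch {
    enum State { MOVING, OWNING }

    /** Models the group topology lock. */
    private final ReentrantReadWriteLock topLock = new ReentrantReadWriteLock();

    /** Partition id -> state on the local node. */
    private final Map<Integer, State> parts = new HashMap<>();

    /** Version of the last affinity change (a plain long here, by assumption). */
    private long lastAffChangeVer;

    /** Models a topology change (e.g. a BLT change) that creates new MOVING partitions. */
    void onAffinityChange(long newVer, int... newMovingParts) {
        topLock.writeLock().lock();
        try {
            lastAffChangeVer = newVer;
            for (int p : newMovingParts)
                parts.put(p, State.MOVING);
        }
        finally {
            topLock.writeLock().unlock();
        }
    }

    /**
     * Models the checkpoint listener body: owns MOVING partitions only if
     * affinity has not changed since the rebalance finished.
     *
     * @return true if partitions were owned, false if the call was skipped.
     */
    boolean ownMovingIfUnchanged(long rebalanceFinishVer) {
        topLock.readLock().lock();
        try {
            // Affinity changed after rebalance finished: new MOVING partitions may exist.
            if (lastAffChangeVer > rebalanceFinishVer)
                return false;

            parts.replaceAll((p, s) -> s == State.MOVING ? State.OWNING : s);
            return true;
        }
        finally {
            topLock.readLock().unlock();
        }
    }

    /** Returns the current state of a partition, or null if it is unknown. */
    State state(int p) {
        topLock.readLock().lock();
        try {
            return parts.get(p);
        }
        finally {
            topLock.readLock().unlock();
        }
    }
}
```

In this model, if {{onAffinityChange(2, ...)}} runs between the rebalance finishing on version 1 and the listener firing, {{ownMovingIfUnchanged(1)}} returns false and nothing is owned, which is the behavior the proposed fix is after.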

[~ilantukh], do you think there is another way to fix this race? As far as I 
remember, we discussed an option to capture the list of partitions that were 
rebalanced. This may work here as well.
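
The alternative of capturing the rebalanced partitions could look roughly like the sketch below. Again this is a standalone model under assumptions: {{CapturedPartitionsSketch}} and its methods are hypothetical, and the snapshot simply treats every partition that is MOVING at rebalance finish as rebalanced.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of owning only the partitions captured at rebalance finish. */
class CapturedPartitionsSketch {
    enum State { MOVING, OWNING }

    /** Partition id -> state on the local node. */
    private final Map<Integer, State> parts = new HashMap<>();

    /** Models creation of a MOVING partition, e.g. after a BLT change. */
    synchronized void addMoving(int p) {
        parts.put(p, State.MOVING);
    }

    /** Snapshot taken when rebalancing finishes, before the checkpoint is triggered. */
    synchronized Set<Integer> captureRebalanced() {
        Set<Integer> res = new HashSet<>();

        for (Map.Entry<Integer, State> e : parts.entrySet()) {
            if (e.getValue() == State.MOVING)
                res.add(e.getKey());
        }

        return res;
    }

    /** Models the checkpoint listener body: owns only the captured partitions. */
    synchronized void ownCaptured(Set<Integer> captured) {
        for (int p : captured)
            parts.computeIfPresent(p, (k, s) -> s == State.MOVING ? State.OWNING : s);
    }

    /** Returns the current state of a partition, or null if it is unknown. */
    synchronized State state(int p) {
        return parts.get(p);
    }
}
```

Partitions that become MOVING after the snapshot are not in the captured set, so the listener leaves them untouched even without comparing versions.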

Also, it would be great to add a test for this case: I remember discussing 
this race before, and there is a good chance it was not spotted on 
review.

> Node doesn't own rebalanced partitions on rebalancing finished
> --------------------------------------------------------------
>
>                 Key: IGNITE-10374
>                 URL: https://issues.apache.org/jira/browse/IGNITE-10374
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Sergey Chugunov
>            Assignee: Sergey Chugunov
>            Priority: Critical
>             Fix For: 2.8
>
>
> Prerequisite: flag *IGNITE_DISABLE_WAL_DURING_REBALANCING* is set to true 
> (default value is false).
> Scenario:
> * Node joins the grid and starts rebalancing all cache groups from scratch 
> (e.g. all db files of the node were cleaned up during its downtime);
> * One or more client nodes join topology when rebalancing is in progress.
> Expected outcome:
> Rebalance finishes, node owns all received partitions, new affinity is 
> applied.
> Actual outcome:
> Rebalance finishes, but node doesn't own any of received partitions, no 
> affinity changes take place.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
