[
https://issues.apache.org/jira/browse/IGNITE-10374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16703302#comment-16703302
]
Alexey Goncharuk commented on IGNITE-10374:
-------------------------------------------
[~sergey-chugunov], I think with this fix we introduced another race:
1) Let's assume that rebalancing is finished, we triggered a checkpoint, the
checkpoint is finished as well and we are about to invoke {{ownMoving()}}
2) BLT changes and more partitions in MOVING state are created on the local node
3) The listener proceeds to {{ownMoving()}} and marks all MOVING partitions,
including the new ones, as OWNING
4) Node restarts and we get desynced partitions
I think a correct fix would be to acquire a group topology read lock inside the
checkpoint listener, compare {{lastAffChangeVer}} with the version on which the
rebalance finished, and own partitions only when {{lastAffChangeVer}} is not
greater than the rebalance finish version.
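To make the proposed check concrete, here is a rough, self-contained sketch. It is not Ignite code: {{TopologySketch}}, {{onAffinityChange}}, {{ownMovingIfUnchanged}} and the use of a plain long for the topology version are all assumptions made for illustration. Only {{lastAffChangeVer}} and the idea of owning under a topology read lock come from the comment above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical, simplified stand-in for a cache group topology; not the Ignite class. */
final class TopologySketch {
    enum State { MOVING, OWNING }

    /** Topology lock: affinity changes take the write lock, the listener takes the read lock. */
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    /** Partition id to state. */
    private final Map<Integer, State> parts = new ConcurrentHashMap<>();

    /** Version of the last affinity change (assumption: a plain long instead of a topology version object). */
    private long lastAffChangeVer;

    /** Models an affinity change (e.g. a BLT change) that creates new MOVING partitions. */
    void onAffinityChange(long newVer, int... newMovingParts) {
        lock.writeLock().lock();
        try {
            lastAffChangeVer = newVer;
            for (int p : newMovingParts)
                parts.put(p, State.MOVING);
        }
        finally {
            lock.writeLock().unlock();
        }
    }

    /**
     * Models the checkpoint listener. Partitions are owned only if affinity has not
     * changed since the version on which the rebalance finished.
     *
     * @return true if partitions were moved to OWNING, false if the call was skipped.
     */
    boolean ownMovingIfUnchanged(long rebalanceFinishVer) {
        lock.readLock().lock();
        try {
            if (lastAffChangeVer > rebalanceFinishVer)
                return false; // Newer MOVING partitions may exist; leave everything as is.

            parts.replaceAll((p, s) -> State.OWNING);
            return true;
        }
        finally {
            lock.readLock().unlock();
        }
    }

    State state(int part) {
        return parts.get(part);
    }
}
```

With this shape, the scenario from steps 1-4 ends with the stale listener call returning false and the new partitions staying in MOVING until a later rebalance finishes on the current version.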
[~ilantukh], do you think there is another way to fix this race? As far as I
remember, we discussed an option to capture the list of partitions that were
rebalanced. This may work here as well.
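For comparison, the "capture the list of rebalanced partitions" option could look roughly like the sketch below. Again, none of this is Ignite API: {{CapturedPartsSketch}}, {{captureMoving}} and {{ownCaptured}} are hypothetical names, and taking the snapshot from the set of currently MOVING partitions is an assumption about when and how the list would be captured.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: own only the partitions captured when the rebalance finished. */
final class CapturedPartsSketch {
    enum State { MOVING, OWNING }

    /** Partition id to state. */
    private final Map<Integer, State> parts = new ConcurrentHashMap<>();

    /** Models creation of a MOVING partition on the local node. */
    void addMoving(int part) {
        parts.put(part, State.MOVING);
    }

    /** Snapshot of partitions that are MOVING right now; assumed to be called when the rebalance finishes. */
    Set<Integer> captureMoving() {
        Set<Integer> captured = new HashSet<>();

        for (Map.Entry<Integer, State> e : parts.entrySet()) {
            if (e.getValue() == State.MOVING)
                captured.add(e.getKey());
        }

        return captured;
    }

    /** Models the checkpoint listener: only captured partitions go to OWNING. */
    void ownCaptured(Set<Integer> captured) {
        for (int p : captured)
            parts.replace(p, State.MOVING, State.OWNING);
    }

    State state(int part) {
        return parts.get(part);
    }
}
```

Partitions created by a later BLT change are not in the captured set, so the listener leaves them in MOVING without needing a version comparison.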
Also, it would be great to add a test for this case, because I remember
discussing this race before, and there was still a big chance it would not be
spotted on review.
> Node doesn't own rebalanced partitions on rebalancing finished
> --------------------------------------------------------------
>
> Key: IGNITE-10374
> URL: https://issues.apache.org/jira/browse/IGNITE-10374
> Project: Ignite
> Issue Type: Bug
> Reporter: Sergey Chugunov
> Assignee: Sergey Chugunov
> Priority: Critical
> Fix For: 2.8
>
>
> Prerequisite: flag *IGNITE_DISABLE_WAL_DURING_REBALANCING* is set to true
> (default value is false).
> Scenario:
> * Node joins the grid and starts rebalancing all cache groups from scratch
> (e.g. all db files of the node were cleaned up during its downtime);
> * One or more client nodes join topology when rebalancing is in progress.
> Expected outcome:
> Rebalance finishes, node owns all received partitions, new affinity is
> applied.
> Actual outcome:
> Rebalance finishes, but node doesn't own any of received partitions, no
> affinity changes take place.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)