[ https://issues.apache.org/jira/browse/IGNITE-13193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150228#comment-17150228 ]

Vladislav Pyatkov commented on IGNITE-13193:
--------------------------------------------

[~slava.koptilin] I left three comments in PR. Please look at those.

> Implement fallback to full partition rebalancing in case historical supplier
> failed to read all necessary data updates from WAL
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: IGNITE-13193
>                 URL: https://issues.apache.org/jira/browse/IGNITE-13193
>             Project: Ignite
>          Issue Type: Improvement
>    Affects Versions: 2.8.1
>            Reporter: Vyacheslav Koptilin
>            Assignee: Vyacheslav Koptilin
>            Priority: Major
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Historical rebalance may fail for several reasons:
> 1) WAL on supplier node is corrupted - the supplier will trigger a failure
> handler in the current implementation.
> 2) After iteration over WAL demander node didn't receive all updates to make
> MOVING partition up-to-date (resulting update counter didn't converge with
> expected update counter of OWNING partition) - demander will silently ignore
> lack of updates in the current implementation.
> Such behavior negatively affects the stability of the cluster: an
> inappropriate state of historical WAL is not a reason to fail a supplier node.
> The more proper way to handle this scenario is:
> - Either try to rebalance partition historically from another supplier
> - Or use full partition rebalance for problem partition
> Once the supplier fails to provide data from part of the WAL, its
> corresponding sequence of checkpoints should be marked as inapplicable for
> historical rebalance in order to prevent further errors.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)