Vyacheslav Koptilin created IGNITE-13193:
--------------------------------------------

             Summary: Implement fallback to full partition rebalancing in case 
historical supplier failed to read all necessary data updates from WAL
                 Key: IGNITE-13193
                 URL: https://issues.apache.org/jira/browse/IGNITE-13193
             Project: Ignite
          Issue Type: Improvement
    Affects Versions: 2.8.1
            Reporter: Vyacheslav Koptilin
            Assignee: Vyacheslav Koptilin


Historical rebalance may fail for several reasons:
1) The WAL on the supplier node is corrupted. In the current implementation, 
the supplier triggers the failure handler.
2) After iterating over the WAL, the demander node has not received all the 
updates needed to bring a MOVING partition up to date (the resulting update 
counter does not converge with the expected update counter of the OWNING 
partition). In the current implementation, the demander silently ignores the 
missing updates.
This behavior hurts the stability of the cluster: an inappropriate state of 
the historical WAL is not a reason to fail a supplier node.
A better way to handle this scenario is to:
 - either try to rebalance the partition historically from another supplier,
 - or fall back to full partition rebalancing for the problem partition.
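
The fallback decision described above could be sketched roughly as follows. 
This is only an illustration, not the Ignite implementation: the class 
RebalanceFallbackSketch, the Action enum and all method names are hypothetical, 
and treating "converged" as "resulting counter is at least the expected 
counter" is an assumption.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch; none of these names are Ignite APIs. */
final class RebalanceFallbackSketch {
    /** What the demander does next for a single partition. */
    enum Action { DONE, RETRY_HISTORICAL, FULL_REBALANCE }

    /**
     * Assumption: the MOVING partition is up to date once its resulting
     * update counter has reached the expected counter of the OWNING partition.
     */
    static boolean converged(long resultingCntr, long expectedCntr) {
        return resultingCntr >= expectedCntr;
    }

    /** Picks the first historical supplier other than the one that failed. */
    static Optional<String> anotherSupplier(List<String> candidates, String failed) {
        return candidates.stream().filter(n -> !n.equals(failed)).findFirst();
    }

    /**
     * Prefers a historical retry from another supplier and falls back to
     * full rebalancing when no other historical supplier is available.
     */
    static Action nextAction(long resultingCntr, long expectedCntr,
                             List<String> candidates, String failed) {
        if (converged(resultingCntr, expectedCntr))
            return Action.DONE;

        return anotherSupplier(candidates, failed).isPresent()
            ? Action.RETRY_HISTORICAL
            : Action.FULL_REBALANCE;
    }
}
```

The point of the sketch is that a shortfall in the update counter leads to an 
explicit next step instead of being silently ignored.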

Once a supplier fails to provide data from a part of the WAL, the corresponding 
sequence of checkpoints should be marked as inapplicable for historical 
rebalance in order to prevent further errors.
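
One possible way to track this on the supplier side is sketched below. The 
class CheckpointHistorySketch and its methods are hypothetical and do not 
correspond to existing Ignite classes. The sketch also assumes that 
checkpoints are identified by a timestamp and that a historical rebalance 
starting at a checkpoint replays the WAL through all later checkpoints.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical sketch; not an Ignite class. */
final class CheckpointHistorySketch {
    /** Checkpoint timestamp mapped to "usable for historical rebalance". */
    private final NavigableMap<Long, Boolean> applicable = new TreeMap<>();

    /** Registers a finished checkpoint as usable. */
    void onCheckpoint(long ts) {
        applicable.put(ts, Boolean.TRUE);
    }

    /** Marks every known checkpoint in [fromTs, toTs] as unusable. */
    void markInapplicable(long fromTs, long toTs) {
        for (Map.Entry<Long, Boolean> e
            : applicable.subMap(fromTs, true, toTs, true).entrySet())
            e.setValue(Boolean.FALSE);
    }

    /**
     * Assumption: replay from checkpoint ts needs ts itself and every
     * later checkpoint to be usable.
     */
    boolean canSupplyFrom(long ts) {
        return applicable.containsKey(ts)
            && !applicable.tailMap(ts, true).containsValue(Boolean.FALSE);
    }
}
```

With such a record, a supplier that already failed on a WAL segment would not 
be offered again as a historical supplier for ranges that cross that segment.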



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
