Vyacheslav Koptilin created IGNITE-13193:
--------------------------------------------
Summary: Implement fallback to full partition rebalancing in case
historical supplier failed to read all necessary data updates from WAL
Key: IGNITE-13193
URL: https://issues.apache.org/jira/browse/IGNITE-13193
Project: Ignite
Issue Type: Improvement
Affects Versions: 2.8.1
Reporter: Vyacheslav Koptilin
Assignee: Vyacheslav Koptilin
Historical rebalance may fail for several reasons:
1) The WAL on the supplier node is corrupted. In the current implementation,
the supplier triggers a failure handler.
2) After iterating over the WAL, the demander node has not received all the
updates needed to bring the MOVING partition up to date (the resulting update
counter does not match the expected update counter of the OWNING partition).
In the current implementation, the demander silently ignores the missing
updates.
This behavior hurts cluster stability: an unusable historical WAL is not a
sufficient reason to fail a supplier node.
A better way to handle these scenarios is to:
- either try to rebalance the partition historically from another supplier,
- or fall back to full rebalance for the affected partition.
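
The two options above could be combined into a single decision step on the demander side. The following is only a rough sketch to illustrate the idea, not the actual Ignite code: the class RebalanceFallbackSketch, the Plan type, and the methods converged and nextPlan are hypothetical names, suppliers are represented by plain strings, and treating "resulting counter >= expected counter" as convergence is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; none of these names are Ignite APIs. */
final class RebalanceFallbackSketch {
    /** How the demander should fetch the partition. */
    enum Mode { HISTORICAL, FULL }

    /** Result of one planning step; supplier is null for FULL (any owner may serve it). */
    static final class Plan {
        final Mode mode;
        final String supplier;

        Plan(Mode mode, String supplier) {
            this.mode = mode;
            this.supplier = supplier;
        }
    }

    /**
     * Checks whether WAL iteration delivered all expected updates.
     * Assumption: reaching or passing the OWNING partition's counter counts as converged.
     */
    static boolean converged(long resultingCntr, long expectedCntr) {
        return resultingCntr >= expectedCntr;
    }

    /**
     * Picks the first historical supplier that has not failed yet;
     * falls back to full rebalance when every candidate has failed.
     */
    static Plan nextPlan(List<String> historicalSuppliers, List<String> failedSuppliers) {
        for (String s : historicalSuppliers) {
            if (!failedSuppliers.contains(s))
                return new Plan(Mode.HISTORICAL, s);
        }

        return new Plan(Mode.FULL, null);
    }
}
```

In this sketch the demander would add a supplier to failedSuppliers whenever that supplier reports a WAL read error or converged(...) returns false, and then call nextPlan again for the affected partition.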
Once the supplier fails to provide data from part of the WAL, the corresponding
sequence of checkpoints should be marked as inapplicable for historical
rebalance in order to prevent further errors.
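
One possible way to track this on the supplier is sketched below. It is an illustration under stated assumptions, not the Ignite checkpoint history implementation: the class CheckpointHistorySketch and its methods are hypothetical, checkpoints are identified by a monotonically increasing long, and it is assumed that a read failure makes every checkpoint up to and including the last one covering the broken WAL range unusable as a starting point.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical sketch; this is not Ignite's checkpoint history class. */
final class CheckpointHistorySketch {
    /** Checkpoint id (assumed monotonic) -> whether it may start a historical rebalance. */
    private final NavigableMap<Long, Boolean> applicable = new TreeMap<>();

    /** Registers a newly finished checkpoint as applicable. */
    void onCheckpoint(long cpId) {
        applicable.put(cpId, Boolean.TRUE);
    }

    /**
     * Marks every checkpoint up to and including lastBrokenCpId as inapplicable.
     * Assumption: iterating the WAL from any of them would cross the unreadable range.
     */
    void markInapplicable(long lastBrokenCpId) {
        for (Map.Entry<Long, Boolean> e : applicable.headMap(lastBrokenCpId, true).entrySet())
            e.setValue(Boolean.FALSE);
    }

    /** Returns the earliest checkpoint still usable, or null if only full rebalance is possible. */
    Long earliestApplicable() {
        for (Map.Entry<Long, Boolean> e : applicable.entrySet()) {
            if (e.getValue())
                return e.getKey();
        }

        return null;
    }
}
```

With such a record, a later rebalance request that would need a checkpoint older than earliestApplicable() could be answered with full rebalance right away instead of hitting the same WAL read error again.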
--
This message was sent by Atlassian Jira
(v8.3.4#803005)