[ https://issues.apache.org/jira/browse/IGNITE-13193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150228#comment-17150228 ]
Vladislav Pyatkov commented on IGNITE-13193:
--------------------------------------------
[~slava.koptilin] I left three comments in the PR.
Please take a look at them.
> Implement fallback to full partition rebalancing in case historical supplier
> failed to read all necessary data updates from WAL
> -------------------------------------------------------------------------------------------------------------------------------
>
> Key: IGNITE-13193
> URL: https://issues.apache.org/jira/browse/IGNITE-13193
> Project: Ignite
> Issue Type: Improvement
> Affects Versions: 2.8.1
> Reporter: Vyacheslav Koptilin
> Assignee: Vyacheslav Koptilin
> Priority: Major
> Time Spent: 20m
> Remaining Estimate: 0h
>
> Historical rebalance may fail for several reasons:
> 1) The WAL on the supplier node is corrupted. In the current implementation,
> the supplier triggers the failure handler.
> 2) After iterating over the WAL, the demander node has not received all the
> updates needed to bring the MOVING partition up to date (the resulting update
> counter does not converge with the expected update counter of the OWNING
> partition). In the current implementation, the demander silently ignores the
> missing updates.
> Such behavior negatively affects the stability of the cluster: an
> inappropriate state of the historical WAL is not a reason to fail a supplier
> node.
> A more appropriate way to handle this scenario is to:
> - either try to rebalance the partition historically from another supplier,
> - or use full partition rebalance for the problem partition.
> Once the supplier fails to provide data from part of the WAL, the
> corresponding sequence of checkpoints should be marked as inapplicable for
> historical rebalance in order to prevent further errors.
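
The fallback described above could be sketched roughly as follows. This is only an illustration of the proposed control flow, not code from Ignite or from the PR: the names RebalanceFallbackSketch, WalReader, Result, rebalance and the "inapplicable" set are all hypothetical, and marking a whole supplier as inapplicable is a simplification of marking a sequence of checkpoints.

```java
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of the proposed fallback; none of these names are Ignite APIs. */
public class RebalanceFallbackSketch {
    /** How the partition ended up being rebalanced. */
    public enum Mode { HISTORICAL, FULL }

    /** Assumed stand-in for a supplier: replays WAL updates and returns the counter reached. */
    @FunctionalInterface
    public interface WalReader {
        long replay(long fromCntr, long toCntr) throws Exception;
    }

    /** Outcome of one rebalance attempt for a single partition. */
    public static final class Result {
        public final Mode mode;
        public final String supplierId; // null when falling back to full rebalance

        Result(Mode mode, String supplierId) {
            this.mode = mode;
            this.supplierId = supplierId;
        }
    }

    /**
     * Tries each supplier historically, in iteration order. A supplier that throws
     * (e.g. corrupted WAL) or whose updates do not reach the expected counter is
     * recorded in {@code inapplicable} and skipped from then on. If no supplier
     * succeeds, the partition falls back to full rebalance.
     */
    public static Result rebalance(long fromCntr, long expectedCntr,
                                   Map<String, WalReader> suppliers,
                                   Set<String> inapplicable) {
        for (Map.Entry<String, WalReader> e : suppliers.entrySet()) {
            if (inapplicable.contains(e.getKey()))
                continue; // Already known to be unusable for historical rebalance.

            try {
                long reached = e.getValue().replay(fromCntr, expectedCntr);

                // Do not silently accept a counter that has not converged.
                if (reached == expectedCntr)
                    return new Result(Mode.HISTORICAL, e.getKey());
            }
            catch (Exception ex) {
                // Reading the WAL failed: do not fail the supplier node, try the next one.
            }

            inapplicable.add(e.getKey());
        }

        return new Result(Mode.FULL, null);
    }
}
```

With three suppliers where the first throws, the second stops one update short and the third converges, rebalance(...) would return HISTORICAL from the third supplier and leave the first two in the inapplicable set; with no usable supplier it would return FULL.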
--
This message was sent by Atlassian Jira
(v8.3.4#803005)