[jira] [Updated] (IGNITE-17793) Historical rebalance must use HWM instead of LWM to seek the proper checkpoint

Anton Vinogradov (Jira) Fri, 30 Sep 2022 11:37:09 -0700


     [ 
https://issues.apache.org/jira/browse/IGNITE-17793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Anton Vinogradov updated IGNITE-17793:
--------------------------------------
    Description: 
Currently, historical rebalance at 
{{CheckpointHistory#searchEarliestWalPointer}} seeks the newest checkpoint 
whose counter is less than the lowest entry that has to be rebalanced.

Unfortunately,

1) We may have more than one checkpoint with the same counter, and it's 
impossible to use the newest one as a rebalance start point.

For example, we have a partition with LWM=100, some gaps and HWM=200.
A checkpoint will have counter == 100.
Then we may close some gaps, excluding 101 (to keep LWM == 100).
And again, the checkpoint will have counter == 100.
The newest checkpoint marked with counter 100 will not contain all committed 
entries with counter > 100.
And after the rebalance finishes, we'll see a warning "Some partition entries 
were missed during historical rebalance" and an inconsistent cluster state.
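
To make the example concrete, here is a minimal sketch of a counter with gaps. It is a toy model, not Ignite's partition update counter: the class GapCounterSketch and its methods are hypothetical names introduced only for illustration. It shows that closing every gap except 101 leaves LWM at 100, so two consecutive checkpoints would record the same counter.

```java
import java.util.TreeSet;

/** Toy model of an update counter with gaps; NOT Ignite's implementation. */
final class GapCounterSketch {
    /** Low watermark: every update with counter <= lwm has been applied. */
    private long lwm;

    /** High watermark: the highest counter seen so far. */
    private long hwm;

    /** Out-of-order updates applied above the low watermark. */
    private final TreeSet<Long> applied = new TreeSet<>();

    public GapCounterSketch(long lwm, long hwm) {
        this.lwm = lwm;
        this.hwm = hwm;
    }

    /** Applies one update and advances LWM over any gap that is now closed. */
    public void apply(long cntr) {
        if (cntr <= lwm)
            return;

        applied.add(cntr);
        hwm = Math.max(hwm, cntr);

        // LWM moves only while the next counter in sequence is present.
        while (applied.remove(lwm + 1))
            lwm++;
    }

    public long lwm() {
        return lwm;
    }

    public long hwm() {
        return hwm;
    }

    /** Returns {counter at first checkpoint, counter at second checkpoint, HWM}. */
    public static long[] demo() {
        GapCounterSketch c = new GapCounterSketch(100, 200);

        long firstCp = c.lwm(); // first checkpoint is marked with LWM

        // Close every gap except 101.
        for (long i = 102; i <= 200; i++)
            c.apply(i);

        long secondCp = c.lwm(); // second checkpoint gets the same mark

        return new long[] {firstCp, secondCp, c.hwm()};
    }
}
```

In this model both checkpoints are marked with 100 although updates 102..200 were applied between them, which is why the newer one cannot serve as a start point for those entries.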

2) After a cluster restart with counters finalization, we may face a 
situation where we have checkpoints before some counter but none of them can be 
used for rebalancing.

For example, we, again, have a partition with LWM=100, some gaps and HWM=200.
The cluster is restarted and the first checkpoint is marked at counter == 100.
Then we are able to finalize the counters to make historical rebalance possible.
But this single checkpoint does not contain some committed entries with counter 
> 100, only rollback entries after the finalization.
Every entry inserted after the restart but before finalization may be lost.

A possible solution is to use HWM instead of LWM during the search.
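
The sketch below is one possible reading of this proposal, not the actual change to CheckpointHistory. It assumes each checkpoint records both watermarks for the partition; the class CheckpointSearchSketch, the Cp record layout, the method names and the extra checkpoint cp0 are all hypothetical. It compares a search keyed by LWM with a search keyed by HWM for the LWM=100 / HWM=200 example.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model of the checkpoint search; names are hypothetical, not Ignite's API. */
final class CheckpointSearchSketch {
    /** Per-partition watermarks recorded at checkpoint time (assumed layout). */
    static final class Cp {
        final String id;
        final long lwm;
        final long hwm;

        Cp(String id, long lwm, long hwm) {
            this.id = id;
            this.lwm = lwm;
            this.hwm = hwm;
        }
    }

    /** Newest checkpoint whose LWM is below the first counter to rebalance. */
    static Cp searchByLwm(List<Cp> history, long firstMissing) {
        for (int i = history.size() - 1; i >= 0; i--) {
            if (history.get(i).lwm < firstMissing)
                return history.get(i);
        }

        return null;
    }

    /** Newest checkpoint whose HWM is below the first counter to rebalance. */
    static Cp searchByHwm(List<Cp> history, long firstMissing) {
        for (int i = history.size() - 1; i >= 0; i--) {
            if (history.get(i).hwm < firstMissing)
                return history.get(i);
        }

        return null; // no usable checkpoint: historical rebalance is not possible
    }

    /** History ordered oldest to newest, following the LWM=100 / HWM=200 example. */
    static List<Cp> sampleHistory() {
        return Arrays.asList(
            new Cp("cp0", 100, 100), // assumed: taken before any update above 100
            new Cp("cp1", 100, 200), // gaps present
            new Cp("cp2", 100, 200)  // some gaps closed, 101 still missing
        );
    }
}
```

With firstMissing = 101, the LWM search picks cp2, which was taken after some entries above 100 had already been applied. The HWM search picks cp0, before any such entry, and returns null when cp0 is absent, which corresponds to case 2 where no checkpoint can be used.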

  was:
Currently, historical rebalance at 
{{CheckpointHistory#searchEarliestWalPointer}} seeks the newest checkpoint 
whose counter is less than the lowest entry that has to be rebalanced.

Unfortunately, 

1) We may have more than one checkpoint with the same counter, and it's 
impossible to use the newest one as a rebalance start point.

For example, we have a partition with LWM=100, some gaps and HWM=200.
A checkpoint will have counter == 100.
Then we may close some gaps, excluding 101 (to keep LWM == 100).
And again, the checkpoint will have counter == 100.
The newest checkpoint marked with counter 100 will not contain all committed 
entries with counter > 100.
And after the rebalance finishes, we'll see a warning "Some partition entries 
were missed during historical rebalance" and an inconsistent cluster state.

2) After a cluster restart, we may face a situation where we have checkpoints 
before some counter but none of them can be used for rebalancing.

For example, we, again, have a partition with LWM=100, some gaps and HWM=200.
The cluster is restarted and the first checkpoint is marked at counter == 100.
But this single checkpoint does not contain some committed entries with counter 
> 100.

A possible solution is to use HWM instead of LWM during the search.


> Historical rebalance must use HWM instead of LWM to seek the proper checkpoint
> ------------------------------------------------------------------------------
>
>                 Key: IGNITE-17793
>                 URL: https://issues.apache.org/jira/browse/IGNITE-17793
>             Project: Ignite
>          Issue Type: Sub-task
>            Reporter: Anton Vinogradov
>            Priority: Major
>              Labels: iep-31, ise



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
