[jira] [Updated] (IGNITE-17738) Historical rebalance must be able to fix the consistency on cluster restart by itself

Anton Vinogradov (Jira) Thu, 22 Sep 2022 08:05:30 -0700


     [ https://issues.apache.org/jira/browse/IGNITE-17738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Anton Vinogradov updated IGNITE-17738:
--------------------------------------
    Description: 
On cluster restart (after a power-off, OOM, or some other failure), the PDS 
may be inconsistent: primary partitions may contain operations that are 
missing on backups.

Currently, "historical rebalance" is able to sync the data to the highest LWM 
for every partition. 
Most likely, a primary will be chosen as a rebalance source, but the data after 
the LWM will not be rebalanced. So, all updates between LWM and HWM will not be 
synchronized.

A possible solution for the case when the cluster failed and restarted (same 
baseline) is to fix counters to help "historical rebalance" perform the sync.

Counters should be set to:
 - HWM on the primary and LWM on backups, for caches with 2+ backups,
 - LWM on the primary and HWM on backups, for caches with a single backup.

Possible solutions:
 * This can be implemented as an extension of the `-consistency finalize` 
command, for example `-consistency finalize-on-restart`, or
 * Counters can be finalized automatically when the cluster composition is 
equal to the baseline specified before the crash (preferred).

  was:
On cluster restart (after a power-off, OOM, or some other failure), the PDS 
may be inconsistent: primary partitions may contain operations that are 
missing on backups.

Currently, "historical rebalance" is able to sync the data to the highest LWM 
for every partition. 
Most likely, a primary will be chosen as a rebalance source, but the data after 
the LWM will not be rebalanced. So, all updates between LWM and HWM will not be 
synchronized.

A possible solution for the case when the cluster failed and restarted (same 
baseline) is to fix counters to help "historical rebalance" perform the sync.

Counters should be set to:
 - HWM on the primary and LWM on backups, for caches with 2+ backups,
 - LWM on the primary and HWM on backups, for caches with a single backup.

This can be implemented as an extension of the `-consistency finalize` 
command, for example `-consistency finalize-on-restart`.


> Historical rebalance must be able to fix the consistency on cluster restart 
> by itself
> -------------------------------------------------------------------------------------
>
>                 Key: IGNITE-17738
>                 URL: https://issues.apache.org/jira/browse/IGNITE-17738
>             Project: Ignite
>          Issue Type: Sub-task
>            Reporter: Anton Vinogradov
>            Priority: Major
>              Labels: iep-31, ise
>
> On cluster restart (after a power-off, OOM, or some other failure), the PDS 
> may be inconsistent: primary partitions may contain operations that are 
> missing on backups.
> Currently, "historical rebalance" is able to sync the data to the highest LWM 
> for every partition. 
> Most likely, a primary will be chosen as a rebalance source, but the data 
> after the LWM will not be rebalanced. So, all updates between LWM and HWM 
> will not be synchronized.
> A possible solution for the case when the cluster failed and restarted (same 
> baseline) is to fix counters to help "historical rebalance" perform the sync.
> Counters should be set to:
>  - HWM on the primary and LWM on backups, for caches with 2+ backups,
>  - LWM on the primary and HWM on backups, for caches with a single backup.
> Possible solutions:
>  * This can be implemented as an extension of the `-consistency finalize` 
> command, for example `-consistency finalize-on-restart`, or
>  * Counters can be finalized automatically when the cluster composition is 
> equal to the baseline specified before the crash (preferred).
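
The counter rule and the "same baseline" precondition from the description 
can be sketched roughly as below. This is only an illustration of one literal 
reading of the rule, not Ignite code: the class CounterFinalizeSketch, the 
Role enum, and the methods finalizedCounter and sameAsBaseline are all 
hypothetical names, and each partition copy is assumed to expose its own LWM 
and HWM as plain long values.

```java
import java.util.Set;

/** Illustrative sketch only; none of these names exist in Ignite. */
public class CounterFinalizeSketch {
    /** Hypothetical role of a partition copy on a node. */
    enum Role { PRIMARY, BACKUP }

    /**
     * Returns the value a partition copy's counter would be set to on restart,
     * following the rule from the description:
     * 2+ backups: HWM on the primary, LWM on backups;
     * single backup: LWM on the primary, HWM on backups.
     */
    static long finalizedCounter(Role role, int backups, long lwm, long hwm) {
        if (backups < 1)
            throw new IllegalArgumentException("The rule covers caches with at least one backup");

        if (lwm > hwm)
            throw new IllegalArgumentException("LWM must not exceed HWM");

        boolean primary = role == Role.PRIMARY;

        if (backups >= 2)
            return primary ? hwm : lwm;

        return primary ? lwm : hwm;
    }

    /**
     * Assumed precondition for automatic finalization: the set of nodes that
     * came back online equals the baseline recorded before the crash.
     */
    static boolean sameAsBaseline(Set<String> baselineNodeIds, Set<String> onlineNodeIds) {
        return baselineNodeIds.equals(onlineNodeIds);
    }

    public static void main(String[] args) {
        // Example values: LWM = 5, HWM = 9 on every copy of the partition.
        System.out.println("2 backups, primary: " + finalizedCounter(Role.PRIMARY, 2, 5, 9));
        System.out.println("2 backups, backup:  " + finalizedCounter(Role.BACKUP, 2, 5, 9));
        System.out.println("1 backup, primary:  " + finalizedCounter(Role.PRIMARY, 1, 5, 9));
        System.out.println("1 backup, backup:   " + finalizedCounter(Role.BACKUP, 1, 5, 9));
    }
}
```

In the automatic variant, a check like sameAsBaseline would presumably gate 
the counter update, so that counters are only touched when the restarted 
cluster matches the pre-crash baseline.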



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
