[
https://issues.apache.org/jira/browse/IGNITE-17738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Anton Vinogradov updated IGNITE-17738:
--------------------------------------
Description:
On cluster restart (because of a power-off, OOM, or some other problem), the
PDS may be left inconsistent (primary partitions may contain operations that
are missing on backups).
Currently, "historical rebalance" can sync the data only up to the highest LWM
for every partition.
Most likely, a primary will be chosen as the rebalance source, but the data
after the LWM will not be rebalanced. So, all updates between the LWM and the
HWM will not be synchronized.
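To make the gap concrete, here is a minimal sketch of a simplified model, not Ignite code. The `Counter` class and the `unsynced_range` function are hypothetical names introduced only for illustration, and the model assumes each partition copy exposes a single (LWM, HWM) pair.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counter:
    """Simplified update counter of one partition copy (hypothetical model)."""
    lwm: int  # low watermark
    hwm: int  # high watermark


def unsynced_range(copies):
    """Return (highest LWM, highest HWM) if some updates lie above the sync target.

    Assumption: historical rebalance brings every copy up to the highest LWM
    and ignores anything above it, as described in this issue.
    """
    target = max(c.lwm for c in copies.values())
    top = max(c.hwm for c in copies.values())
    return (target, top) if top > target else None


# The primary has updates above its LWM that the backup never received.
copies = {"primary": Counter(lwm=10, hwm=14), "backup": Counter(lwm=8, hwm=8)}
gap = unsynced_range(copies)
```

In this model the backup is brought up to 10, and the updates in the range above 10 and up to 14 are the ones that stay unsynchronized.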
A possible solution for the case when the cluster failed and restarted (same
baseline) is to fix the counters so that "historical rebalance" can perform
the sync.
Counters should be set to
- HWM at the primary and LWM at the backups for caches with 2+ backups,
- LWM at the primary and HWM at the backups for caches with a single backup.
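The rule above can be sketched as a small function. This is an illustration under stated assumptions, not the Ignite implementation: `Counter` and `finalized_value` are hypothetical names, and the sketch assumes that "set as HWM/LWM" means collapsing each copy's counter to its own HWM or LWM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counter:
    """Simplified update counter of one partition copy (hypothetical model)."""
    lwm: int
    hwm: int


def finalized_value(counter, is_primary, backups):
    """Pick the value a copy's counter is finalized to, following the rule above."""
    if backups < 1:
        raise ValueError("the rule only covers caches with at least one backup")
    if backups >= 2:
        # 2+ backups: HWM at the primary, LWM at the backups.
        use_hwm = is_primary
    else:
        # Single backup: LWM at the primary, HWM at the backup.
        use_hwm = not is_primary
    return counter.hwm if use_hwm else counter.lwm


primary = Counter(lwm=10, hwm=14)
backup = Counter(lwm=8, hwm=9)
two_backups = (finalized_value(primary, True, 2), finalized_value(backup, False, 2))
one_backup = (finalized_value(primary, True, 1), finalized_value(backup, False, 1))
```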
Possible solutions:
* This can be implemented as an extension of the `-consistency finalize`
command, for example `-consistency finalize-on-restart`, or
* Counters can be finalized automatically when the cluster composition is equal
to the baseline specified before the crash (preferred).
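The precondition for the automatic variant could look roughly like the check below. It is only a sketch: `can_auto_finalize` is a hypothetical name, and it assumes nodes are compared by a stable identifier such as a consistent ID.

```python
def can_auto_finalize(baseline_before_crash, online_nodes):
    """Return True when the restarted cluster consists of exactly the baseline nodes.

    Both arguments are iterables of node identifiers; order is ignored.
    """
    return set(baseline_before_crash) == set(online_nodes)


baseline = ["node-a", "node-b", "node-c"]
all_back = can_auto_finalize(baseline, ["node-c", "node-a", "node-b"])
one_missing = can_auto_finalize(baseline, ["node-a", "node-b"])
```

With a node missing or an extra node present, the check fails and the counters would be left untouched.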
was:
On cluster restart (because of power-off, OOM or some other problem) it's
possible to have PDS inconsistent (primary partitions may contain operations
missed on backups).
Currently, "historical rebalance" is able to sync the data to the highest LWM
for every partition.
Most likely, a primary will be chosen as a rebalance source, but the data after
the LWM will not be rebalanced. So, all updates between LWM and HWM will not be
synchronized.
A possible solution for the case when the cluster failed and restarted (same
baseline) is to fix counters to help "historical rebalance" perform the sync.
Counters should be set as
- HWM at primary and as LWM at backups for caches with 2+ backups,
- LWM at primary and as HWM at backups for caches with a single backup.
This can be implemented as an extension for the `-consistency finalize`
command, for example `-consistency finalize-on-restart`.
> Historical rebalance must be able to fix the consistency on cluster restart
> by itself
> -------------------------------------------------------------------------------------
>
> Key: IGNITE-17738
> URL: https://issues.apache.org/jira/browse/IGNITE-17738
> Project: Ignite
> Issue Type: Sub-task
> Reporter: Anton Vinogradov
> Priority: Major
> Labels: iep-31, ise
>
> On cluster restart (because of a power-off, OOM, or some other problem), the
> PDS may be left inconsistent (primary partitions may contain operations that
> are missing on backups).
> Currently, "historical rebalance" can sync the data only up to the highest LWM
> for every partition.
> Most likely, a primary will be chosen as the rebalance source, but the data
> after the LWM will not be rebalanced. So, all updates between the LWM and the
> HWM will not be synchronized.
> A possible solution for the case when the cluster failed and restarted (same
> baseline) is to fix the counters so that "historical rebalance" can perform
> the sync.
> Counters should be set to
> - HWM at the primary and LWM at the backups for caches with 2+ backups,
> - LWM at the primary and HWM at the backups for caches with a single backup.
> Possible solutions:
> * This can be implemented as an extension of the `-consistency finalize`
> command, for example `-consistency finalize-on-restart`, or
> * Counters can be finalized automatically when the cluster composition is equal
> to the baseline specified before the crash (preferred).