[ https://issues.apache.org/jira/browse/IGNITE-26168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mikhail Petrov updated IGNITE-26168:
------------------------------------
    Ignite Flags: Release Notes Required  (was: Docs Required,Release Notes Required)

> Enhance partition loss detection between cluster restarts
> ----------------------------------------------------------
>
>                 Key: IGNITE-26168
>                 URL: https://issues.apache.org/jira/browse/IGNITE-26168
>             Project: Ignite
>          Issue Type: Task
>            Reporter: Mikhail Petrov
>            Assignee: Mikhail Petrov
>            Priority: Major
>          Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> The problem is based on a real-world scenario:
> 1. A cluster with PDS enabled is deactivated and stopped gracefully.
> 2. Some physical servers are replaced and their PDS is cleared during
> maintenance (this may also happen unintentionally or because of hardware
> issues).
> 3. The replaced servers are all the primary and backup nodes for some
> partitions (a cell). As a result, the data is lost.
> 4. The cluster is restarted.
> 5. The idle verify procedure completes successfully.
> 6. The cluster is activated successfully.
>
> As a result, Ignite continues to work after the restart, but some of the
> data has simply disappeared. Ignite users see no warnings, and the data loss
> may be detected only accidentally, some time later.
>
> The described situation can be handled safely by replacing the nodes one by
> one and waiting for the rebalancing to complete.
> But, as mentioned in clause 2, PDS data can be lost for different reasons.
>
> Currently, Ignite supports a mechanism for detecting lost partitions, which
> is designed to restrict cache operations when some cache partitions are lost
> (because a node left or failed). But its behaviour is not consistent across
> cluster restarts and activation/deactivation.
>
> Consider a cluster with PDS enabled. The following list shows the possible
> scenarios in which all partition owners (primary and backups) leave the
> cluster.
> 1. activation -> cell left -> lost parts
> 2. activation -> cell left -> cell joined -> lost parts
> 3. activation -> cell left -> deactivation -> cell joined -> activation ->
> ignored
> 4. activation -> cell left -> cell joined -> deactivation -> activation ->
> lost parts
> 5. activation -> cell left -> deactivation -> activation -> cell joined ->
> lost parts
> 6. deactivation -> cell left -> cell joined -> activation -> ignored
> 7. deactivation -> cell left -> activation -> cell joined -> lost parts
>
> cell - a node group that stores all primary and backup copies of some
> partitions. Can be configured via ClusterNodeAttributeColocatedBackupFilter.
>
> lost parts - Ignite detected lost partitions. Cache operations are
> restricted according to the partition loss policy.
>
> ignored - no partition loss is detected. If the cell nodes join the cluster
> with their PDS data cleared, Ignite will not detect the partition loss; it
> just recreates the missing partitions.
>
> deactivation - this can also be read as a cluster stop after deactivation
> followed by a cluster start before activation.
>
> It is proposed to fix Ignite so that it detects lost partitions for clauses
> 3 and 6.
> Note that only the case when the cluster is stopped gracefully is considered.
>
> The main idea:
> 1. During the PME caused by deactivation, aggregate on the coordinator the
> partition info and the list of lost partitions from all nodes.
> 2. Distribute the aggregated information using the PME Full Message and
> store it in each node's local metastorage.
> 3. During activation, use the stored info to detect lost partitions. If some
> partitions have zero update counters in the received single messages, but
> according to the saved partition info they were updated, mark them as lost.
>
> Partition Info includes a list of partition IDs that were not initialized
> (update counter == 0; this is crucial because currently Ignite can't
> distinguish between a partition that was never updated and a partition that
> was deleted between restarts) and a list of partition IDs that were marked
> as lost at the time of deactivation.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
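
The activation-time check from step 3 of the main idea can be sketched in plain Java. This is only an illustration of the described rule, not Ignite code: the class `LostPartitionDetectionSketch`, the `SavedPartitionInfo` holder and the `detectLost` method are hypothetical names, and the per-partition counter map stands in for whatever is extracted from the received single messages. Carrying over the partitions that were already lost at deactivation is an assumption about how the saved list would be used.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of the proposed check; none of these names are Ignite APIs. */
class LostPartitionDetectionSketch {
    /** Partition info as it might be saved to the metastorage during deactivation. */
    public static final class SavedPartitionInfo {
        /** Partitions whose update counter was 0 at deactivation (never updated). */
        final Set<Integer> uninitialized;

        /** Partitions that were already marked as lost at deactivation. */
        final Set<Integer> lost;

        public SavedPartitionInfo(Set<Integer> uninitialized, Set<Integer> lost) {
            this.uninitialized = new HashSet<>(uninitialized);
            this.lost = new HashSet<>(lost);
        }
    }

    /**
     * @param partitions Total number of partitions in the cache group.
     * @param saved Info saved at deactivation.
     * @param maxCounters Highest update counter reported for each partition at
     *      activation; a missing entry is treated as 0.
     * @return Sorted IDs of partitions that should be marked as lost.
     */
    public static Set<Integer> detectLost(int partitions, SavedPartitionInfo saved,
        Map<Integer, Long> maxCounters) {
        // Assumption: partitions lost before deactivation stay lost.
        Set<Integer> lost = new TreeSet<>(saved.lost);

        for (int p = 0; p < partitions; p++) {
            long cntr = maxCounters.getOrDefault(p, 0L);
            boolean wasUpdated = !saved.uninitialized.contains(p);

            // Zero counter now, but the partition had updates before: data is gone.
            if (cntr == 0 && wasUpdated)
                lost.add(p);
        }

        return lost;
    }

    public static void main(String[] args) {
        // Partition 3 was never updated; partition 1 was lost before deactivation.
        SavedPartitionInfo saved = new SavedPartitionInfo(
            new HashSet<>(Arrays.asList(3)), new HashSet<>(Arrays.asList(1)));

        // Counters seen at activation: partition 2 came back empty.
        Map<Integer, Long> counters = new HashMap<>();
        counters.put(0, 5L);
        counters.put(1, 7L);
        counters.put(2, 0L);

        System.out.println(detectLost(4, saved, counters)); // prints [1, 2]
    }
}
```

In this sketch partition 3 is not reported because it was recorded as uninitialized at deactivation. This is why the saved info has to include the uninitialized list: a zero counter alone does not show whether a partition was never updated or was wiped between restarts.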