[
https://issues.apache.org/jira/browse/IGNITE-17737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mikhail Petrov updated IGNITE-17737:
------------------------------------
Fix Version/s: 2.15
> Cluster snapshots may be inconsistent under load.
> --------------------------------------------------
>
> Key: IGNITE-17737
> URL: https://issues.apache.org/jira/browse/IGNITE-17737
> Project: Ignite
> Issue Type: Bug
> Reporter: Nikita Amelchev
> Assignee: Mikhail Petrov
> Priority: Major
> Labels: ise
> Fix For: 2.15
>
> Attachments: SnapshotTest.java
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> Cluster snapshots may be inconsistent under load.
> Reproducer:
> One thread performs a transactional load: cache#put into a transactional cache.
> Another thread periodically creates snapshots and checks them.
> The reproducer is attached (it is flaky, so it may need several runs to fail).
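> For reference, the structure of the attached reproducer is roughly as follows.
> This is only a sketch: the cache name, key range, node count and snapshot count
> are made up, and snapshots additionally require native persistence to be enabled
> (omitted here); the real test is in the attached SnapshotTest.java.
> {code:java}
> import java.util.concurrent.atomic.AtomicBoolean;
>
> import org.apache.ignite.Ignite;
> import org.apache.ignite.IgniteCache;
> import org.apache.ignite.Ignition;
> import org.apache.ignite.cache.CacheAtomicityMode;
> import org.apache.ignite.configuration.CacheConfiguration;
>
> public class SnapshotUnderLoadSketch {
>     public static void main(String[] args) throws Exception {
>         // Single server node for brevity; the attached test starts several
>         // (consistentIds SnapshotTest0..2 in the log below). Native persistence
>         // must be enabled for snapshots (configuration omitted).
>         Ignite ignite = Ignition.start();
>
>         IgniteCache<Integer, Integer> cache = ignite.getOrCreateCache(
>             new CacheConfiguration<Integer, Integer>("tx-cache")
>                 .setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL)
>                 .setBackups(2));
>
>         AtomicBoolean stop = new AtomicBoolean();
>
>         // Load thread: continuous transactional cache#put operations.
>         Thread loader = new Thread(() -> {
>             for (int i = 0; !stop.get(); i++)
>                 cache.put(i % 1_000, i);
>         });
>
>         loader.start();
>
>         // Snapshot loop on the main thread: periodic snapshots taken while the
>         // load is running. The consistency check of each snapshot (partition
>         // hashes and update counters, as in the log below) is performed by the
>         // attached test.
>         for (int n = 0; n < 10; n++)
>             ignite.snapshot().createSnapshot("snap-" + n).get();
>
>         stop.set(true);
>         loader.join();
>         ignite.close();
>     }
> }
> {code}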
> Example of a fail:
> {noformat}
> [2022-09-21T19:35:51,158][WARN ][async-runnable-runner-1][] The check
> procedure has failed, conflict partitions has been found:
> [counterConflicts=1, hashConflicts=1]
> [2022-09-21T19:35:51,158][WARN ][async-runnable-runner-1][] Update counter
> conflicts:
> [2022-09-21T19:35:51,158][WARN ][async-runnable-runner-1][] Conflict
> partition: PartitionKeyV2 [grpId=1544803905, grpName=default, partId=432]
> [2022-09-21T19:35:51,159][WARN ][async-runnable-runner-1][] Partition
> instances: [PartitionHashRecordV2 [isPrimary=false,
> consistentId=snapshot.SnapshotTest2, updateCntr=21, partitionState=OWNING,
> size=19, partHash=1245894112], PartitionHashRecordV2 [isPrimary=false,
> consistentId=snapshot.SnapshotTest0, updateCntr=22, partitionState=OWNING,
> size=20, partHash=1705601802], PartitionHashRecordV2 [isPrimary=false,
> consistentId=snapshot.SnapshotTest1, updateCntr=21, partitionState=OWNING,
> size=19, partHash=1245894112]]
> [2022-09-21T19:35:51,159][WARN ][async-runnable-runner-1][]
> [2022-09-21T19:35:51,159][WARN ][async-runnable-runner-1][] Hash conflicts:
> [2022-09-21T19:35:51,159][WARN ][async-runnable-runner-1][] Conflict
> partition: PartitionKeyV2 [grpId=1544803905, grpName=default, partId=432]
> [2022-09-21T19:35:51,159][WARN ][async-runnable-runner-1][] Partition
> instances: [PartitionHashRecordV2 [isPrimary=false,
> consistentId=snapshot.SnapshotTest2, updateCntr=21, partitionState=OWNING,
> size=19, partHash=1245894112], PartitionHashRecordV2 [isPrimary=false,
> consistentId=snapshot.SnapshotTest0, updateCntr=22, partitionState=OWNING,
> size=20, partHash=1705601802], PartitionHashRecordV2 [isPrimary=false,
> consistentId=snapshot.SnapshotTest1, updateCntr=21, partitionState=OWNING,
> size=19, partHash=1245894112]]
> [2022-09-21T19:35:51,159][WARN ][async-runnable-runner-1][]
> {noformat}
> The following sequence of steps can lead to this behaviour (a code sketch of the
> timeline is given after the list):
> 1. A client node starts a cache key update operation on the current topology
> version.
> 2. Simultaneously, a snapshot operation is started. It causes a PME (free
> switch) which increments the current minor topology version.
> 3. The node that is primary for the key being updated completes the PME locally,
> starts the snapshot partition copy procedure and proceeds with the update
> request, ignoring the fact that it was initiated on the stale topology (see
> IGNITE-9558). Therefore, the primary node does not include the updated key in
> the snapshot.
> 4. The backup nodes have not yet completed the PME, so the snapshot has not been
> started on them.
> 5. The backup nodes receive the requests to update the key. Since the update
> operation was mapped to the already completed topology version, the backup
> nodes successfully update the key, ignoring the fact that the PME related to
> the snapshot operation is in progress.
> 6. The backup nodes complete the PME and finish the snapshot procedure.
> 7. As a result, the snapshot on the backup nodes includes the updated key, while
> the snapshot on the primary node does not.
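> To make this timeline concrete, below is a compressed, non-deterministic sketch
> of steps 1-5 (node and cache names are illustrative, persistence configuration
> is omitted): whether the update actually falls into the PME window of the
> snapshot depends on internal timing, which is why the reproducer is flaky.
> {code:java}
> import org.apache.ignite.Ignite;
> import org.apache.ignite.IgniteCache;
> import org.apache.ignite.Ignition;
> import org.apache.ignite.cache.CacheAtomicityMode;
> import org.apache.ignite.configuration.CacheConfiguration;
> import org.apache.ignite.configuration.IgniteConfiguration;
> import org.apache.ignite.lang.IgniteFuture;
>
> public class SnapshotRaceSketch {
>     public static void main(String[] args) throws Exception {
>         // Server node holding the data (the attached test starts three of them).
>         // Native persistence must be enabled for snapshots (omitted here).
>         Ignite server = Ignition.start(new IgniteConfiguration()
>             .setIgniteInstanceName("server"));
>
>         // Step 1: the key update is issued from a (thick) client node.
>         Ignite client = Ignition.start(new IgniteConfiguration()
>             .setIgniteInstanceName("client")
>             .setClientMode(true));
>
>         IgniteCache<Integer, Integer> cache = client.getOrCreateCache(
>             new CacheConfiguration<Integer, Integer>("tx-cache")
>                 .setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL)
>                 .setBackups(2));
>
>         // Step 2: the snapshot is started concurrently; it triggers the PME
>         // (free switch) that increments the minor topology version.
>         IgniteFuture<Void> snpFut = server.snapshot().createSnapshot("snap-race");
>
>         // Steps 3-5: if the put was mapped on the pre-snapshot topology version,
>         // the primary may apply it after its local snapshot start while the
>         // backups apply it before theirs, yielding the divergent partition
>         // hashes shown above.
>         cache.put(42, 42);
>
>         snpFut.get();
>
>         client.close();
>         server.close();
>     }
> }
> {code}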
--
This message was sent by Atlassian Jira
(v8.20.10#820010)