[ https://issues.apache.org/jira/browse/IGNITE-17737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mikhail Petrov updated IGNITE-17737:
------------------------------------
    Release Note: Fixed snapshot inconsistency when the snapshot is taken 
under cache workload.

> Cluster snapshots may be inconsistent under load. 
> --------------------------------------------------
>
>                 Key: IGNITE-17737
>                 URL: https://issues.apache.org/jira/browse/IGNITE-17737
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Nikita Amelchev
>            Assignee: Mikhail Petrov
>            Priority: Major
>              Labels: ise
>         Attachments: SnapshotTest.java
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Cluster snapshots may be inconsistent under load. 
> Reproducer:
> One thread generates transactional load: cache#put into a transactional cache.
> Another thread periodically takes snapshots and checks them.
> The reproducer is attached (it is flaky, so please rerun it several times); 
> a sketch of its shape follows.
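>
> A minimal sketch, assuming Ignite's public snapshot API (the attached 
> SnapshotTest.java is the authoritative reproducer; the cache name, snapshot 
> names and single-node setup here are illustrative, while the real test starts 
> several nodes with backups):
> {code:java}
> // Illustrative sketch only, NOT the attached test: one thread puts into a
> // TRANSACTIONAL cache while another creates snapshots in a loop.
> import org.apache.ignite.Ignite;
> import org.apache.ignite.IgniteCache;
> import org.apache.ignite.Ignition;
> import org.apache.ignite.cache.CacheAtomicityMode;
> import org.apache.ignite.cluster.ClusterState;
> import org.apache.ignite.configuration.*;
>
> public class SnapshotLoadSketch {
>     public static void main(String[] args) throws Exception {
>         // Snapshots require native persistence to be enabled.
>         IgniteConfiguration cfg = new IgniteConfiguration().setDataStorageConfiguration(
>             new DataStorageConfiguration().setDefaultDataRegionConfiguration(
>                 new DataRegionConfiguration().setPersistenceEnabled(true)));
>
>         try (Ignite ignite = Ignition.start(cfg)) {
>             ignite.cluster().state(ClusterState.ACTIVE);
>
>             IgniteCache<Integer, Integer> cache = ignite.getOrCreateCache(
>                 new CacheConfiguration<Integer, Integer>("tx-cache") // illustrative name
>                     .setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL)
>                     .setBackups(2));
>
>             // Thread 1: transactional load.
>             Thread loader = new Thread(() -> {
>                 for (int i = 0; !Thread.currentThread().isInterrupted(); i++)
>                     cache.put(i % 1_000, i);
>             });
>             loader.start();
>
>             // Thread 2 (here: the main thread): periodic snapshots. Each one is
>             // then checked for conflicts, e.g. with: control.sh --snapshot check <name>
>             for (int n = 0; n < 10; n++)
>                 ignite.snapshot().createSnapshot("snp-" + n).get();
>
>             loader.interrupt();
>             loader.join();
>         }
>     }
> }
> {code}
>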
> Example of a fail:
> {noformat}
> [2022-09-21T19:35:51,158][WARN ][async-runnable-runner-1][] The check 
> procedure has failed, conflict partitions has been found: 
> [counterConflicts=1, hashConflicts=1]
> [2022-09-21T19:35:51,158][WARN ][async-runnable-runner-1][] Update counter 
> conflicts:
> [2022-09-21T19:35:51,158][WARN ][async-runnable-runner-1][] Conflict 
> partition: PartitionKeyV2 [grpId=1544803905, grpName=default, partId=432]
> [2022-09-21T19:35:51,159][WARN ][async-runnable-runner-1][] Partition 
> instances: [PartitionHashRecordV2 [isPrimary=false, 
> consistentId=snapshot.SnapshotTest2, updateCntr=21, partitionState=OWNING, 
> size=19, partHash=1245894112], PartitionHashRecordV2 [isPrimary=false, 
> consistentId=snapshot.SnapshotTest0, updateCntr=22, partitionState=OWNING, 
> size=20, partHash=1705601802], PartitionHashRecordV2 [isPrimary=false, 
> consistentId=snapshot.SnapshotTest1, updateCntr=21, partitionState=OWNING, 
> size=19, partHash=1245894112]]
> [2022-09-21T19:35:51,159][WARN ][async-runnable-runner-1][] 
> [2022-09-21T19:35:51,159][WARN ][async-runnable-runner-1][] Hash conflicts:
> [2022-09-21T19:35:51,159][WARN ][async-runnable-runner-1][] Conflict 
> partition: PartitionKeyV2 [grpId=1544803905, grpName=default, partId=432]
> [2022-09-21T19:35:51,159][WARN ][async-runnable-runner-1][] Partition 
> instances: [PartitionHashRecordV2 [isPrimary=false, 
> consistentId=snapshot.SnapshotTest2, updateCntr=21, partitionState=OWNING, 
> size=19, partHash=1245894112], PartitionHashRecordV2 [isPrimary=false, 
> consistentId=snapshot.SnapshotTest0, updateCntr=22, partitionState=OWNING, 
> size=20, partHash=1705601802], PartitionHashRecordV2 [isPrimary=false, 
> consistentId=snapshot.SnapshotTest1, updateCntr=21, partitionState=OWNING, 
> size=19, partHash=1245894112]]
> [2022-09-21T19:35:51,159][WARN ][async-runnable-runner-1][] 
> {noformat}
> The following sequence of steps leads to this behaviour (a simplified model 
> of the interleaving is sketched after the list):
> 1. A client node starts a cache key update operation on the current topology 
> version.
> 2. Simultaneously, a snapshot operation is started. It causes a PME (free 
> switch) which increments the current minor topology version.
> 3. The node that is primary for the key being updated completes PME locally, 
> starts the snapshot partition copy procedure, and proceeds with the update 
> request, ignoring the fact that the request was initiated on the stale 
> topology (see IGNITE-9558). Therefore, the primary node does not include the 
> updated key in the snapshot.
> 4. Backup nodes have not yet completed PME, so their local snapshot has not 
> been started.
> 5. Backup nodes receive requests to update the key. Since the update 
> operation was mapped to the already completed topology version, the backup 
> nodes successfully update the key, ignoring the fact that the PME related to 
> the snapshot operation is in progress.
> 6. Backup nodes complete PME and finish the snapshot procedure.
> 7. As a result, the snapshots from the backup nodes include the updated key, 
> while the snapshot from the primary node does not.
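>
> To make the interleaving concrete, here is a simplified, hypothetical model 
> of the steps above (plain Java, not Ignite internals: each "node" is just a 
> key-value map plus a snapshot copy, and the race is replayed deterministically):
> {code:java}
> // Hypothetical model of steps 1-7, NOT Ignite code: each "node" holds a map
> // of cache data plus the snapshot copy it took of that data.
> import java.util.HashMap;
> import java.util.Map;
>
> public class SnapshotRaceModel {
>     static class Node {
>         final Map<Integer, Integer> data = new HashMap<>();
>         Map<Integer, Integer> snapshot;
>
>         void takeSnapshot() { snapshot = new HashMap<>(data); }
>     }
>
>     public static void main(String[] args) {
>         Node primary = new Node();
>         Node backup = new Node();
>
>         // Steps 2-3: the primary completes PME first and copies its partition
>         // BEFORE applying the in-flight update mapped to the stale topology.
>         primary.takeSnapshot();
>         primary.data.put(1, 42);
>
>         // Steps 4-6: the backup has not completed PME yet, so it applies the
>         // update first and only then starts its local snapshot.
>         backup.data.put(1, 42);
>         backup.takeSnapshot();
>
>         // Step 7: the snapshot copies diverge, which the snapshot check
>         // reports as counter/hash conflicts.
>         System.out.println("primary snapshot: " + primary.snapshot); // {}
>         System.out.println("backup  snapshot: " + backup.snapshot);  // {1=42}
>     }
> }
> {code}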



