[
https://issues.apache.org/jira/browse/IGNITE-17737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mikhail Petrov updated IGNITE-17737:
------------------------------------
Description:
Cluster snapshots may be inconsistent under load.
Reproducer:
One thread performs a transactional load: cache#put into a transactional cache.
Another thread takes periodic snapshots and checks them.
Reproducer attached (the test is flaky; it may need several runs to fail).
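Below is a minimal sketch of the reproducer's shape, for orientation only: the attached SnapshotTest.java is the authoritative version. Assumptions are marked in comments (a single embedded node instead of the test's three, illustrative cache/snapshot names, fixed iteration counts).
{code:java}
import java.util.concurrent.atomic.AtomicBoolean;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheAtomicityMode;
import org.apache.ignite.cluster.ClusterState;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class SnapshotUnderLoadSketch {
    public static void main(String[] args) throws Exception {
        // Snapshots require persistence. The real test starts 3 server nodes
        // (SnapshotTest0..2 in the log below); one node is shown for brevity.
        IgniteConfiguration cfg = new IgniteConfiguration().setDataStorageConfiguration(
            new DataStorageConfiguration().setDefaultDataRegionConfiguration(
                new DataRegionConfiguration().setPersistenceEnabled(true)));

        try (Ignite ignite = Ignition.start(cfg)) {
            ignite.cluster().state(ClusterState.ACTIVE);

            IgniteCache<Integer, Integer> cache = ignite.createCache(
                new CacheConfiguration<Integer, Integer>("tx-cache") // illustrative name
                    .setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL)
                    .setBackups(2));

            AtomicBoolean stop = new AtomicBoolean();

            // Thread 1: transactional load.
            Thread loader = new Thread(() -> {
                int i = 0;

                while (!stop.get())
                    cache.put(i % 1_000, i++);
            });
            loader.start();

            // Thread 2 (here: the main thread): periodic snapshots. Each snapshot
            // is then checked; the attached test runs the partition hash check
            // whose failing output is quoted below.
            for (int n = 0; n < 10; n++)
                ignite.snapshot().createSnapshot("snp" + n).get();

            stop.set(true);
            loader.join();
        }
    }
}
{code}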
Example of a failure:
{noformat}
[2022-09-21T19:35:51,158][WARN ][async-runnable-runner-1][] The check procedure has failed, conflict partitions has been found: [counterConflicts=1, hashConflicts=1]
[2022-09-21T19:35:51,158][WARN ][async-runnable-runner-1][] Update counter conflicts:
[2022-09-21T19:35:51,158][WARN ][async-runnable-runner-1][] Conflict partition: PartitionKeyV2 [grpId=1544803905, grpName=default, partId=432]
[2022-09-21T19:35:51,159][WARN ][async-runnable-runner-1][] Partition instances: [PartitionHashRecordV2 [isPrimary=false, consistentId=snapshot.SnapshotTest2, updateCntr=21, partitionState=OWNING, size=19, partHash=1245894112], PartitionHashRecordV2 [isPrimary=false, consistentId=snapshot.SnapshotTest0, updateCntr=22, partitionState=OWNING, size=20, partHash=1705601802], PartitionHashRecordV2 [isPrimary=false, consistentId=snapshot.SnapshotTest1, updateCntr=21, partitionState=OWNING, size=19, partHash=1245894112]]
[2022-09-21T19:35:51,159][WARN ][async-runnable-runner-1][]
[2022-09-21T19:35:51,159][WARN ][async-runnable-runner-1][] Hash conflicts:
[2022-09-21T19:35:51,159][WARN ][async-runnable-runner-1][] Conflict partition: PartitionKeyV2 [grpId=1544803905, grpName=default, partId=432]
[2022-09-21T19:35:51,159][WARN ][async-runnable-runner-1][] Partition instances: [PartitionHashRecordV2 [isPrimary=false, consistentId=snapshot.SnapshotTest2, updateCntr=21, partitionState=OWNING, size=19, partHash=1245894112], PartitionHashRecordV2 [isPrimary=false, consistentId=snapshot.SnapshotTest0, updateCntr=22, partitionState=OWNING, size=20, partHash=1705601802], PartitionHashRecordV2 [isPrimary=false, consistentId=snapshot.SnapshotTest1, updateCntr=21, partitionState=OWNING, size=19, partHash=1245894112]]
[2022-09-21T19:35:51,159][WARN ][async-runnable-runner-1][]
{noformat}
The following sequence of steps can lead to this behaviour (a toy model of the interleaving is sketched after the list):
1. A client node starts a cache key update operation on the current topology version.
2. Simultaneously, a snapshot operation is started. It causes a PME (free switch) which increments the current minor topology version.
3. The node that is primary for the key being updated completes PME locally, starts the snapshot partition copy procedure and proceeds with the update request, ignoring the fact that the request was initiated on the stale topology (see IGNITE-9558). Therefore, the primary node does not include the updated key in the snapshot.
4. The backup nodes have not yet completed PME, so the snapshot has not been started on them.
5. The backup nodes receive requests to update the key. Since the update operation was mapped to an already completed topology version, the backup nodes successfully apply the update, ignoring the fact that the PME related to the snapshot operation is still in progress.
6. The backup nodes complete PME and finish the snapshot procedure.
7. As a result, the snapshot taken from the backup nodes includes the updated key, while the snapshot taken from the primary node does not.
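To make the ordering easier to see, here is a toy model of steps 2-7 in plain Java. This is emphatically not Ignite code: Replica, apply() and takeSnapshot() are invented stand-ins that only model when each copy of the partition is written to the snapshot relative to the in-flight update.
{code:java}
import java.util.HashMap;
import java.util.Map;

public class SnapshotRaceModel {
    /** Stand-in for one copy of a partition; not an Ignite class. */
    static class Replica {
        long updateCntr;                    // models the partition update counter
        long snapCntr;                      // counter value captured by the snapshot
        final Map<Integer, Integer> part = new HashMap<>();
        Map<Integer, Integer> snap;         // partition copy taken by the snapshot

        void apply(int key, int val) { part.put(key, val); updateCntr++; }
        void takeSnapshot() { snap = new HashMap<>(part); snapCntr = updateCntr; }
    }

    public static void main(String[] args) {
        Replica primary = new Replica(), backup1 = new Replica(), backup2 = new Replica();

        // Steps 2-3: the snapshot PME completes on the primary first; it copies its
        // partition, then applies the in-flight update mapped to the old topology.
        primary.takeSnapshot();
        primary.apply(42, 1);

        // Steps 4-5: the backups have not completed PME yet, so the update reaches
        // them before their snapshot starts.
        backup1.apply(42, 1);
        backup2.apply(42, 1);

        // Step 6: the backups complete PME and copy their partitions.
        backup1.takeSnapshot();
        backup2.takeSnapshot();

        // Step 7: the copies disagree -- the kind of divergence the check procedure
        // reports above as update counter and hash conflicts.
        System.out.println("primary: cntr=" + primary.snapCntr + " data=" + primary.snap); // cntr=0 data={}
        System.out.println("backup1: cntr=" + backup1.snapCntr + " data=" + backup1.snap); // cntr=1 data={42=1}
        System.out.println("backup2: cntr=" + backup2.snapCntr + " data=" + backup2.snap); // cntr=1 data={42=1}
    }
}
{code}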
> Cluster snapshots may be inconsistent under load.
> --------------------------------------------------
>
> Key: IGNITE-17737
> URL: https://issues.apache.org/jira/browse/IGNITE-17737
> Project: Ignite
> Issue Type: Bug
> Reporter: Nikita Amelchev
> Assignee: Mikhail Petrov
> Priority: Major
> Labels: ise
> Attachments: SnapshotTest.java
>
> Time Spent: 10m
> Remaining Estimate: 0h
--
This message was sent by Atlassian Jira
(v8.20.10#820010)