[
https://issues.apache.org/jira/browse/IGNITE-23702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mikhail Efremov updated IGNITE-23702:
-------------------------------------
Epic Link: IGNITE-23931 (was: IGNITE-21389)
> Incorrect HB in deferred replica stop on partition restart
> ----------------------------------------------------------
>
> Key: IGNITE-23702
> URL: https://issues.apache.org/jira/browse/IGNITE-23702
> Project: Ignite
> Issue Type: Bug
> Reporter: Denis Chudov
> Priority: Major
> Labels: ignite-3
> Attachments: _Integration_Tests_Integration_CLI_10340.log
>
>
> *Stack trace:*
> {code:java}
> java.lang.AssertionError: Unexpected replica reservation with STOPPING state
> [groupId=18_part_1].
> at
> org.apache.ignite.internal.replicator.ReplicaManager$ReplicaStateManager.reserveReplica(ReplicaManager.java:1560)
> ~[ignite-replicator-9.0.127-SNAPSHOT.jar:?]
> at
> org.apache.ignite.internal.replicator.ReplicaImpl.waitForActualState(ReplicaImpl.java:306)
> ~[ignite-replicator-9.0.127-SNAPSHOT.jar:?]
> at
> org.apache.ignite.internal.replicator.ReplicaImpl.lambda$processLeaseGrantedMessage$6(ReplicaImpl.java:240)
> ~[ignite-replicator-9.0.127-SNAPSHOT.jar:?]
> at
> java.base/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1106)
> ~[?:?]
> at
> java.base/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235)
> ~[?:?]
> at
> org.apache.ignite.internal.replicator.ReplicaImpl.lambda$processLeaseGrantedMessage$7(ReplicaImpl.java:209)
> ~[ignite-replicator-9.0.127-SNAPSHOT.jar:?]
> at
> java.base/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1106)
> ~[?:?]
> at
> java.base/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235)
> ~[?:?]
> at
> org.apache.ignite.internal.replicator.ReplicaImpl.processLeaseGrantedMessage(ReplicaImpl.java:209)
> ~[ignite-replicator-9.0.127-SNAPSHOT.jar:?]
> at
> org.apache.ignite.internal.replicator.ReplicaImpl.processPlacementDriverMessage(ReplicaImpl.java:178)
> ~[ignite-replicator-9.0.127-SNAPSHOT.jar:?]
> at
> org.apache.ignite.internal.replicator.ReplicaManager.lambda$onPlacementDriverMessageReceived$7(ReplicaManager.java:539)
> ~[ignite-replicator-9.0.127-SNAPSHOT.jar:?]
> at
> java.base/java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:1106)
> ~[?:?]
> at
> java.base/java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2235)
> ~[?:?]
> at
> org.apache.ignite.internal.replicator.ReplicaManager.onPlacementDriverMessageReceived(ReplicaManager.java:539)
> ~[ignite-replicator-9.0.127-SNAPSHOT.jar:?]
> at
> org.apache.ignite.internal.network.TrackableNetworkMessageHandler.onReceived(TrackableNetworkMessageHandler.java:52)
> ~[ignite-network-api-9.0.127-SNAPSHOT.jar:?]
> at
> org.apache.ignite.internal.network.DefaultMessagingService.handleStartingWithFirstHandler(DefaultMessagingService.java:549)
> ~[ignite-network-9.0.127-SNAPSHOT.jar:?]
> at
> org.apache.ignite.internal.network.DefaultMessagingService.lambda$handleMessageFromNetwork$5(DefaultMessagingService.java:440)
> ~[ignite-network-9.0.127-SNAPSHOT.jar:?]
> at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> [?:?]
> at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> [?:?]
> at java.base/java.lang.Thread.run(Thread.java:834) [?:?]{code}
> *Scenario:*
> * Replica should be stopped due to partition restart;
> * replica appears to be a primary one, stop lease prolongation message is
> sent;
> * right after stop lease prolongation handling (9 ms later)
> LeaseGrantMessage is received on the same node where the primary replica was
> located;
> * 1 ms after LeaseGrantMessage is received, replica is stopped (the stop of
> raft node begins);
> * In parallel, the handling of LeaseGrantMessage continues, and the replica
> should be reserved as primary. The state is STOPPED already, and
> reservedForPrimary is true (due to previous primary state), which is
> incorrect, so the assertion is thrown.
> The correct scenario would be: reservedForPrimary becomes false during the
> handling of PRIMARY_REPLICA_EXPIRED event and reserveReplica() returns false
> due to STOPPING state. No error is thrown, replica declines the lease.
> *What needs to be fixed:*
> * missed synchronized block, which should have wrapped stopReplica method in
> ReplicaStateManager#planDeferredReplicaStop
> * incorrect lease expiration waiting: reservation flag may be not removed
> when expiration timestamp comes. This means only that lease is expired, but
> doesn't mean that PRIMARY_REPLICA_EXPIRED was handled, so reservedForPrimary
> is still true and replica stop process still cannot begin. The deferred
> replica stop should wait not for this timestamp, but the event of expiration
> of the lease having the corresponding start time, or election of a new lease
> having the start time greater than expiration timestamp.
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)