[ https://issues.apache.org/jira/browse/RATIS-2315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sammi Chen updated RATIS-2315: ------------------------------ Description: When SCM raft reapply during restart, SCMStateMachine#applyTransaction could execute {code:java} "applyTransactionFuture.completeExceptionally(ex);" {code} for ContainerStateManagerImpl#addContainer operation, once it fails at {code:java} pipelineManager.addContainerToPipeline(pipelineID, containerID); {code} The failure message is likes "Cannot add container to pipeline=PipelineID=b2f717d8-3912-424c-b42a-e0b52c305c97 in closed state". This didn't crash the SCM if it happens after SCM has started and running. It also did't crash every peer of SCM in the raft group. The root cause is StateMachineUpdater#run -> StateMachineUpdater#checkAndTakeSnapshot {code:java} private void checkAndTakeSnapshot(MemoizedSupplier<List<CompletableFuture<Message>>> futures) throws ExecutionException, InterruptedException { // check if need to trigger a snapshot if (shouldTakeSnapshot()) { if (futures.isInitialized()) { JavaUtils.allOf(futures.get()).get(); } takeSnapshot(); } } {code} When shouldTakeSnapshot() is false, it doesn't care about the futures result. When shouldTakeSnapshot is true, if one of futures throws exception, checkAndTakeSnapshot will throws ExecutionException, which in turn shutdown the raft server in StateMachineUpdater#run. So the behavior when shouldTakeSnapshot false, and true are different. It's better have the aligned behavior. The proposal of this JIRA is to ignore the ExecutionException exception when shouldTakeSnapshot() is true. The above problem is reported and co-analyzed by "Hao Guo". was: When SCM raft reapply during restart, SCMStateMachine#applyTransaction could execute {code:java} "applyTransactionFuture.completeExceptionally(ex);" {code} for ContainerStateManagerImpl#addContainer operation, once it fails at {code:java} pipelineManager.addContainerToPipeline(pipelineID, containerID); {code} The failure message is likes "Cannot add container to pipeline=PipelineID=b2f717d8-3912-424c-b42a-e0b52c305c97 in closed state". This didn't crash the SCM if it happens after SCM has started and running. It also did't crash every peer of SCM in the raft group. The root cause is StateMachineUpdater#run -> StateMachineUpdater#checkAndTakeSnapshot {code:java} private void checkAndTakeSnapshot(MemoizedSupplier<List<CompletableFuture<Message>>> futures) throws ExecutionException, InterruptedException { // check if need to trigger a snapshot if (shouldTakeSnapshot()) { if (futures.isInitialized()) { JavaUtils.allOf(futures.get()).get(); } takeSnapshot(); } } {code} When shouldTakeSnapshot() is false, it doesn't care about the futures result. When shouldTakeSnapshot is true, if one of futures throws exception, checkAndTakeSnapshot will throws ExecutionException, which in turn shutdown the raft server in StateMachineUpdater#run. So the behavior when shouldTakeSnapshot false, and true are different. It's better have the aligned behavior. The proposal of this JIRA is to ignore the ExecutionException exception when shouldTakeSnapshot() is true. > Ignore ExecutionException during take checkpoint check > ------------------------------------------------------ > > Key: RATIS-2315 > URL: https://issues.apache.org/jira/browse/RATIS-2315 > Project: Ratis > Issue Type: Improvement > Reporter: Sammi Chen > Priority: Major > > When SCM raft reapply during restart, SCMStateMachine#applyTransaction could > execute > {code:java} > "applyTransactionFuture.completeExceptionally(ex);" > {code} > for ContainerStateManagerImpl#addContainer operation, once it fails at > {code:java} > pipelineManager.addContainerToPipeline(pipelineID, containerID); > {code} > The failure message is likes "Cannot add container to > pipeline=PipelineID=b2f717d8-3912-424c-b42a-e0b52c305c97 in closed state". > This didn't crash the SCM if it happens after SCM has started and running. It > also did't crash every peer of SCM in the raft group. The root cause is > StateMachineUpdater#run -> StateMachineUpdater#checkAndTakeSnapshot > {code:java} > private void > checkAndTakeSnapshot(MemoizedSupplier<List<CompletableFuture<Message>>> > futures) > throws ExecutionException, InterruptedException { > // check if need to trigger a snapshot > if (shouldTakeSnapshot()) { > if (futures.isInitialized()) { > JavaUtils.allOf(futures.get()).get(); > } > takeSnapshot(); > } > } > {code} > When shouldTakeSnapshot() is false, it doesn't care about the futures result. > When shouldTakeSnapshot is true, if one of futures throws exception, > checkAndTakeSnapshot will throws ExecutionException, which in turn shutdown > the raft server in StateMachineUpdater#run. > So the behavior when shouldTakeSnapshot false, and true are different. It's > better have the aligned behavior. The proposal of this JIRA is to ignore the > ExecutionException exception when shouldTakeSnapshot() is true. > The above problem is reported and co-analyzed by "Hao Guo". -- This message was sent by Atlassian Jira (v8.20.10#820010)