[ https://issues.apache.org/jira/browse/RATIS-2315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tsz-wo Sze reopened RATIS-2315:
-------------------------------

Let's resolve this as a duplicate of RATIS-2245.

> Ignore ExecutionException during take checkpoint check
> ------------------------------------------------------
>
>                 Key: RATIS-2315
>                 URL: https://issues.apache.org/jira/browse/RATIS-2315
>             Project: Ratis
>          Issue Type: Improvement
>          Components: StateMachine
>            Reporter: Sammi Chen
>            Priority: Major
>             Fix For: 3.2.0
>
>
> When the SCM Raft log is reapplied during restart,
> SCMStateMachine#applyTransaction could execute
> {code:java}
> "applyTransactionFuture.completeExceptionally(ex);"
> {code}
> for a ContainerStateManagerImpl#addContainer operation, once it fails at
> {code:java}
> pipelineManager.addContainerToPipeline(pipelineID, containerID);
> {code}
> The failure message looks like "Cannot add container to
> pipeline=PipelineID=b2f717d8-3912-424c-b42a-e0b52c305c97 in closed state".
> This doesn't crash the SCM if it happens after the SCM has started and is
> running. It also didn't crash every SCM peer in the Raft group. The root
> cause is StateMachineUpdater#run -> StateMachineUpdater#checkAndTakeSnapshot
> {code:java}
> private void checkAndTakeSnapshot(
>     MemoizedSupplier<List<CompletableFuture<Message>>> futures)
>     throws ExecutionException, InterruptedException {
>   // check if need to trigger a snapshot
>   if (shouldTakeSnapshot()) {
>     if (futures.isInitialized()) {
>       JavaUtils.allOf(futures.get()).get();
>     }
>     takeSnapshot();
>   }
> }
> {code}
> When shouldTakeSnapshot() is false, the method doesn't care about the
> results of the futures.
> When shouldTakeSnapshot() is true and one of the futures completed
> exceptionally, checkAndTakeSnapshot throws ExecutionException, which in turn
> shuts down the Raft server in StateMachineUpdater#run.
> So the behavior differs depending on whether shouldTakeSnapshot() returns
> false or true. It's better to have aligned behavior. The proposal of this
> JIRA is to ignore the ExecutionException when shouldTakeSnapshot() is true.
> The above problem is reported by and co-analyzed with "Hao Guo".
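
The difference between the two code paths described in the issue can be reproduced with plain JDK futures. The following is only a minimal sketch, not the Ratis code or the actual fix: the class name SnapshotCheckSketch and the methods strictCheck, lenientCheck and sampleFutures are hypothetical, CompletableFuture.allOf stands in for JavaUtils.allOf, and a boolean return value stands in for takeSnapshot().

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

public class SnapshotCheckSketch {

  /** Mirrors the reported behavior: a failed future aborts the snapshot check. */
  public static boolean strictCheck(boolean shouldTakeSnapshot,
      List<CompletableFuture<String>> futures)
      throws ExecutionException, InterruptedException {
    if (shouldTakeSnapshot) {
      // get() throws ExecutionException if any future completed exceptionally.
      CompletableFuture.allOf(futures.toArray(new CompletableFuture<?>[0])).get();
      return true; // stands in for takeSnapshot()
    }
    return false; // the futures are never inspected on this path
  }

  /** Sketch of the proposal: wait for the futures but ignore their failures. */
  public static boolean lenientCheck(boolean shouldTakeSnapshot,
      List<CompletableFuture<String>> futures) throws InterruptedException {
    if (shouldTakeSnapshot) {
      try {
        CompletableFuture.allOf(futures.toArray(new CompletableFuture<?>[0])).get();
      } catch (ExecutionException e) {
        // Ignored on purpose, matching the path where no snapshot is due.
      }
      return true; // stands in for takeSnapshot()
    }
    return false;
  }

  /** One successful future and one failed future, as in the reported case. */
  public static List<CompletableFuture<String>> sampleFutures() {
    CompletableFuture<String> ok = CompletableFuture.completedFuture("applied");
    CompletableFuture<String> failed = new CompletableFuture<>();
    failed.completeExceptionally(
        new IllegalStateException("Cannot add container to pipeline in closed state"));
    return Arrays.asList(ok, failed);
  }

  public static void main(String[] args) throws Exception {
    // No snapshot due: the failed future goes unnoticed.
    System.out.println("strict, no snapshot: " + strictCheck(false, sampleFutures()));
    // Snapshot due: the same failed future now surfaces as an exception.
    try {
      strictCheck(true, sampleFutures());
    } catch (ExecutionException e) {
      System.out.println("strict, snapshot due: " + e.getCause().getMessage());
    }
    // Proposed behavior: the failure is swallowed and the snapshot proceeds.
    System.out.println("lenient, snapshot due: " + lenientCheck(true, sampleFutures()));
  }
}
```

In this sketch the lenient variant still waits for every future to complete before the snapshot step; it only drops the exception. Whether the real change should also log the failure is a design question this sketch does not settle.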