[ 
https://issues.apache.org/jira/browse/RATIS-2315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze reopened RATIS-2315:
-------------------------------

Let's resolve this as a duplicate of RATIS-2245.

> Ignore ExecutionException during take checkpoint check
> ------------------------------------------------------
>
>                 Key: RATIS-2315
>                 URL: https://issues.apache.org/jira/browse/RATIS-2315
>             Project: Ratis
>          Issue Type: Improvement
>          Components: StateMachine
>            Reporter: Sammi Chen
>            Priority: Major
>             Fix For: 3.2.0
>
>
> When SCM reapplies the Raft log during restart, SCMStateMachine#applyTransaction can 
> execute 
> {code:java}
> "applyTransactionFuture.completeExceptionally(ex);"
> {code}
> for the ContainerStateManagerImpl#addContainer operation if it fails at 
> {code:java}
> pipelineManager.addContainerToPipeline(pipelineID, containerID);
> {code}
> The failure message looks like "Cannot add container to 
> pipeline=PipelineID=b2f717d8-3912-424c-b42a-e0b52c305c97 in closed state". 
> This did not crash the SCM when it happened after the SCM had started and was 
> running. It also did not crash every SCM peer in the Raft group. The root cause is in 
> StateMachineUpdater#run -> StateMachineUpdater#checkAndTakeSnapshot:
> {code:java}
> private void checkAndTakeSnapshot(
>     MemoizedSupplier<List<CompletableFuture<Message>>> futures)
>     throws ExecutionException, InterruptedException {
>   // check if need to trigger a snapshot
>   if (shouldTakeSnapshot()) {
>     if (futures.isInitialized()) {
>       JavaUtils.allOf(futures.get()).get();
>     }
>     takeSnapshot();
>   }
> }
> {code}
> When shouldTakeSnapshot() is false, the results of the futures are ignored. 
> When shouldTakeSnapshot() is true and one of the futures completed 
> exceptionally, checkAndTakeSnapshot throws an ExecutionException, which in turn 
> shuts down the Raft server in StateMachineUpdater#run. 
> So the behavior differs depending on whether shouldTakeSnapshot() returns false 
> or true. It would be better to align the two cases. The proposal of this JIRA is 
> to ignore the ExecutionException when shouldTakeSnapshot() is true. 
> The above problem was reported by and co-analyzed with "Hao Guo". 
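
The asymmetry described above can be reproduced outside Ratis with plain CompletableFuture. The following is only a sketch under stated assumptions: the class SnapshotCheckSketch and the methods checkCurrent and checkProposed are hypothetical names, CompletableFuture.allOf stands in for JavaUtils.allOf, and a boolean return value stands in for takeSnapshot(). It is not the actual Ratis code or the actual fix.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

// Hypothetical stand-alone sketch; not part of Ratis.
public class SnapshotCheckSketch {

  // Mirrors the quoted logic: futures are awaited only when a snapshot is due.
  // Returns true where the real code would call takeSnapshot().
  static boolean checkCurrent(boolean shouldTakeSnapshot,
      List<CompletableFuture<String>> futures)
      throws ExecutionException, InterruptedException {
    if (shouldTakeSnapshot) {
      // Assumption: CompletableFuture.allOf is used in place of JavaUtils.allOf.
      CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).get();
      return true;
    }
    return false;
  }

  // One possible reading of the proposal: still wait for the futures,
  // but ignore an ExecutionException before taking the snapshot.
  static boolean checkProposed(boolean shouldTakeSnapshot,
      List<CompletableFuture<String>> futures) throws InterruptedException {
    if (shouldTakeSnapshot) {
      try {
        CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).get();
      } catch (ExecutionException e) {
        // Ignored, matching what happens when no snapshot is due.
      }
      return true;
    }
    return false;
  }

  public static void main(String[] args) throws Exception {
    CompletableFuture<String> failed = new CompletableFuture<>();
    failed.completeExceptionally(
        new IllegalStateException("Cannot add container to pipeline in closed state"));
    List<CompletableFuture<String>> futures =
        Arrays.asList(CompletableFuture.completedFuture("ok"), failed);

    // No snapshot due: the failed future is never inspected.
    System.out.println("no snapshot due, snapshot taken: " + checkCurrent(false, futures));

    // Snapshot due: the failed future surfaces as an ExecutionException.
    try {
      checkCurrent(true, futures);
    } catch (ExecutionException e) {
      System.out.println("snapshot due, current logic threw ExecutionException");
    }

    // Snapshot due, proposed logic: the exception is swallowed.
    System.out.println("snapshot due, proposed logic, snapshot taken: "
        + checkProposed(true, futures));
  }
}
```

With the same failed future, checkCurrent only throws when a snapshot is due, which is the inconsistency the description points out. Whether swallowing the exception is the right alignment, rather than failing in both cases, is a design question this sketch does not settle.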



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
