[ 
https://issues.apache.org/jira/browse/RATIS-2315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sammi Chen updated RATIS-2315:
------------------------------
    Description: 
When SCM raft reapply during restart, SCMStateMachine#applyTransaction could 
execute 

{code:java}
"applyTransactionFuture.completeExceptionally(ex);"
{code}
 for ContainerStateManagerImpl#addContainer operation, once it fails at 


{code:java}
pipelineManager.addContainerToPipeline(pipelineID, containerID);
{code}

The failure message is likes "Cannot add container to 
pipeline=PipelineID=b2f717d8-3912-424c-b42a-e0b52c305c97 in closed state". 

This didn't crash the SCM if it happens after SCM has started and running. It 
also did't crash every peer of SCM in the raft group.  The root cause is 
StateMachineUpdater#run -> StateMachineUpdater#checkAndTakeSnapshot

{code:java}
private void 
checkAndTakeSnapshot(MemoizedSupplier<List<CompletableFuture<Message>>> futures)
      throws ExecutionException, InterruptedException {
    // check if need to trigger a snapshot
    if (shouldTakeSnapshot()) {
      if (futures.isInitialized()) {
        JavaUtils.allOf(futures.get()).get();
      }

      takeSnapshot();
    }
  }
{code}

When shouldTakeSnapshot() is false, it doesn't care about the futures result. 
When shouldTakeSnapshot is true, if one of futures throws exception, 
checkAndTakeSnapshot will throws ExecutionException, which in turn shutdown the 
raft server in  StateMachineUpdater#run. 

So the behavior when shouldTakeSnapshot false, and true are different. It's 
better have the aligned behavior. The proposal of this JIRA is to ignore the 
ExecutionException exception when shouldTakeSnapshot() is true. 

The above problem is reported and co-analyzed by "Hao Guo". 


  was:
When SCM raft reapply during restart, SCMStateMachine#applyTransaction could 
execute 

{code:java}
"applyTransactionFuture.completeExceptionally(ex);"
{code}
 for ContainerStateManagerImpl#addContainer operation, once it fails at 


{code:java}
pipelineManager.addContainerToPipeline(pipelineID, containerID);
{code}

The failure message is likes "Cannot add container to 
pipeline=PipelineID=b2f717d8-3912-424c-b42a-e0b52c305c97 in closed state". 

This didn't crash the SCM if it happens after SCM has started and running. It 
also did't crash every peer of SCM in the raft group.  The root cause is 
StateMachineUpdater#run -> StateMachineUpdater#checkAndTakeSnapshot

{code:java}
private void 
checkAndTakeSnapshot(MemoizedSupplier<List<CompletableFuture<Message>>> futures)
      throws ExecutionException, InterruptedException {
    // check if need to trigger a snapshot
    if (shouldTakeSnapshot()) {
      if (futures.isInitialized()) {
        JavaUtils.allOf(futures.get()).get();
      }

      takeSnapshot();
    }
  }
{code}

When shouldTakeSnapshot() is false, it doesn't care about the futures result. 
When shouldTakeSnapshot is true, if one of futures throws exception, 
checkAndTakeSnapshot will throws ExecutionException, which in turn shutdown the 
raft server in  StateMachineUpdater#run. 

So the behavior when shouldTakeSnapshot false, and true are different. It's 
better have the aligned behavior. The proposal of this JIRA is to ignore the 
ExecutionException exception when shouldTakeSnapshot() is true. 




> Ignore ExecutionException during take checkpoint check
> ------------------------------------------------------
>
>                 Key: RATIS-2315
>                 URL: https://issues.apache.org/jira/browse/RATIS-2315
>             Project: Ratis
>          Issue Type: Improvement
>            Reporter: Sammi Chen
>            Priority: Major
>
> When SCM raft reapply during restart, SCMStateMachine#applyTransaction could 
> execute 
> {code:java}
> "applyTransactionFuture.completeExceptionally(ex);"
> {code}
>  for ContainerStateManagerImpl#addContainer operation, once it fails at 
> {code:java}
> pipelineManager.addContainerToPipeline(pipelineID, containerID);
> {code}
> The failure message is likes "Cannot add container to 
> pipeline=PipelineID=b2f717d8-3912-424c-b42a-e0b52c305c97 in closed state". 
> This didn't crash the SCM if it happens after SCM has started and running. It 
> also did't crash every peer of SCM in the raft group.  The root cause is 
> StateMachineUpdater#run -> StateMachineUpdater#checkAndTakeSnapshot
> {code:java}
> private void 
> checkAndTakeSnapshot(MemoizedSupplier<List<CompletableFuture<Message>>> 
> futures)
>       throws ExecutionException, InterruptedException {
>     // check if need to trigger a snapshot
>     if (shouldTakeSnapshot()) {
>       if (futures.isInitialized()) {
>         JavaUtils.allOf(futures.get()).get();
>       }
>       takeSnapshot();
>     }
>   }
> {code}
> When shouldTakeSnapshot() is false, it doesn't care about the futures result. 
> When shouldTakeSnapshot is true, if one of futures throws exception, 
> checkAndTakeSnapshot will throws ExecutionException, which in turn shutdown 
> the raft server in  StateMachineUpdater#run. 
> So the behavior when shouldTakeSnapshot false, and true are different. It's 
> better have the aligned behavior. The proposal of this JIRA is to ignore the 
> ExecutionException exception when shouldTakeSnapshot() is true. 
> The above problem is reported and co-analyzed by "Hao Guo". 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to