[ 
https://issues.apache.org/jira/browse/HDDS-7738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duong updated HDDS-7738:
------------------------
    Description: 
This is similar to HDDS-5843, but in a different scenario.

 

An Ozone customer encountered this issue after a container (c1) was allocated 
with a newly created pipeline (p1). The chain of events is as follows:
 # SCM processes the pipeline creation transaction for *p1* => *p1* is {*}created{*}.
 # SCM receives a request to close *p1* from a datanode (see the previous 
comment)
=> *p1* is {*}closed{*}.
=> SCM also tries to find and close the relevant containers. At this point, 
container *c1* doesn't *exist* yet, so it {*}can't be closed{*}.
 # SCM processes the container *c1* allocation transaction => it fails because 
*p1* is already *closed*.
=> SCM terminates, and neither transaction #1 nor #3 is committed (as Ratis 
commits transactions in chunks).

Because the transactions are not committed, whenever SCM restarts it goes 
through the same steps #1 and #3 and terminates again.
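
To make the restart loop concrete, here is a toy model of the replay. It is only a sketch: the class, enum, and method names are hypothetical and are not Ozone or Ratis APIs, and it assumes the pipeline close is replayed (or is otherwise still in effect) between the two uncommitted transactions.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model of the crash loop. All names here are hypothetical, not Ozone/Ratis APIs. */
class ReplayLoopSketch {
    enum Kind { CREATE_PIPELINE, CLOSE_PIPELINE, ADD_CONTAINER }

    /**
     * Replays the uncommitted entries from the start, as happens after every restart.
     * Returns the index of the entry that would terminate the process, or -1 if none.
     */
    static int replay(List<Kind> log) {
        boolean pipelineExists = false;
        boolean pipelineClosed = false;
        for (int i = 0; i < log.size(); i++) {
            switch (log.get(i)) {
                case CREATE_PIPELINE:
                    pipelineExists = true;
                    pipelineClosed = false;
                    break;
                case CLOSE_PIPELINE:
                    pipelineClosed = true;
                    break;
                case ADD_CONTAINER:
                    if (!pipelineExists || pipelineClosed) {
                        return i; // models "Terminating with exit status 1"
                    }
                    break;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<Kind> log = Arrays.asList(
            Kind.CREATE_PIPELINE, Kind.CLOSE_PIPELINE, Kind.ADD_CONTAINER);
        // Nothing was committed, so each restart sees the same entries and fails the same way.
        for (int restart = 1; restart <= 3; restart++) {
            System.out.println("restart " + restart + ": failed at entry " + replay(log));
        }
    }
}
```

Because the replayed input never changes, every simulated restart stops at the same entry, which is the behaviour described above.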

Solution: SCM should not terminate when adding a container to a closed 
pipeline. The fix is similar to HDDS-5843.
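
One way to picture the proposed behaviour is the sketch below. It is not the actual patch: ClosedPipelineSketch, PipelineClosedException, and applyAddContainer are hypothetical names used for illustration, and the real SCM classes in the stack trace have different signatures. The idea shown is only that the apply step treats "pipeline already closed" as a recoverable condition instead of letting it reach the state machine as a fatal error.

```java
import java.io.IOException;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch only. These are hypothetical stand-ins, not the real SCM classes. */
class ClosedPipelineSketch {
    /** Hypothetical marker exception for the "pipeline already closed" case. */
    static class PipelineClosedException extends IOException {
        PipelineClosedException(String message) {
            super(message);
        }
    }

    private final Map<String, Boolean> pipelineOpen = new HashMap<>();
    private final Map<String, Set<Long>> containersByPipeline = new HashMap<>();

    void createPipeline(String pipelineId) {
        pipelineOpen.put(pipelineId, true);
        containersByPipeline.put(pipelineId, new HashSet<>());
    }

    void closePipeline(String pipelineId) {
        pipelineOpen.replace(pipelineId, false); // no-op for an unknown pipeline
    }

    /** Mirrors the failing call: refuses to attach a container to a closed pipeline. */
    void addContainerToPipeline(String pipelineId, long containerId)
            throws PipelineClosedException {
        Boolean open = pipelineOpen.get(pipelineId);
        if (open == null) {
            throw new IllegalArgumentException("Unknown pipeline " + pipelineId);
        }
        if (!open) {
            throw new PipelineClosedException(
                "Cannot add container to pipeline=" + pipelineId + " in closed state");
        }
        containersByPipeline.get(pipelineId).add(containerId);
    }

    /**
     * Apply step for a container allocation transaction.
     * Returns true if the container was attached, false if it was skipped
     * because the pipeline is closed. It never propagates the closed-pipeline error.
     */
    boolean applyAddContainer(String pipelineId, long containerId) {
        try {
            addContainerToPipeline(pipelineId, containerId);
            return true;
        } catch (PipelineClosedException e) {
            System.err.println("WARN: " + e.getMessage() + "; continuing");
            return false;
        }
    }

    int containerCount(String pipelineId) {
        return containersByPipeline
            .getOrDefault(pipelineId, Collections.<Long>emptySet()).size();
    }

    public static void main(String[] args) {
        ClosedPipelineSketch scm = new ClosedPipelineSketch();
        scm.createPipeline("p1");                              // step 1
        scm.closePipeline("p1");                               // step 2
        System.out.println(scm.applyAddContainer("p1", 1L));   // step 3 no longer fatal
    }
}
```

With this shape, replaying steps #1 and #3 after a restart completes instead of terminating. What should then happen to the container itself is left open in this sketch.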

 

Stacktrace:
{code:java}
2022-12-28 11:53:20,465  ERROR org.apache.ratis.statemachine.StateMachine: Terminating with exit status 1: Cannot add container to pipeline=PipelineID=7f97fc6a-c31a-4978-be3b-e38af7cd023f in closed state
java.io.IOException: Cannot add container to pipeline=PipelineID=7f97fc6a-c31a-4978-be3b-e38af7cd023f in closed state
        at org.apache.hadoop.hdds.scm.pipeline.PipelineStateMap.addContainerToPipeline(PipelineStateMap.java:110)
        at org.apache.hadoop.hdds.scm.pipeline.PipelineStateManagerImpl.addContainerToPipeline(PipelineStateManagerImpl.java:114)
        at jdk.internal.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invokeLocal(SCMHAInvocationHandler.java:87)
        at org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invoke(SCMHAInvocationHandler.java:72)
        at com.sun.proxy.$Proxy17.addContainerToPipeline(Unknown Source)
        at org.apache.hadoop.hdds.scm.pipeline.PipelineManagerImpl.addContainerToPipeline(PipelineManagerImpl.java:327)
        at org.apache.hadoop.hdds.scm.container.ContainerStateManagerImpl.lambda$addContainer$1(ContainerStateManagerImpl.java:309)
        at org.apache.hadoop.hdds.scm.ha.ExecutionUtil.execute(ExecutionUtil.java:59)
        at org.apache.hadoop.hdds.scm.container.ContainerStateManagerImpl.addContainer(ContainerStateManagerImpl.java:321)
        at jdk.internal.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.apache.hadoop.hdds.scm.ha.SCMStateMachine.process(SCMStateMachine.java:168)
        at org.apache.hadoop.hdds.scm.ha.SCMStateMachine.applyTransaction(SCMStateMachine.java:139)
        at org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1588)
        at org.apache.ratis.server.impl.StateMachineUpdater.applyLog(StateMachineUpdater.java:239)
        at org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:182)
        at java.base/java.lang.Thread.run(Thread.java:829)
{code}


> SCM terminates when adding container to a closed pipeline
> ---------------------------------------------------------
>
>                 Key: HDDS-7738
>                 URL: https://issues.apache.org/jira/browse/HDDS-7738
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: Duong
>            Assignee: Duong
>            Priority: Critical
>              Labels: pull-request-available



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
