[ 
https://issues.apache.org/jira/browse/HDDS-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4630:
----------------------------
    Description: 
This deadlock was found when trying to replace the MockRatisServer with a 
single-server SCMRatisServer in MiniOzoneCluster.

It can be reproduced with the test case 
TestContainerStateMachineFlushDelay#testContainerStateMachineFailures, after 
replacing the mock Ratis server with the real one.

The root cause is that closing a pipeline first closes the open containers of 
that pipeline, and then removes the pipeline.

The contention here is:
 # ContainerManager has committed the log entry containing 
updateContainerState, and the StateMachineUpdater is applying it, waiting for 
the lock of PipelineManagerV2Impl. When a container transitions from open to a 
non-open state, it must call PipelineManager#removeContainerFromPipeline, which 
needs the lock of PipelineManagerV2Impl.
 # 

and is applying  
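
The contention in item 1 can be modelled with a minimal sketch. This is not 
Ozone code: the class and method names are hypothetical, a ReentrantLock stands 
in for the lock of PipelineManagerV2Impl, and a single-thread executor stands in 
for the StateMachineUpdater. The other side of the contention is not spelled out 
in the description above, so the sketch ASSUMES that the caller closing the 
pipeline holds the pipeline manager lock while it waits for the container state 
update to be applied. A timed tryLock is used so the demo reports the blocked 
apply instead of hanging.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class PipelineCloseDeadlockSketch {
    // Stand-in for the lock of PipelineManagerV2Impl (hypothetical model).
    static final ReentrantLock pipelineManagerLock = new ReentrantLock();

    // Stand-in for the single StateMachineUpdater thread that applies entries.
    static final ExecutorService stateMachineUpdater =
        Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "StateMachineUpdater");
            t.setDaemon(true); // do not keep the JVM alive in this demo
            return t;
        });

    // Models applying updateContainerState: the container leaves the open
    // state, so it must be removed from its pipeline, which needs the lock.
    // Returns false if the lock is not obtained in time; real code would block.
    static boolean applyUpdateContainerState() throws InterruptedException {
        if (!pipelineManagerLock.tryLock(200, TimeUnit.MILLISECONDS)) {
            return false;
        }
        try {
            return true; // removeContainerFromPipeline would run here
        } finally {
            pipelineManagerLock.unlock();
        }
    }

    // Hands the entry to the updater thread and waits until it is applied.
    static boolean submitAndWait() throws Exception {
        Callable<Boolean> entry =
            PipelineCloseDeadlockSketch::applyUpdateContainerState;
        Future<Boolean> applied = stateMachineUpdater.submit(entry);
        return applied.get();
    }

    // ASSUMPTION: the caller keeps the pipeline manager lock while waiting.
    static boolean closeContainersHoldingLock() throws Exception {
        pipelineManagerLock.lock();
        try {
            return submitAndWait();
        } finally {
            pipelineManagerLock.unlock();
        }
    }

    // Same wait, but without holding the lock across it.
    static boolean closeContainersWithoutLock() throws Exception {
        return submitAndWait();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("holding lock, applied: "
            + closeContainersHoldingLock());
        System.out.println("without lock, applied: "
            + closeContainersWithoutLock());
        stateMachineUpdater.shutdown();
    }
}
```

In this model the first call reports false, because the updater thread cannot 
take the lock while the caller holds it, and the second call reports true. 
Whether the actual fix releases the lock before waiting or restructures the 
close sequence is not stated in this message.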

  was:
This deadlock was found when trying to replace the MockRatisServer with a 
single-server SCMRatisServer in MiniOzoneCluster.

It can be reproduced with the test case 
TestContainerStateMachineFlushDelay#testContainerStateMachineFailures, by 
replacing the mock Ratis server with the real one.


> Solve dead lock when PipelineActionHandler is triggered.
> --------------------------------------------------------
>
>                 Key: HDDS-4630
>                 URL: https://issues.apache.org/jira/browse/HDDS-4630
>             Project: Hadoop Distributed Data Store
>          Issue Type: Sub-task
>          Components: SCM HA
>            Reporter: Glen Geng
>            Assignee: Glen Geng
>            Priority: Major
>         Attachments: PipelineActionHander 1.png, PipelineActionHander 2.png, 
> StateMachineUpdater 1.png, StateMachineUpdater 2.png


