[ 
https://issues.apache.org/jira/browse/HDDS-3459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

runzhiwang updated HDDS-3459:
-----------------------------
    Description: 
*What's the problem?*
The datanode uses a single [command-processor-thread 
|https://github.com/apache/hadoop-ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeStateMachine.java#L482]
 to process the commands sent by SCM. If that thread is blocked for a long 
time, no other command from SCM can be processed, which can cause further 
problems.

For example, suppose a group has a leader, follower1, and follower2. The steps 
to reproduce the problem are as follows:
1. Some datanodes crash, and follower2 begins streaming container data to 
another datanode. The command-processor-thread then blocks at 
[cont.writeLock()|https://github.com/apache/hadoop-ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/commandhandler/DeleteBlocksCommandHandler.java#L130]
 when it tries to delete blocks, because streaming container data needs to 
hold the container's RwLock.
2. follower2 reports a pipeline close.
3. SCM sends the close pipeline command.
4. The leader and follower1 remove the group, but follower2 cannot remove it 
because its command-processor-thread is blocked.
5. follower2 then runs LeaderElection over and over for about 13 hours, while 
the leader and follower1 respond with "group not found".
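
The blocking pattern in the steps above can be reproduced in isolation. The following is only a minimal sketch, not the Ozone code: the class name, method name, and latches are hypothetical, and it assumes the streaming side holds the read half of the container's lock while the delete-block command needs the write half.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-alone demo; none of these names come from Ozone.
public class SingleThreadCommandDemo {

  /**
   * Returns true if the "close pipeline" command was processed while the
   * streaming thread still held the container lock.
   */
  public static boolean closePipelineRanWhileStreaming() {
    ReentrantReadWriteLock containerLock = new ReentrantReadWriteLock();
    CountDownLatch streamingStarted = new CountDownLatch(1);
    CountDownLatch finishStreaming = new CountDownLatch(1);
    CountDownLatch closePipelineDone = new CountDownLatch(1);

    // Stand-in for container streaming: holds the read lock (an assumption)
    // until the test tells it to stop.
    Thread streamer = new Thread(() -> {
      containerLock.readLock().lock();
      try {
        streamingStarted.countDown();
        finishStreaming.await();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      } finally {
        containerLock.readLock().unlock();
      }
    });

    // Stand-in for the single command-processor thread.
    ExecutorService commandProcessor = Executors.newSingleThreadExecutor();
    try {
      streamer.start();
      streamingStarted.await();

      // Command 1: "delete blocks" needs the write lock, so it blocks
      // for as long as the streamer holds the read lock.
      commandProcessor.execute(() -> {
        containerLock.writeLock().lock();
        try {
          // Block deletion would happen here.
        } finally {
          containerLock.writeLock().unlock();
        }
      });

      // Command 2: "close pipeline" needs no container lock at all, but it
      // is queued behind command 1 on the same thread.
      commandProcessor.execute(() -> closePipelineDone.countDown());

      // Give command 2 a generous window to run while streaming continues.
      boolean ran = closePipelineDone.await(500, TimeUnit.MILLISECONDS);

      // Release the streamer so both queued commands can finish.
      finishStreaming.countDown();
      streamer.join();
      commandProcessor.shutdown();
      commandProcessor.awaitTermination(5, TimeUnit.SECONDS);
      return ran;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("demo interrupted", e);
    } finally {
      finishStreaming.countDown();
      commandProcessor.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println("close pipeline ran while streaming: "
        + closePipelineRanWhileStreaming());
  }
}
```

In this sketch the method returns false: the second command is never reached until the streamer releases the lock, which mirrors how follower2 cannot handle the close pipeline command in step 4.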

You can see this in the following screenshot.
1. follower2 begins streaming container data





> Datanode use a single thread to process the command of scm
> ----------------------------------------------------------
>
>                 Key: HDDS-3459
>                 URL: https://issues.apache.org/jira/browse/HDDS-3459
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>            Reporter: runzhiwang
>            Assignee: runzhiwang
>            Priority: Major
>         Attachments: screenshot-1.png
>


