[jira] [Comment Edited] (HDDS-3459) Datanode use a single thread to process the command of scm

runzhiwang (Jira) Tue, 21 Apr 2020 07:06:27 -0700


    [ 
https://issues.apache.org/jira/browse/HDDS-3459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17088721#comment-17088721
 ]


runzhiwang edited comment on HDDS-3459 at 4/21/20, 2:05 PM:
------------------------------------------------------------

[~msingh] I don't think this is similar to HDDS-3451. I think we can use 
different threads for different types of commands. For example, closePipeline 
should not be blocked by deleteBlock.
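
A minimal sketch of the per-type threading idea, assuming one single-thread executor per command type. The class name PerTypeCommandDispatcher, the CommandType enum, and the dispatch method are hypothetical and only illustrate the proposal; they are not the existing DatanodeStateMachine code.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class PerTypeCommandDispatcher {
  // Hypothetical command types, not the real SCM command enum.
  public enum CommandType { CLOSE_PIPELINE, DELETE_BLOCKS }

  // One single-thread executor per command type, so a blocked handler
  // only delays later commands of the same type.
  private final Map<CommandType, ExecutorService> executors =
      new EnumMap<>(CommandType.class);

  public PerTypeCommandDispatcher() {
    for (CommandType type : CommandType.values()) {
      executors.put(type, Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "command-processor-" + type);
        t.setDaemon(true);
        return t;
      }));
    }
  }

  // Queues the handler on the executor that belongs to its command type.
  public Future<?> dispatch(CommandType type, Runnable handler) {
    return executors.get(type).submit(handler);
  }

  public void shutdown() {
    executors.values().forEach(ExecutorService::shutdownNow);
  }

  public static void main(String[] args) throws Exception {
    PerTypeCommandDispatcher dispatcher = new PerTypeCommandDispatcher();
    // Stands in for a container lock that is held for a long time.
    CountDownLatch containerLock = new CountDownLatch(1);

    dispatcher.dispatch(CommandType.DELETE_BLOCKS, () -> {
      try {
        containerLock.await(); // deleteBlock waits for the container lock
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });

    // closePipeline runs on its own thread and is not stuck behind deleteBlock.
    Future<?> close = dispatcher.dispatch(CommandType.CLOSE_PIPELINE,
        () -> System.out.println("closePipeline handled"));
    close.get(2, TimeUnit.SECONDS);

    containerLock.countDown();
    dispatcher.shutdown();
  }
}
```

Commands of the same type still run in order on their own thread, so this only removes blocking across types. Whether ordering between different command types matters would need to be checked separately.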


was (Author: yjxxtd):
[~msingh] I think it's not similar to   HDDS-3451. 

> Datanode use a single thread to process the command of scm
> ----------------------------------------------------------
>
>                 Key: HDDS-3459
>                 URL: https://issues.apache.org/jira/browse/HDDS-3459
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>            Reporter: runzhiwang
>            Assignee: runzhiwang
>            Priority: Major
>         Attachments: screenshot-1.png, screenshot-2.png, screenshot-3.png, 
> screenshot-4.png, screenshot-5.png, screenshot-6.png, screenshot-7.png
>
>
> *What's the problem ?*
> Datanode uses a single [command-processor-thread 
> |https://github.com/apache/hadoop-ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeStateMachine.java#L482]
>  to process commands from SCM. Sometimes this thread may be blocked for a very 
> long time, and then no command from SCM can be processed, which may cause 
> problems.
> For example, a group has a leader, follower1, and follower2. The steps to 
> reproduce the problem are as follows:
> 1. Some datanodes crash, and follower2 begins streaming container data to 
> another datanode. The command-processor-thread is then blocked at 
> [cont.writeLock()|https://github.com/apache/hadoop-ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/commandhandler/DeleteBlocksCommandHandler.java#L130]
>  when it tries to delete a block, because streaming container data needs to 
> hold the RwLock of the container.
> 2. follower2 reports close pipeline.
> 3. SCM sends the close pipeline command.
> 4. The leader and follower1 remove the group, but follower2 cannot remove the 
> group because the command-processor-thread is blocked.
> 5. follower2 then runs LeaderElection for about 12 hours, while the leader and 
> follower1 respond with "group not found".
> You can see this in the following screenshots.
> 1. follower2 begins streaming container data at 2020-04-17 23:38:39
>  !screenshot-1.png! 
> 2. follower2 reports close pipeline at 2020-04-18 01:14:39
>  !screenshot-2.png! 
> 3. SCM sends the close pipeline command
>  !screenshot-3.png! 
>  !screenshot-4.png! 
> 4. The leader removes the group
>  !screenshot-5.png! 
>     follower1 removes the group
>  !screenshot-6.png! 
> 5. follower2 then runs LeaderElection for about 12 hours, until 2020-04-18 
> 13:06:20.
>  !screenshot-7.png! 
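
The blocking described in step 1 can be reproduced in isolation. The sketch below is not Ozone code: the class name SingleThreadBlockingDemo and its method are hypothetical, and it assumes that streaming holds the read side of the container lock, which the description does not specify. It shows a close-pipeline task stuck in the queue of a single-thread executor behind a delete task that waits on writeLock().

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class SingleThreadBlockingDemo {

  /** Returns true if closePipeline stayed queued while the container lock was held. */
  public static boolean closePipelineStuckBehindDelete() throws Exception {
    ReentrantReadWriteLock containerLock = new ReentrantReadWriteLock();
    // A single thread handles every command, as in the reported setup.
    ExecutorService commandProcessor = Executors.newSingleThreadExecutor();

    // Assumption: streaming holds the read lock (here, on the calling thread).
    containerLock.readLock().lock();
    try {
      // deleteBlock needs the write lock and blocks until streaming finishes.
      commandProcessor.submit(() -> {
        containerLock.writeLock().lock();
        containerLock.writeLock().unlock();
      });
      // closePipeline does not need the lock, but it is queued behind deleteBlock.
      Future<?> closePipeline = commandProcessor.submit(() -> { });
      try {
        closePipeline.get(500, TimeUnit.MILLISECONDS);
        return false; // closePipeline ran, so it was not blocked
      } catch (TimeoutException e) {
        return true;  // closePipeline never started within the timeout
      }
    } finally {
      containerLock.readLock().unlock(); // streaming ends, both tasks can finish
      commandProcessor.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println("closePipeline blocked: " + closePipelineStuckBehindDelete());
  }
}
```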



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

