[ 
https://issues.apache.org/jira/browse/HDDS-3459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17088739#comment-17088739
 ] 

Nanda kumar commented on HDDS-3459:
-----------------------------------

Even though the issue seems different from HDDS-3451, both are caused by the
same underlying problem.
We have a single thread in the datanode to process the commands sent by SCM.
If one of the commands takes a long time to process or blocks indefinitely,
all the following commands in the queue are affected.
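
A minimal sketch of this head-of-line blocking, assuming nothing about the
actual datanode code: the class name, the command names and the latch that
stands in for the container lock are all made up for illustration.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical demo, not Ozone code: one worker thread drains the command queue.
class SingleThreadStallDemo {

  static String run() {
    ExecutorService commandProcessor = Executors.newSingleThreadExecutor();
    // Stands in for the container lock held by some other activity.
    CountDownLatch containerLock = new CountDownLatch(1);
    List<String> completed = new CopyOnWriteArrayList<>();

    // First command blocks until the "lock" is released.
    commandProcessor.execute(() -> {
      try {
        containerLock.await();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      completed.add("deleteBlocks");
    });
    // Second command is unrelated, but it is queued behind the first one.
    commandProcessor.execute(() -> {
      completed.add("closePipeline");
    });

    boolean stalled;
    try {
      Thread.sleep(200);
      // Nothing can have completed yet: the only worker is still blocked.
      stalled = completed.isEmpty();
      containerLock.countDown(); // release the "lock"
      commandProcessor.shutdown();
      commandProcessor.awaitTermination(5, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return "stalled=" + stalled + " order=" + completed;
  }

  public static void main(String[] args) {
    System.out.println(run());
  }
}
```

The close pipeline task only runs after the blocked task returns, no matter
how unrelated the two commands are.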

In HDDS-3451 we see that the close container command
({{XceiverServerRatis#submitRequest}}) is blocked indefinitely, causing the
issue. In HDDS-3459, the delete block command is blocked waiting for the lock
on the container, causing the issue.

[~yjxxtd], can you share the solution that you're working on?

If we make it an async call or have a thread pool for the command processing, 
we have to consider the order in which the commands are processed. Currently, 
we don't have to worry about the order as the commands are processed by a 
single thread.
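
One possible way to keep some ordering while adding parallelism is to route
commands by a key to a fixed set of single-threaded lanes. This is only a
sketch of the idea, not the solution being worked on: the class, the choice of
a numeric key (for example a container ID) and the command names are
assumptions.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not Ozone code: per-key ordering over several worker threads.
class KeyedCommandExecutor {
  private final ExecutorService[] lanes;

  KeyedCommandExecutor(int laneCount) {
    lanes = new ExecutorService[laneCount];
    for (int i = 0; i < laneCount; i++) {
      lanes[i] = Executors.newSingleThreadExecutor();
    }
  }

  // Commands with the same key always go to the same lane, so they keep their order.
  void submit(long key, Runnable command) {
    lanes[Math.floorMod(Long.hashCode(key), lanes.length)].execute(command);
  }

  void shutdownAndWait() {
    for (ExecutorService lane : lanes) {
      lane.shutdown();
    }
    for (ExecutorService lane : lanes) {
      try {
        lane.awaitTermination(5, TimeUnit.SECONDS);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
  }

  // Waits up to 5 seconds; returns true if the latch reached zero in time.
  static boolean awaitQuietly(CountDownLatch latch) {
    try {
      return latch.await(5, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }

  static String demo() {
    KeyedCommandExecutor executor = new KeyedCommandExecutor(2);
    CountDownLatch containerLock = new CountDownLatch(1); // stands in for a held lock
    CountDownLatch closeDone = new CountDownLatch(1);
    List<String> completed = new CopyOnWriteArrayList<>();

    // Key 1: a blocked command, followed by a second command for the same key.
    executor.submit(1L, () -> {
      awaitQuietly(containerLock);
      completed.add("delete#1");
    });
    executor.submit(1L, () -> {
      completed.add("close#1");
    });
    // Key 2 maps to the other lane, so it is not stuck behind key 1.
    executor.submit(2L, () -> {
      completed.add("closePipeline#2");
      closeDone.countDown();
    });

    boolean progressed = awaitQuietly(closeDone); // completes while key 1 is still blocked
    containerLock.countDown();
    executor.shutdownAndWait();
    return "progressed=" + progressed + " order=" + completed;
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

Ordering is kept only among commands that share a key. Commands with
different keys can overtake each other, and two keys that map to the same lane
can still block one another, so whether this is acceptable depends on which
SCM commands actually need to stay ordered.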

Related Jira: https://issues.apache.org/jira/browse/HDDS-3452

> Datanode use a single thread to process the command of scm
> ----------------------------------------------------------
>
>                 Key: HDDS-3459
>                 URL: https://issues.apache.org/jira/browse/HDDS-3459
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>            Reporter: runzhiwang
>            Assignee: runzhiwang
>            Priority: Major
>         Attachments: screenshot-1.png, screenshot-2.png, screenshot-3.png, 
> screenshot-4.png, screenshot-5.png, screenshot-6.png, screenshot-7.png
>
>
> *What's the problem?*
> The datanode uses a single [command-processor-thread 
> |https://github.com/apache/hadoop-ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeStateMachine.java#L482]
>  to process the commands from SCM. Sometimes the thread may be blocked for a 
> very long time; then none of the commands from SCM can be processed, which 
> may cause problems.
> For example, take a group with a leader, follower1 and follower2. The steps 
> to reproduce the problem are as follows:
> 1. Some datanodes crash, and follower2 begins streaming container data to 
> another datanode. The command-processor-thread is then blocked at 
> [cont.writeLock()|https://github.com/apache/hadoop-ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/commandhandler/DeleteBlocksCommandHandler.java#L130]
>  when it tries to delete a block, because streaming container data needs to 
> hold the RwLock of the container.
> 2. follower2 reports close pipeline.
> 3. SCM sends the close pipeline command.
> 4. The leader and follower1 remove the group, but follower2 cannot remove the 
> group because the command-processor-thread is blocked.
> 5. follower2 then runs LeaderElection for about 12 hours, until streaming 
> container data finishes and releases the RwLock; the leader and follower1 
> respond "group not found".
> You can see this in the following screenshots.
> 1. follower2 begins streaming container data at 2020-04-17 23:38:39
>  !screenshot-1.png! 
> 2. follower2 reports close pipeline at 2020-04-18 01:14:39
>  !screenshot-2.png! 
> 3. SCM sends the close pipeline command
>  !screenshot-3.png! 
>  !screenshot-4.png! 
> 4. The leader removes the group
>  !screenshot-5.png! 
>     follower1 removes the group
>  !screenshot-6.png! 
> 5. follower2 then runs LeaderElection for about 12 hours, until 2020-04-18 
> 13:06:20.
>  !screenshot-7.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
