[
https://issues.apache.org/jira/browse/HDDS-3459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sumit Agrawal resolved HDDS-3459.
---------------------------------
Resolution: Duplicate
Across multiple issues, the DN now has separate queues and thread pools to
handle commands, so this problem is already fixed. A few of those issues are:
[HDDS-9122. Make the Datanode CloseContainerCommandHandler async by queuing
commands in an executor|https://github.com/apache/ozone/pull/5277]
HDDS-5360. DN failed to process all delete block commands in one heartbeat
interva
...
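As a rough illustration of the approach described above (a separate queue and
thread pool per command type), the sketch below shows why a blocked delete
block handler no longer stalls a close pipeline command. It is not Ozone code:
the class PerTypeCommandDispatcher, the CommandType enum, and the latch that
stands in for the container RwLock are all hypothetical names chosen for this
example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch, not Ozone code: one single-thread executor per command type. */
public class PerTypeCommandDispatcher {
    // Illustrative command types only; the real SCM command set is larger.
    enum CommandType { DELETE_BLOCKS, CLOSE_PIPELINE }

    private final Map<CommandType, ExecutorService> executors = new ConcurrentHashMap<>();

    /** Queue a command on the executor owned by its type, creating it on first use. */
    public void dispatch(CommandType type, Runnable handler) {
        executors.computeIfAbsent(type, t -> Executors.newSingleThreadExecutor(r -> {
            Thread thread = new Thread(r, "cmd-" + t);
            thread.setDaemon(true);
            return thread;
        })).execute(handler);
    }

    /** Stop all per-type executors and interrupt any handler that is still running. */
    public void shutdown() {
        executors.values().forEach(ExecutorService::shutdownNow);
    }

    /** Returns true if the close pipeline handler ran while delete blocks was still blocked. */
    public static boolean closePipelineRunsWhileDeleteIsBlocked() {
        PerTypeCommandDispatcher dispatcher = new PerTypeCommandDispatcher();
        // Assumption: this latch stands in for the container RwLock held during streaming.
        CountDownLatch containerLockReleased = new CountDownLatch(1);
        CountDownLatch pipelineClosed = new CountDownLatch(1);
        try {
            // The delete blocks handler waits for the "lock", as in step 1 of the report.
            dispatcher.dispatch(CommandType.DELETE_BLOCKS, () -> {
                try {
                    containerLockReleased.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            // The close pipeline handler has its own thread, so it is not queued behind it.
            dispatcher.dispatch(CommandType.CLOSE_PIPELINE, pipelineClosed::countDown);
            return pipelineClosed.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            containerLockReleased.countDown();
            dispatcher.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("close pipeline processed: " + closePipelineRunsWhileDeleteIsBlocked());
    }
}
```

With a single shared executor, the close pipeline task would sit in the queue
behind the blocked delete blocks task, which is the behaviour reported in the
original issue quoted below.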
> Datanode use a single thread to process the command of scm
> ----------------------------------------------------------
>
> Key: HDDS-3459
> URL: https://issues.apache.org/jira/browse/HDDS-3459
> Project: Apache Ozone
> Issue Type: Bug
> Components: Ozone Datanode
> Reporter: runzhiwang
> Assignee: runzhiwang
> Priority: Major
> Labels: performance
> Attachments: screenshot-1.png, screenshot-2.png, screenshot-3.png,
> screenshot-4.png, screenshot-5.png, screenshot-6.png, screenshot-7.png
>
>
> *What's the problem?*
> The Datanode uses a single [command-processor-thread
> |https://github.com/apache/hadoop-ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeStateMachine.java#L482]
> to process commands from SCM. Sometimes the thread may be blocked for a very
> long time, and then no command from SCM can be processed, which may cause
> problems.
> For example, suppose a group has a leader, follower1, and follower2. The
> steps to reproduce the problem are as follows:
> 1. Some datanodes crash, and follower2 begins streaming container data to
> another datanode. The command-processor-thread is then blocked at
> [cont.writeLock()|https://github.com/apache/hadoop-ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/commandhandler/DeleteBlocksCommandHandler.java#L130]
> when it tries to delete a block, because streaming container data holds
> the RwLock of the container.
> 2. follower2 reports close pipeline.
> 3. SCM sends the close pipeline command.
> 4. leader and follower1 remove the group, but follower2 cannot remove the
> group because the command-processor-thread is blocked.
> 5. follower2 then runs LeaderElection for about 12 hours, until streaming
> container data finishes and releases the RwLock. Meanwhile, leader and
> follower1 respond with group not found.
> You can see this in the following screenshots.
> 1. follower2 begins streaming container data at 2020-04-17 23:38:39
> !screenshot-1.png!
> 2. follower2 reports close pipeline at 2020-04-18 01:14:39
> !screenshot-2.png!
> 3. SCM sends the close pipeline command
> !screenshot-3.png!
> !screenshot-4.png!
> 4. leader removes the group
> !screenshot-5.png!
> follower1 removes the group
> !screenshot-6.png!
> 5. follower2 then runs LeaderElection for about 12 hours, until 2020-04-18
> 13:06:20.
> !screenshot-7.png!