[
https://issues.apache.org/jira/browse/HDFS-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542230#comment-17542230
]
Stephen O'Donnell commented on HDFS-16594:
------------------------------------------
Some people have reported good results with the DatanodeAdminBackoffMonitor.
What about giving it a try to see if the locking is better?
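For reference, switching monitors should only need a config change, assuming the cluster runs a release that includes the pluggable monitor (HDFS-14854, Hadoop 3.3.0+) and the `dfs.namenode.decommission.monitor.class` property; the snippet below is a sketch of that hdfs-site.xml setting:

```xml
<!-- Sketch: select the backoff-based decommission monitor instead of the
     default one. Requires a release where DatanodeAdminBackoffMonitor and
     the dfs.namenode.decommission.monitor.class property exist. -->
<property>
  <name>dfs.namenode.decommission.monitor.class</name>
  <value>org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminBackoffMonitor</value>
</property>
```

The backoff monitor is designed to release and reacquire the namesystem lock more frequently, which is why it may help with the lock-hold times reported here.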
> Many RpcCalls are blocked for a while while Decommission works
> --------------------------------------------------------------
>
> Key: HDFS-16594
> URL: https://issues.apache.org/jira/browse/HDFS-16594
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: namenode
> Affects Versions: 2.9.2
> Reporter: JiangHua Zhu
> Assignee: JiangHua Zhu
> Priority: Major
> Attachments: image-2022-05-26-02-05-38-878.png
>
>
> When there are some DataNodes that need to go offline, Decommission starts to
> work, and it periodically checks the number of blocks remaining to be processed.
> By default, after checking more than 500,000
> (dfs.namenode.decommission.blocks.per.interval) blocks, the
> DatanodeAdminDefaultMonitor thread will sleep for a while before continuing.
> If the number of blocks to be checked is very large, for example, when the number
> of replicas managed by a DataNode reaches 900,000 or even 1,000,000, then during
> this period the DatanodeAdminDefaultMonitor will continue to hold the
> FSNamesystemLock, which blocks a lot of RpcCalls. Here are some logs:
> !image-2022-05-26-02-05-38-878.png!
> It can be seen that in the last inspection pass, there were more than 1,000,000
> blocks.
> When the check is over, FSNamesystemLock is released and RpcCall starts
> working:
> 2022-05-25 13:46:09,712 [4831384907] - WARN [IPC Server handler 36 on
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3488 milliseconds to process
> from client Call#5571549 Retry#0
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from
> 10.196.145.92:35727
> 2022-05-25 13:46:09,712 [4831384907] - WARN [IPC Server handler 135 on
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3472 milliseconds to process
> from client Call#36795561 Retry#0
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from
> 10.196.99.152:37793
> 2022-05-25 13:46:09,712 [4831384907] - WARN [IPC Server handler 108 on
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3445 milliseconds to process
> from client Call#5497586 Retry#0
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from
> 10.196.146.56:23475
> 2022-05-25 13:46:09,712 [4831384907] - WARN [IPC Server handler 33 on
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3435 milliseconds to process
> from client Call#6043903 Retry#0
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from
> 10.196.82.106:34746
> 2022-05-25 13:46:09,712 [4831384907] - WARN [IPC Server handler 139 on
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3436 milliseconds to process
> from client Call#274471 Retry#0
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from
> 10.196.149.175:46419
> 2022-05-25 13:46:09,712 [4831384907] - WARN [IPC Server handler 77 on
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3436 milliseconds to process
> from client Call#73375524 Retry#0
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from
> 10.196.81.46:34241
> Because these RpcCalls wait for so long, RpcQueueTime+RpcProcessingTime ends up
> much longer than usual, and a very large number of RpcCalls were affected during
> this time.
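The lock-holding pattern described above can be sketched as follows. This is an illustrative simulation of the mechanism, not the actual Hadoop source: the class, method, and constant names are invented, and it models only the essential behavior, namely that the monitor holds the namesystem write lock while it checks a batch of blocks, so RPC handlers queue behind it and a larger batch means a longer stall.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

/**
 * Illustrative sketch (NOT the Hadoop source) of how a decommission
 * monitor batches its scan: it holds the namesystem write lock while
 * checking up to dfs.namenode.decommission.blocks.per.interval blocks,
 * then releases the lock and sleeps before the next batch. While the
 * write lock is held, RPC handlers (e.g. sendHeartbeat) are blocked.
 */
public class DecommissionScanSketch {
    // Default value of dfs.namenode.decommission.blocks.per.interval.
    static final int BLOCKS_PER_INTERVAL = 500_000;

    /**
     * Scans totalBlocks in batches and returns how many times the lock
     * was acquired. Each acquisition corresponds to one stall window
     * for all RPCs contending on the lock.
     */
    static int scan(int totalBlocks, ReentrantReadWriteLock fsLock) {
        int acquisitions = 0;
        int checked = 0;
        while (checked < totalBlocks) {
            fsLock.writeLock().lock();   // RPC handlers block here
            acquisitions++;
            try {
                int batchEnd = Math.min(checked + BLOCKS_PER_INTERVAL, totalBlocks);
                for (; checked < batchEnd; checked++) {
                    // check the replication state of one block (elided);
                    // the lock is held for the entire batch, so the stall
                    // grows with the batch size.
                }
            } finally {
                fsLock.writeLock().unlock(); // queued RPCs can proceed
            }
            // sleep between batches (elided in this sketch)
        }
        return acquisitions;
    }

    public static void main(String[] args) {
        // 1,000,000 blocks => two lock acquisitions of 500,000 blocks each;
        // each one is a window during which heartbeats and other RPCs wait.
        System.out.println(scan(1_000_000, new ReentrantReadWriteLock()));
    }
}
```

In the scenario from this report, a DataNode with around 1,000,000 replicas means each stall window covers hundreds of thousands of block checks, which matches the multi-second Slow RPC warnings in the logs above.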
--
This message was sent by Atlassian Jira
(v8.20.7#820007)