[
https://issues.apache.org/jira/browse/HDFS-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542230#comment-17542230
]
Stephen O'Donnell commented on HDFS-16594:
------------------------------------------
Some people have reported good results with the DatanodeAdminBackoffMonitor.
What about giving it a try to see if the locking is better?
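For reference, switching monitors should only need a config change, assuming the cluster runs a release that includes the pluggable monitor (HDFS-14854, Hadoop 3.3.0+) and the `dfs.namenode.decommission.monitor.class` property; the snippet below is a sketch of that hdfs-site.xml setting:

```xml
<!-- Sketch: select the backoff-based decommission monitor instead of the
     default one. Requires a release where DatanodeAdminBackoffMonitor and
     the dfs.namenode.decommission.monitor.class property exist. -->
<property>
  <name>dfs.namenode.decommission.monitor.class</name>
  <value>org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminBackoffMonitor</value>
</property>
```

The backoff monitor is designed to release and reacquire the namesystem lock more frequently, which is why it may help with the lock-hold times reported here.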
> Many RpcCalls are blocked for a while while Decommission works
> --------------------------------------------------------------
>
> Key: HDFS-16594
> URL: https://issues.apache.org/jira/browse/HDFS-16594
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: namenode
> Affects Versions: 2.9.2
> Reporter: JiangHua Zhu
> Assignee: JiangHua Zhu
> Priority: Major
> Attachments: image-2022-05-26-02-05-38-878.png
>
>
> When there are some DataNodes that need to go offline, Decommission starts to
> work, and it periodically checks the number of blocks remaining to be processed.
> By default, after checking more than 500,000
> (dfs.namenode.decommission.blocks.per.interval) blocks, the
> DatanodeAdminDefaultMonitor thread will sleep for a while before continuing.
> If the number of blocks to be checked is very large, for example, when the number
> of replicas managed by a DataNode reaches 900,000 or even 1,000,000, then during
> this period the DatanodeAdminDefaultMonitor will continue to hold the
> FSNamesystemLock, which blocks a lot of RpcCalls. Here are some logs:
> !image-2022-05-26-02-05-38-878.png!
> It can be seen that in the last inspection pass, there were more than 1,000,000
> blocks.
> When the check is over, FSNamesystemLock is released and RpcCall starts
> working:
> 2022-05-25 13:46:09,712 [4831384907] - WARN [IPC Server handler 36 on
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3488 milliseconds to process
> from client Call#5571549 Retry#0
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from
> 10.196.145.92:35727
> 2022-05-25 13:46:09,712 [4831384907] - WARN [IPC Server handler 135 on
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3472 milliseconds to process
> from client Call#36795561 Retry#0
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from
> 10.196.99.152:37793
> 2022-05-25 13:46:09,712 [4831384907] - WARN [IPC Server handler 108 on
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3445 milliseconds to process
> from client Call#5497586 Retry#0
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from
> 10.196.146.56:23475
> 2022-05-25 13:46:09,712 [4831384907] - WARN [IPC Server handler 33 on
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3435 milliseconds to process
> from client Call#6043903 Retry#0
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from
> 10.196.82.106:34746
> 2022-05-25 13:46:09,712 [4831384907] - WARN [IPC Server handler 139 on
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3436 milliseconds to process
> from client Call#274471 Retry#0
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from
> 10.196.149.175:46419
> 2022-05-25 13:46:09,712 [4831384907] - WARN [IPC Server handler 77 on
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3436 milliseconds to process
> from client Call#73375524 Retry#0
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from
> 10.196.81.46:34241
> Because these RpcCalls wait for so long, RpcQueueTime+RpcProcessingTime ends up
> much longer than usual, and a very large number of RpcCalls were affected during
> this time.
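The lock-holding pattern described above can be sketched as follows. This is an illustrative simulation of the mechanism, not the actual Hadoop source: the class, method, and constant names are invented, and it models only the essential behavior, namely that the monitor holds the namesystem write lock while it checks a batch of blocks, so RPC handlers queue behind it and a larger batch means a longer stall.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

/**
 * Illustrative sketch (NOT the Hadoop source) of how a decommission
 * monitor batches its scan: it holds the namesystem write lock while
 * checking up to dfs.namenode.decommission.blocks.per.interval blocks,
 * then releases the lock and sleeps before the next batch. While the
 * write lock is held, RPC handlers (e.g. sendHeartbeat) are blocked.
 */
public class DecommissionScanSketch {
    // Default value of dfs.namenode.decommission.blocks.per.interval.
    static final int BLOCKS_PER_INTERVAL = 500_000;

    /**
     * Scans totalBlocks in batches and returns how many times the lock
     * was acquired. Each acquisition corresponds to one stall window
     * for all RPCs contending on the lock.
     */
    static int scan(int totalBlocks, ReentrantReadWriteLock fsLock) {
        int acquisitions = 0;
        int checked = 0;
        while (checked < totalBlocks) {
            fsLock.writeLock().lock();   // RPC handlers block here
            acquisitions++;
            try {
                int batchEnd = Math.min(checked + BLOCKS_PER_INTERVAL, totalBlocks);
                for (; checked < batchEnd; checked++) {
                    // check the replication state of one block (elided);
                    // the lock is held for the entire batch, so the stall
                    // grows with the batch size.
                }
            } finally {
                fsLock.writeLock().unlock(); // queued RPCs can proceed
            }
            // sleep between batches (elided in this sketch)
        }
        return acquisitions;
    }

    public static void main(String[] args) {
        // 1,000,000 blocks => two lock acquisitions of 500,000 blocks each;
        // each one is a window during which heartbeats and other RPCs wait.
        System.out.println(scan(1_000_000, new ReentrantReadWriteLock()));
    }
}
```

In the scenario from this report, a DataNode with around 1,000,000 replicas means each stall window covers hundreds of thousands of block checks, which matches the multi-second Slow RPC warnings in the logs above.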
--
This message was sent by Atlassian Jira
(v8.20.7#820007)