[jira] [Updated] (HDFS-16594) Many RpcCalls are blocked for a while while Decommission works

JiangHua Zhu (Jira) Wed, 25 May 2022 18:53:06 -0700


     [ 
https://issues.apache.org/jira/browse/HDFS-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


JiangHua Zhu updated HDFS-16594:
--------------------------------
    Description: 
When there are some DataNodes that need to go offline, Decommission starts to 
work, and periodically checks the number of blocks remaining to be processed. 
By default, when checking more than 
50w(${dfs.namenode.decommission.blocks.per.interval}) blocks, the 
DatanodeAdminDefaultMonitor thread will sleep for a while before continuing.
If the number of blocks to be checked is very large, for example, the number of 
replicas managed by the DataNode reaches 90w or even 100w, during this period, 
the DatanodeAdminDefaultMonitor will continue to hold the FSNamesystemLock, 
which will block a lot of RpcCalls. Here are some logs:
 !image-2022-05-26-02-05-38-878.png! 

It can be seen that in the last inspection process, there were more than 100w 
blocks.
When the check is over, FSNamesystemLock is released and RpcCall starts working:
'
2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 36 on 
8021:Server@494] - Slow RPC : sendHeartbeat took 3488 milliseconds to process 
from client Call#5571549 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
xxxx.xxxx.xxxx.xxxx:35727
2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 135 on 
8021:Server@494] - Slow RPC : sendHeartbeat took 3472 milliseconds to process 
from client Call#36795561 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
xxxx.xxxx.xxxx.xxxx:37793
2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 108 on 
8021:Server@494] - Slow RPC : sendHeartbeat took 3445 milliseconds to process 
from client Call#5497586 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
xxxx.xxxx.xxxx.xxxx:23475
'
'
2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 33 on 
8021:Server@494] - Slow RPC : sendHeartbeat took 3435 milliseconds to process 
from client Call#6043903 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
xxxx.xxxx.xxxx.xxxx:34746
2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 139 on 
8021:Server@494] - Slow RPC : sendHeartbeat took 3436 milliseconds to process 
from client Call#274471 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
xxxx.xxxx.xxxx.xxxx:46419
2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 77 on 
8021:Server@494] - Slow RPC : sendHeartbeat took 3436 milliseconds to process 
from client Call#73375524 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
xxxx.xxxx.xxxx.xxxx:34241
'
Since RpcCall is waiting for a long time, RpcQueueTime+RpcProcessingTime will 
be longer than usual. A very large number of RpcCalls were affected during this 
time.

  was:
When there are some DataNodes that need to go offline, Decommission starts to 
work, and periodically checks the number of blocks remaining to be processed. 
By default, when checking more than 
50w(${dfs.namenode.decommission.blocks.per.interval}) blocks, the 
DatanodeAdminDefaultMonitor thread will sleep for a while before continuing.
If the number of blocks to be checked is very large, for example, the number of 
replicas managed by the DataNode reaches 90w or even 100w, during this period, 
the DatanodeAdminDefaultMonitor will continue to hold the FSNamesystemLock, 
which will block a lot of RpcCalls. Here are some logs:
 !image-2022-05-26-02-05-38-878.png! 

It can be seen that in the last inspection process, there were more than 100w 
blocks.
When the check is over, FSNamesystemLock is released and RpcCall starts working:
'
2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 36 on 
8021:Server@494] - Slow RPC : sendHeartbeat took 3488 milliseconds to process 
from client Call#5571549 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
10.196.145.92:35727
2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 135 on 
8021:Server@494] - Slow RPC : sendHeartbeat took 3472 milliseconds to process 
from client Call#36795561 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
10.196.99.152:37793
2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 108 on 
8021:Server@494] - Slow RPC : sendHeartbeat took 3445 milliseconds to process 
from client Call#5497586 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
10.196.146.56:23475
'
'
2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 33 on 
8021:Server@494] - Slow RPC : sendHeartbeat took 3435 milliseconds to process 
from client Call#6043903 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
10.196.82.106:34746
2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 139 on 
8021:Server@494] - Slow RPC : sendHeartbeat took 3436 milliseconds to process 
from client Call#274471 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
10.196.149.175:46419
2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 77 on 
8021:Server@494] - Slow RPC : sendHeartbeat took 3436 milliseconds to process 
from client Call#73375524 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
10.196.81.46:34241
'
Since RpcCall is waiting for a long time, RpcQueueTime+RpcProcessingTime will 
be longer than usual. A very large number of RpcCalls were affected during this 
time.


> Many RpcCalls are blocked for a while while Decommission works
> --------------------------------------------------------------
>
>                 Key: HDFS-16594
>                 URL: https://issues.apache.org/jira/browse/HDFS-16594
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>    Affects Versions: 2.9.2
>            Reporter: JiangHua Zhu
>            Assignee: JiangHua Zhu
>            Priority: Major
>         Attachments: image-2022-05-26-02-05-38-878.png
>
>
> When there are some DataNodes that need to go offline, Decommission starts to 
> work, and periodically checks the number of blocks remaining to be processed. 
> By default, when checking more than 
> 50w(${dfs.namenode.decommission.blocks.per.interval}) blocks, the 
> DatanodeAdminDefaultMonitor thread will sleep for a while before continuing.
> If the number of blocks to be checked is very large, for example, the number 
> of replicas managed by the DataNode reaches 90w or even 100w, during this 
> period, the DatanodeAdminDefaultMonitor will continue to hold the 
> FSNamesystemLock, which will block a lot of RpcCalls. Here are some logs:
>  !image-2022-05-26-02-05-38-878.png! 
> It can be seen that in the last inspection process, there were more than 100w 
> blocks.
> When the check is over, FSNamesystemLock is released and RpcCall starts 
> working:
> '
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 36 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3488 milliseconds to process 
> from client Call#5571549 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> xxxx.xxxx.xxxx.xxxx:35727
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 135 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3472 milliseconds to process 
> from client Call#36795561 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> xxxx.xxxx.xxxx.xxxx:37793
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 108 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3445 milliseconds to process 
> from client Call#5497586 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> xxxx.xxxx.xxxx.xxxx:23475
> '
> '
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 33 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3435 milliseconds to process 
> from client Call#6043903 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> xxxx.xxxx.xxxx.xxxx:34746
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 139 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3436 milliseconds to process 
> from client Call#274471 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> xxxx.xxxx.xxxx.xxxx:46419
> 2022-05-25 13:46:09,712 [4831384907] - WARN  [IPC Server handler 77 on 
> 8021:Server@494] - Slow RPC : sendHeartbeat took 3436 milliseconds to process 
> from client Call#73375524 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> xxxx.xxxx.xxxx.xxxx:34241
> '
> Since RpcCall is waiting for a long time, RpcQueueTime+RpcProcessingTime will 
> be longer than usual. A very large number of RpcCalls were affected during 
> this time.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (HDFS-16594) Many RpcCalls are blocked for a while while Decommission works

Reply via email to