[
https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Konstantin Shvachko updated HDFS-11384:
---------------------------------------
Attachment: HDFS-11384.003.patch
Here is a relatively simple patch, which restricts the number of RPC calls from
Balancer to NN to 20 calls per second.
The limit of 20 calls per second is a constant for now. I chose it based on
metrics from a large cluster I was observing, so that Balancer calls cannot
saturate the NN's RPC queue. Let me know if people prefer it to be configurable.
On a large cluster with 200 (default) dispatcher threads and, e.g., 500
over-utilized DNs (sources), the initial 200 RPCs will be dispersed over
200 / 20 = 10 seconds. The remaining 300 RPCs should disperse organically as
they subsequently reuse the same 200 threads from the pool.
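
The dispersion arithmetic above can be sketched roughly as follows. This is
only an illustration, not the code in the attached patch: the class name,
method name, and parameters are all hypothetical, and the only numbers taken
from this comment are 20 calls per second, 200 dispatcher threads, and 500
sources.

```java
public class GetBlocksDispersal {
  /** Hypothetical constant mirroring the 20 calls/second limit described above. */
  static final int CALLS_PER_SECOND = 20;

  /**
   * Start delay for the task at position taskIndex in submission order.
   * Tasks that fit into the thread pool are staggered in one-second batches
   * of callsPerSecond. Later tasks get no extra delay, on the assumption
   * that they wait for a free thread and so are spread out anyway.
   */
  static long startDelayMillis(int taskIndex, int poolSize, int callsPerSecond) {
    if (taskIndex >= poolSize) {
      return 0L;
    }
    return (taskIndex / callsPerSecond) * 1000L;
  }

  public static void main(String[] args) {
    int poolSize = 200; // dispatcher threads
    int sources = 500;  // source DNs, one initial RPC each
    long maxDelay = 0L;
    for (int i = 0; i < sources; i++) {
      maxDelay = Math.max(maxDelay, startDelayMillis(i, poolSize, CALLS_PER_SECOND));
    }
    // The first 200 tasks occupy ten one-second slots (0 s through 9 s).
    System.out.println("last batch starts after " + maxDelay + " ms");
  }
}
```

With these numbers the last of the ten batches starts 9000 ms after the first,
so the initial 200 calls cover ten one-second slots.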
The patch includes a unit test that triggers the dispersion logic.
> Add option for balancer to disperse getBlocks calls to avoid NameNode's
> rpc.CallQueueLength spike
> -------------------------------------------------------------------------------------------------
>
> Key: HDFS-11384
> URL: https://issues.apache.org/jira/browse/HDFS-11384
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: balancer & mover
> Affects Versions: 2.7.3
> Reporter: yunjiong zhao
> Assignee: yunjiong zhao
> Attachments: balancer.day.png, balancer.week.png,
> HDFS-11384.001.patch, HDFS-11384.002.patch, HDFS-11384.003.patch
>
>
> Running the balancer on a Hadoop cluster with more than 3000 DataNodes
> causes the NameNode's rpc.CallQueueLength to spike. We observed that this
> could cause HBase cluster failure due to RegionServer WAL timeouts.