[ 
https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15974123#comment-15974123
 ] 

Zhe Zhang commented on HDFS-11384:
----------------------------------

Thanks [~shv], the main logic LGTM. I could not reproduce the reported unit 
test failures either.

+1 pending a few final comments:
# IIUC, {{BALANCER_NUM_RPC_PER_SEC}} is a best-effort throttling target 
rather than a guaranteed limit. E.g. it looks possible for 
{{Thread.sleep(delay)}} to be interrupted and {{getBlockList}} to be retried in 
the while loop. Or the entire {{dispatchBlocks}} call in a thread could die 
before {{delay}} has elapsed, and then another {{future\[j\]}} would be issued 
without the delay. Assuming this understanding is correct, I think this is the 
right way to handle it: these cases are rare and not worth optimizing for. But 
can we update the documentation for {{BALANCER_NUM_RPC_PER_SEC}} to say that 
it is best-effort?
# {{private void dispatchBlocks(long delay) {}} doesn't explain {{delay}} in 
its Javadoc.
# What does {{testBalancerRPCDelay}} verify? It is not checking the number of 
RPC calls.
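
To illustrate point 1, here is a minimal sketch of why a sleep-based delay is only best-effort. It is not the code from the patch: the class {{BestEffortDelay}} and the method {{delayBeforeCall}} are hypothetical names used only for this example, and the "RPC" that would follow the delay is left out. The point it shows is that an interrupted {{Thread.sleep}} returns early, so whatever call comes next can be issued sooner than the configured delay.

```java
import java.util.concurrent.TimeUnit;

public class BestEffortDelay {

    /**
     * Sleeps for up to delayMs milliseconds before a (hypothetical) RPC.
     * Returns true if the full delay elapsed, false if the sleep was
     * interrupted and the caller would proceed early.
     */
    public static boolean delayBeforeCall(long delayMs) {
        try {
            Thread.sleep(delayMs);
            return true;
        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers can still observe it.
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        // Simulate an interruption arriving before the delay completes.
        Thread.currentThread().interrupt();
        boolean fullDelay = delayBeforeCall(5_000);
        long elapsedMs =
            TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        System.out.println("full delay honored: " + fullDelay
            + ", elapsed ms: " + elapsedMs);
    }
}
```

In this sketch the delay is skipped entirely when the thread is already interrupted, which is the kind of rare case where the actual RPC rate could briefly exceed the configured target.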

> Add option for balancer to disperse getBlocks calls to avoid NameNode's 
> rpc.CallQueueLength spike
> -------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-11384
>                 URL: https://issues.apache.org/jira/browse/HDFS-11384
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: balancer & mover
>    Affects Versions: 2.7.3
>            Reporter: yunjiong zhao
>            Assignee: yunjiong zhao
>         Attachments: balancer.day.png, balancer.week.png, 
> HDFS-11384.001.patch, HDFS-11384.002.patch, HDFS-11384.003.patch, 
> HDFS-11384.004.patch, HDFS-11384.005.patch
>
>
> Running the balancer on a Hadoop cluster with more than 3000 DataNodes 
> causes the NameNode's rpc.CallQueueLength to spike. We observed that this 
> could cause HBase cluster failure due to RegionServer WAL timeouts.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
