[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15933354#comment-15933354 ]
yunjiong zhao edited comment on HDFS-11384 at 3/20/17 7:26 PM:
---------------------------------------------------------------

Thanks [~shv] for the review.

The Balancer allows only one thread to issue getBlocks() at any given time, and only when dfs.balancer.getBlocks.interval.millis is set to a non-zero value. Otherwise this patch doesn't change anything, so there is actually only one change.

If we use wait, it releases the lock, so we can't make sure that only one thread calls getBlocks() at a time.

By default, this patch doesn't change anything. So if you need to run the Balancer aggressively, don't set dfs.balancer.getBlocks.interval.millis.

{quote}
Can we add some heuristics so that the Balancer could adjust by itself instead of adding the configuration parameter
{quote}

I thought about this before. The best way I could think of is to add a new function in IPC that lets clients get the CallQueueLength; if CallQueueLength is too high, block getBlocks() until the CallQueueLength becomes normal again.
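To illustrate the locking point above, here is a minimal sketch of the idea, not the code from the attached patches. The class name GetBlocksThrottle, its fields, and the spacing of calls by the configured interval are assumptions made for illustration. The part that matters is that Thread.sleep() keeps the monitor while waiting, whereas Object.wait() would release it and let a second thread reach getBlocks().

```java
import java.util.concurrent.Callable;

/**
 * Hypothetical sketch only; names and structure are assumptions,
 * not taken from HDFS-11384.001.patch or HDFS-11384.002.patch.
 */
public class GetBlocksThrottle {
    // Stand-in for the value of dfs.balancer.getBlocks.interval.millis.
    private final long intervalMillis;
    private final Object lock = new Object();
    private long lastCallMillis = 0L; // guarded by lock

    public GetBlocksThrottle(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    /** Runs the given getBlocks() call, serialized only when the interval is non-zero. */
    public <T> T call(Callable<T> getBlocks) throws Exception {
        if (intervalMillis <= 0) {
            // Default: no lock and no delay, behaviour is unchanged.
            return getBlocks.call();
        }
        synchronized (lock) {
            long waitMillis = lastCallMillis + intervalMillis - System.currentTimeMillis();
            if (waitMillis > 0) {
                // sleep() does not release the monitor, so no other thread
                // can enter this block; lock.wait(waitMillis) would release it.
                Thread.sleep(waitMillis);
            }
            try {
                return getBlocks.call();
            } finally {
                lastCallMillis = System.currentTimeMillis();
            }
        }
    }
}
```

A dispatcher thread would wrap its RPC in something like throttle.call(() -> namenode.getBlocks(...)), where the namenode handle is whatever proxy the caller already holds. With the interval left at zero the wrapper is a plain pass-through, which matches the "by default, this patch doesn't change anything" behaviour described above.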
> Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
> -------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-11384
>                 URL: https://issues.apache.org/jira/browse/HDFS-11384
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: balancer & mover
>    Affects Versions: 2.7.3
>            Reporter: yunjiong zhao
>            Assignee: yunjiong zhao
>         Attachments: balancer.day.png, balancer.week.png, HDFS-11384.001.patch, HDFS-11384.002.patch
>
>
> Running the balancer on a Hadoop cluster with more than 3000 DataNodes causes the NameNode's rpc.CallQueueLength to spike. We observed that this can cause HBase cluster failure due to RegionServer WAL timeouts.