[ https://issues.apache.org/jira/browse/HDFS-13783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16891964#comment-16891964 ]
Erik Krogen edited comment on HDFS-13783 at 7/24/19 4:06 PM: ------------------------------------------------------------- Hi [~zhangchen], things are looking good! {quote}The failed test is not related with this patch, all these tests can pass in local {quote} I don't believe you :) Take a look at the output for {{TestHdfsConfigFields}}: {code:java} missing in hdfs-default.xml Entries: dfs.balancer.service.interval.seconds dfs.balancer.service.retry.on.exception expected:<0> but was:<2> {code} Please also tackle the outstanding checkstyle issues (besides line length warnings in {{DFSConfigKeys}}). Moving on to my review: # The variable {{running}} seems to be intended to be called from a multi-threaded context, can you please make it {{volatile}}? # In the new {{run()}} method, I think we could make the flow of control more clear by avoiding all of the other logic when {{runAsService}} is false: {code:java} if (!p.getRunAsService()) { return doBalance(namenodes, p, conf); } {code} Then, the rest of the method can avoid logic handling the non-service scenario. The {{Cli#run()}} method will handle exception printing and such. # You have a typo: {{scheduleInteral}} -> {{scheduleInterval}} # I think it's misleading that {{tried = 1}} before any attempts have been made. Can we make it so that {{tried = 0}} at the beginning? We can change the check in the catch block to be {{tried >= retryOnException}} or {{++tried > retryOnException}} # Please use the slf4j style loggers, for example, instead of: {code:java} LOG.warn("Encounter exception while do balance work: " + e + " already tried " + tried + " times"); {code} use: {code:java} LOG.warn("Encounter exception while do balance work. Already tried {} times", tried, e); {code} # For {{DFS_BALANCER_SERVICE_INTERVAL_SECONDS}}, now that we use {{Configuration#getTimeDuration()}}, we don't need to specify units as the user/administrator can do so themselves. I think we can rename it to {{DFS_BALANCER_SERVICE_INTERVAL}} / {{dfs.balancer.service.interval}} # Can we rename {{DFS_BALANCER_SERVICE_RETRY_ON_EXCEPTION}} to {{DFS_BALANCER_RETRIES_ON_EXCEPTION}} (and the same change in the config key)? "retry on exception", to me, sounds like it would be a boolean flag of whether to retry at all. Retries makes it sound more like a number. # When you load {{scheduleInterval}}, you get it in seconds, but when you use it (for {{Thread#sleep()}}), it needs to be milliseconds. Why not just load it as milliseconds from the beginning: {code:java} scheduleInterval = conf.getTimeDuration( DFSConfigKeys.DFS_BALANCER_SERVICE_INTERVAL_SECONDS_KEY, DFSConfigKeys.DFS_BALANCER_SERVICE_INTERVAL_SECONDS_DEFAULT, TimeUnit.SECONDS, TimeUnit.MILLISECONDS); {code} # {{Balancer}} has a convenient {{time2Str(long)}} method, can we use that for the "Finished one round, will wait for ..." log statement? # It seems that {{tried}} is never reset. I think we should reset it back to 0 whenever we have a successful balance; it doesn't seem right for the service to have a limited number of retries over its entire lifetime. # It looks like {{TestBalancerService#removeExtraBlocks()}} is not used anywhere, can we remove it? # I think the {{setupCluster}} / {{addOneDataNode}} / {{newBalancerService}} methods should all be private? # Why do we have {{Thread.sleep(500)}} on {{TestBalancerService}} L71? If you're waiting for the NN to become active, we should use {{cluster.waitActive(0)}} to avoid test flakiness. I'm also worried about the {{Thread.sleep(5000)}} call on L139. Is there any thing we can replace this with which will be less flaky? # I think we should be using {{finally}} blocks to close the {{cluster}} you create in the tests. # {{args}} created on {{TestBalancerService}} L159 is unused. # For {{Thread.sleep(10)}} on {{TestBalancerService}} L170, the comment says 10 seconds, but the code says 10 milliseconds. was (Author: xkrogen): Hi [~zhangchen], things are looking good! {quote}The failed test is not related with this patch, all these tests can pass in local {quote} I don't believe you :) Take a look at the output for {{TestHdfsConfigFields}}: {code:java} missing in hdfs-default.xml Entries: dfs.balancer.service.interval.seconds dfs.balancer.service.retry.on.exception expected:<0> but was:<2> {code} Please also tackle the outstanding checkstyle issues (besides line length warnings in {{DFSConfigKeys}}). Moving on to my review: * The variable {{running}} seems to be intended to be called from a multi-threaded context, can you please make it {{volatile}}? * In the new {{run()}} method, I think we could make the flow of control more clear by avoiding all of the other logic when {{runAsService}} is false: {code:java} if (!p.getRunAsService()) { return doBalance(namenodes, p, conf); } {code} Then, the rest of the method can avoid logic handling the non-service scenario. The {{Cli#run()}} method will handle exception printing and such. * You have a typo: {{scheduleInteral}} -> {{scheduleInterval}} * I think it's misleading that {{tried = 1}} before any attempts have been made. Can we make it so that {{tried = 0}} at the beginning? We can change the check in the catch block to be {{tried >= retryOnException}} or {{++tried > retryOnException}} * Please use the slf4j style loggers, for example, instead of: {code:java} LOG.warn("Encounter exception while do balance work: " + e + " already tried " + tried + " times"); {code} use: {code:java} LOG.warn("Encounter exception while do balance work. Already tried {} times", tried, e); {code} * For {{DFS_BALANCER_SERVICE_INTERVAL_SECONDS}}, now that we use {{Configuration#getTimeDuration()}}, we don't need to specify units as the user/administrator can do so themselves. I think we can rename it to {{DFS_BALANCER_SERVICE_INTERVAL}} / {{dfs.balancer.service.interval}} * Can we rename {{DFS_BALANCER_SERVICE_RETRY_ON_EXCEPTION}} to {{DFS_BALANCER_RETRIES_ON_EXCEPTION}} (and the same change in the config key)? "retry on exception", to me, sounds like it would be a boolean flag of whether to retry at all. Retries makes it sound more like a number. * When you load {{scheduleInterval}}, you get it in seconds, but when you use it (for {{Thread#sleep()}}), it needs to be milliseconds. Why not just load it as milliseconds from the beginning: {code:java} scheduleInterval = conf.getTimeDuration( DFSConfigKeys.DFS_BALANCER_SERVICE_INTERVAL_SECONDS_KEY, DFSConfigKeys.DFS_BALANCER_SERVICE_INTERVAL_SECONDS_DEFAULT, TimeUnit.SECONDS, TimeUnit.MILLISECONDS); {code} * {{Balancer}} has a convenient {{time2Str(long)}} method, can we use that for the "Finished one round, will wait for ..." log statement? * It seems that {{tried}} is never reset. I think we should reset it back to 0 whenever we have a successful balance; it doesn't seem right for the service to have a limited number of retries over its entire lifetime. * It looks like {{TestBalancerService#removeExtraBlocks()}} is not used anywhere, can we remove it? * I think the {{setupCluster}} / {{addOneDataNode}} / {{newBalancerService}} methods should all be private? * Why do we have {{Thread.sleep(500)}} on {{TestBalancerService}} L71? If you're waiting for the NN to become active, we should use {{cluster.waitActive(0)}} to avoid test flakiness. I'm also worried about the {{Thread.sleep(5000)}} call on L139. Is there any thing we can replace this with which will be less flaky? * I think we should be using {{finally}} blocks to close the {{cluster}} you create in the tests. * {{args}} created on {{TestBalancerService}} L159 is unused. * For {{Thread.sleep(10)}} on {{TestBalancerService}} L170, the comment says 10 seconds, but the code says 10 milliseconds. > Balancer: make balancer to be a long service process for easy to monitor it. > ---------------------------------------------------------------------------- > > Key: HDFS-13783 > URL: https://issues.apache.org/jira/browse/HDFS-13783 > Project: Hadoop HDFS > Issue Type: New Feature > Components: balancer & mover > Reporter: maobaolong > Assignee: Chen Zhang > Priority: Major > Attachments: HDFS-13783-001.patch, HDFS-13783-002.patch, > HDFS-13783.003.patch > > > If we have a long service process of balancer, like namenode, datanode, we > can get metrics of balancer, the metrics can tell us the status of balancer, > the amount of block it has moved, > We can get or set the balance plan by the balancer webUI. So many things we > can do if we have a long balancer service process. > So, shall we start to plan the new Balancer? Hope this feature can enter the > next release of hadoop. -- This message was sent by Atlassian JIRA (v7.6.14#76016) --------------------------------------------------------------------- To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org