[jira] [Comment Edited] (HDFS-13783) Balancer: make balancer to be a long service process for easy to monitor it.

Erik Krogen (JIRA) Wed, 24 Jul 2019 09:08:00 -0700


    [ 
https://issues.apache.org/jira/browse/HDFS-13783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16891964#comment-16891964
 ]


Erik Krogen edited comment on HDFS-13783 at 7/24/19 4:06 PM:
-------------------------------------------------------------

Hi [~zhangchen], things are looking good!
{quote}The failed test is not related with this patch, all these tests can pass 
in local
{quote}
I don't believe you :) Take a look at the output for {{TestHdfsConfigFields}}:
{code:java}
missing in hdfs-default.xml Entries:   dfs.balancer.service.interval.seconds  
dfs.balancer.service.retry.on.exception expected:<0> but was:<2>
{code}
Please also tackle the outstanding checkstyle issues (besides line length 
warnings in {{DFSConfigKeys}}).

Moving on to my review:
# The variable {{running}} seems to be intended to be called from a 
multi-threaded context, can you please make it {{volatile}}?
# In the new {{run()}} method, I think we could make the flow of control more 
clear by avoiding all of the other logic when {{runAsService}} is false:
{code:java}
if (!p.getRunAsService()) {
  return doBalance(namenodes, p, conf);
}
{code}
Then, the rest of the method can avoid logic handling the non-service scenario. 
The {{Cli#run()}} method will handle exception printing and such.
# You have a typo: {{scheduleInteral}} -> {{scheduleInterval}}
# I think it's misleading that {{tried = 1}} before any attempts have been 
made. Can we make it so that {{tried = 0}} at the beginning? We can change the 
check in the catch block to be {{tried >= retryOnException}} or {{++tried > 
retryOnException}}
# Please use the slf4j style loggers, for example, instead of:
{code:java}
        LOG.warn("Encounter exception while do balance work: " + e
            + " already tried " + tried + " times");
{code}
use:
{code:java}
        LOG.warn("Encounter exception while do balance work. Already tried {} 
times", tried, e);
{code}
# For {{DFS_BALANCER_SERVICE_INTERVAL_SECONDS}}, now that we use 
{{Configuration#getTimeDuration()}}, we don't need to specify units as the 
user/administrator can do so themselves. I think we can rename it to 
{{DFS_BALANCER_SERVICE_INTERVAL}} / {{dfs.balancer.service.interval}}
# Can we rename {{DFS_BALANCER_SERVICE_RETRY_ON_EXCEPTION}} to 
{{DFS_BALANCER_RETRIES_ON_EXCEPTION}} (and the same change in the config key)? 
"retry on exception", to me, sounds like it would be a boolean flag of whether 
to retry at all. Retries makes it sound more like a number.
# When you load {{scheduleInterval}}, you get it in seconds, but when you use 
it (for {{Thread#sleep()}}), it needs to be milliseconds. Why not just load it 
as milliseconds from the beginning:
{code:java}
      scheduleInterval = conf.getTimeDuration(
          DFSConfigKeys.DFS_BALANCER_SERVICE_INTERVAL_SECONDS_KEY,
          DFSConfigKeys.DFS_BALANCER_SERVICE_INTERVAL_SECONDS_DEFAULT,
          TimeUnit.SECONDS, TimeUnit.MILLISECONDS);
{code}
# {{Balancer}} has a convenient {{time2Str(long)}} method, can we use that for 
the "Finished one round, will wait for ..." log statement?
# It seems that {{tried}} is never reset. I think we should reset it back to 0 
whenever we have a successful balance; it doesn't seem right for the service to 
have a limited number of retries over its entire lifetime.
# It looks like {{TestBalancerService#removeExtraBlocks()}} is not used 
anywhere, can we remove it?
# I think the {{setupCluster}} / {{addOneDataNode}} / {{newBalancerService}} 
methods should all be private?
# Why do we have {{Thread.sleep(500)}} on {{TestBalancerService}} L71? If 
you're waiting for the NN to become active, we should use 
{{cluster.waitActive(0)}} to avoid test flakiness. I'm also worried about the 
{{Thread.sleep(5000)}} call on L139. Is there any thing we can replace this 
with which will be less flaky?
# I think we should be using {{finally}} blocks to close the {{cluster}} you 
create in the tests.
# {{args}} created on {{TestBalancerService}} L159 is unused.
# For {{Thread.sleep(10)}} on {{TestBalancerService}} L170, the comment says 10 
seconds, but the code says 10 milliseconds.


was (Author: xkrogen):
Hi [~zhangchen], things are looking good!
{quote}The failed test is not related with this patch, all these tests can pass 
in local
{quote}
I don't believe you :) Take a look at the output for {{TestHdfsConfigFields}}:
{code:java}
missing in hdfs-default.xml Entries:   dfs.balancer.service.interval.seconds  
dfs.balancer.service.retry.on.exception expected:<0> but was:<2>
{code}
Please also tackle the outstanding checkstyle issues (besides line length 
warnings in {{DFSConfigKeys}}).

Moving on to my review:
 * The variable {{running}} seems to be intended to be called from a 
multi-threaded context, can you please make it {{volatile}}?
 * In the new {{run()}} method, I think we could make the flow of control more 
clear by avoiding all of the other logic when {{runAsService}} is false:
{code:java}
if (!p.getRunAsService()) {
  return doBalance(namenodes, p, conf);
}
{code}
Then, the rest of the method can avoid logic handling the non-service scenario. 
The {{Cli#run()}} method will handle exception printing and such.

 * You have a typo: {{scheduleInteral}} -> {{scheduleInterval}}
 * I think it's misleading that {{tried = 1}} before any attempts have been 
made. Can we make it so that {{tried = 0}} at the beginning? We can change the 
check in the catch block to be {{tried >= retryOnException}} or {{++tried > 
retryOnException}}
 * Please use the slf4j style loggers, for example, instead of:
{code:java}
        LOG.warn("Encounter exception while do balance work: " + e
            + " already tried " + tried + " times");
{code}
use:
{code:java}
        LOG.warn("Encounter exception while do balance work. Already tried {} 
times", tried, e);
{code}

 * For {{DFS_BALANCER_SERVICE_INTERVAL_SECONDS}}, now that we use 
{{Configuration#getTimeDuration()}}, we don't need to specify units as the 
user/administrator can do so themselves. I think we can rename it to 
{{DFS_BALANCER_SERVICE_INTERVAL}} / {{dfs.balancer.service.interval}}
 * Can we rename {{DFS_BALANCER_SERVICE_RETRY_ON_EXCEPTION}} to 
{{DFS_BALANCER_RETRIES_ON_EXCEPTION}} (and the same change in the config key)? 
"retry on exception", to me, sounds like it would be a boolean flag of whether 
to retry at all. Retries makes it sound more like a number.
 * When you load {{scheduleInterval}}, you get it in seconds, but when you use 
it (for {{Thread#sleep()}}), it needs to be milliseconds. Why not just load it 
as milliseconds from the beginning:
{code:java}
      scheduleInterval = conf.getTimeDuration(
          DFSConfigKeys.DFS_BALANCER_SERVICE_INTERVAL_SECONDS_KEY,
          DFSConfigKeys.DFS_BALANCER_SERVICE_INTERVAL_SECONDS_DEFAULT,
          TimeUnit.SECONDS, TimeUnit.MILLISECONDS);
{code}

 * {{Balancer}} has a convenient {{time2Str(long)}} method, can we use that for 
the "Finished one round, will wait for ..." log statement?
 * It seems that {{tried}} is never reset. I think we should reset it back to 0 
whenever we have a successful balance; it doesn't seem right for the service to 
have a limited number of retries over its entire lifetime.
 * It looks like {{TestBalancerService#removeExtraBlocks()}} is not used 
anywhere, can we remove it?
 * I think the {{setupCluster}} / {{addOneDataNode}} / {{newBalancerService}} 
methods should all be private?
 * Why do we have {{Thread.sleep(500)}} on {{TestBalancerService}} L71? If 
you're waiting for the NN to become active, we should use 
{{cluster.waitActive(0)}} to avoid test flakiness. I'm also worried about the 
{{Thread.sleep(5000)}} call on L139. Is there any thing we can replace this 
with which will be less flaky?
 * I think we should be using {{finally}} blocks to close the {{cluster}} you 
create in the tests.
 * {{args}} created on {{TestBalancerService}} L159 is unused.
 * For {{Thread.sleep(10)}} on {{TestBalancerService}} L170, the comment says 
10 seconds, but the code says 10 milliseconds.

> Balancer: make balancer to be a long service process for easy to monitor it.
> ----------------------------------------------------------------------------
>
>                 Key: HDFS-13783
>                 URL: https://issues.apache.org/jira/browse/HDFS-13783
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: balancer &amp; mover
>            Reporter: maobaolong
>            Assignee: Chen Zhang
>            Priority: Major
>         Attachments: HDFS-13783-001.patch, HDFS-13783-002.patch, 
> HDFS-13783.003.patch
>
>
> If we have a long service process of balancer, like namenode, datanode, we 
> can get metrics of balancer, the metrics can tell us the status of balancer, 
> the amount of block it has moved, 
> We can get or set the balance plan by the balancer webUI. So many things we 
> can do if we have a long balancer service process.
> So, shall we start to plan the new Balancer? Hope this feature can enter the 
> next release of hadoop.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

[jira] [Comment Edited] (HDFS-13783) Balancer: make balancer to be a long service process for easy to monitor it.

Reply via email to