[
https://issues.apache.org/jira/browse/HDFS-13783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16891964#comment-16891964
]
Erik Krogen edited comment on HDFS-13783 at 7/24/19 4:06 PM:
-------------------------------------------------------------
Hi [~zhangchen], things are looking good!
{quote}The failed test is not related with this patch, all these tests can pass
in local
{quote}
I don't believe you :) Take a look at the output for {{TestHdfsConfigFields}}:
{code:java}
missing in hdfs-default.xml Entries: dfs.balancer.service.interval.seconds
dfs.balancer.service.retry.on.exception expected:<0> but was:<2>
{code}
Please also tackle the outstanding checkstyle issues (besides line length
warnings in {{DFSConfigKeys}}).
Moving on to my review:
# The variable {{running}} seems to be intended to be called from a
multi-threaded context, can you please make it {{volatile}}?
# In the new {{run()}} method, I think we could make the flow of control more
clear by avoiding all of the other logic when {{runAsService}} is false:
{code:java}
if (!p.getRunAsService()) {
return doBalance(namenodes, p, conf);
}
{code}
Then, the rest of the method can avoid logic handling the non-service scenario.
The {{Cli#run()}} method will handle exception printing and such.
# You have a typo: {{scheduleInteral}} -> {{scheduleInterval}}
# I think it's misleading that {{tried = 1}} before any attempts have been
made. Can we make it so that {{tried = 0}} at the beginning? We can change the
check in the catch block to be {{tried >= retryOnException}} or {{++tried >
retryOnException}}
# Please use the slf4j style loggers, for example, instead of:
{code:java}
LOG.warn("Encounter exception while do balance work: " + e
+ " already tried " + tried + " times");
{code}
use:
{code:java}
LOG.warn("Encounter exception while do balance work. Already tried {}
times", tried, e);
{code}
# For {{DFS_BALANCER_SERVICE_INTERVAL_SECONDS}}, now that we use
{{Configuration#getTimeDuration()}}, we don't need to specify units as the
user/administrator can do so themselves. I think we can rename it to
{{DFS_BALANCER_SERVICE_INTERVAL}} / {{dfs.balancer.service.interval}}
# Can we rename {{DFS_BALANCER_SERVICE_RETRY_ON_EXCEPTION}} to
{{DFS_BALANCER_RETRIES_ON_EXCEPTION}} (and the same change in the config key)?
"retry on exception", to me, sounds like it would be a boolean flag of whether
to retry at all. Retries makes it sound more like a number.
# When you load {{scheduleInterval}}, you get it in seconds, but when you use
it (for {{Thread#sleep()}}), it needs to be milliseconds. Why not just load it
as milliseconds from the beginning:
{code:java}
scheduleInterval = conf.getTimeDuration(
DFSConfigKeys.DFS_BALANCER_SERVICE_INTERVAL_SECONDS_KEY,
DFSConfigKeys.DFS_BALANCER_SERVICE_INTERVAL_SECONDS_DEFAULT,
TimeUnit.SECONDS, TimeUnit.MILLISECONDS);
{code}
# {{Balancer}} has a convenient {{time2Str(long)}} method, can we use that for
the "Finished one round, will wait for ..." log statement?
# It seems that {{tried}} is never reset. I think we should reset it back to 0
whenever we have a successful balance; it doesn't seem right for the service to
have a limited number of retries over its entire lifetime.
# It looks like {{TestBalancerService#removeExtraBlocks()}} is not used
anywhere, can we remove it?
# I think the {{setupCluster}} / {{addOneDataNode}} / {{newBalancerService}}
methods should all be private?
# Why do we have {{Thread.sleep(500)}} on {{TestBalancerService}} L71? If
you're waiting for the NN to become active, we should use
{{cluster.waitActive(0)}} to avoid test flakiness. I'm also worried about the
{{Thread.sleep(5000)}} call on L139. Is there any thing we can replace this
with which will be less flaky?
# I think we should be using {{finally}} blocks to close the {{cluster}} you
create in the tests.
# {{args}} created on {{TestBalancerService}} L159 is unused.
# For {{Thread.sleep(10)}} on {{TestBalancerService}} L170, the comment says 10
seconds, but the code says 10 milliseconds.
was (Author: xkrogen):
Hi [~zhangchen], things are looking good!
{quote}The failed test is not related with this patch, all these tests can pass
in local
{quote}
I don't believe you :) Take a look at the output for {{TestHdfsConfigFields}}:
{code:java}
missing in hdfs-default.xml Entries: dfs.balancer.service.interval.seconds
dfs.balancer.service.retry.on.exception expected:<0> but was:<2>
{code}
Please also tackle the outstanding checkstyle issues (besides line length
warnings in {{DFSConfigKeys}}).
Moving on to my review:
* The variable {{running}} seems to be intended to be called from a
multi-threaded context, can you please make it {{volatile}}?
* In the new {{run()}} method, I think we could make the flow of control more
clear by avoiding all of the other logic when {{runAsService}} is false:
{code:java}
if (!p.getRunAsService()) {
return doBalance(namenodes, p, conf);
}
{code}
Then, the rest of the method can avoid logic handling the non-service scenario.
The {{Cli#run()}} method will handle exception printing and such.
* You have a typo: {{scheduleInteral}} -> {{scheduleInterval}}
* I think it's misleading that {{tried = 1}} before any attempts have been
made. Can we make it so that {{tried = 0}} at the beginning? We can change the
check in the catch block to be {{tried >= retryOnException}} or {{++tried >
retryOnException}}
* Please use the slf4j style loggers, for example, instead of:
{code:java}
LOG.warn("Encounter exception while do balance work: " + e
+ " already tried " + tried + " times");
{code}
use:
{code:java}
LOG.warn("Encounter exception while do balance work. Already tried {}
times", tried, e);
{code}
* For {{DFS_BALANCER_SERVICE_INTERVAL_SECONDS}}, now that we use
{{Configuration#getTimeDuration()}}, we don't need to specify units as the
user/administrator can do so themselves. I think we can rename it to
{{DFS_BALANCER_SERVICE_INTERVAL}} / {{dfs.balancer.service.interval}}
* Can we rename {{DFS_BALANCER_SERVICE_RETRY_ON_EXCEPTION}} to
{{DFS_BALANCER_RETRIES_ON_EXCEPTION}} (and the same change in the config key)?
"retry on exception", to me, sounds like it would be a boolean flag of whether
to retry at all. Retries makes it sound more like a number.
* When you load {{scheduleInterval}}, you get it in seconds, but when you use
it (for {{Thread#sleep()}}), it needs to be milliseconds. Why not just load it
as milliseconds from the beginning:
{code:java}
scheduleInterval = conf.getTimeDuration(
DFSConfigKeys.DFS_BALANCER_SERVICE_INTERVAL_SECONDS_KEY,
DFSConfigKeys.DFS_BALANCER_SERVICE_INTERVAL_SECONDS_DEFAULT,
TimeUnit.SECONDS, TimeUnit.MILLISECONDS);
{code}
* {{Balancer}} has a convenient {{time2Str(long)}} method, can we use that for
the "Finished one round, will wait for ..." log statement?
* It seems that {{tried}} is never reset. I think we should reset it back to 0
whenever we have a successful balance; it doesn't seem right for the service to
have a limited number of retries over its entire lifetime.
* It looks like {{TestBalancerService#removeExtraBlocks()}} is not used
anywhere, can we remove it?
* I think the {{setupCluster}} / {{addOneDataNode}} / {{newBalancerService}}
methods should all be private?
* Why do we have {{Thread.sleep(500)}} on {{TestBalancerService}} L71? If
you're waiting for the NN to become active, we should use
{{cluster.waitActive(0)}} to avoid test flakiness. I'm also worried about the
{{Thread.sleep(5000)}} call on L139. Is there any thing we can replace this
with which will be less flaky?
* I think we should be using {{finally}} blocks to close the {{cluster}} you
create in the tests.
* {{args}} created on {{TestBalancerService}} L159 is unused.
* For {{Thread.sleep(10)}} on {{TestBalancerService}} L170, the comment says
10 seconds, but the code says 10 milliseconds.
> Balancer: make balancer to be a long service process for easy to monitor it.
> ----------------------------------------------------------------------------
>
> Key: HDFS-13783
> URL: https://issues.apache.org/jira/browse/HDFS-13783
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: balancer & mover
> Reporter: maobaolong
> Assignee: Chen Zhang
> Priority: Major
> Attachments: HDFS-13783-001.patch, HDFS-13783-002.patch,
> HDFS-13783.003.patch
>
>
> If we have a long service process of balancer, like namenode, datanode, we
> can get metrics of balancer, the metrics can tell us the status of balancer,
> the amount of block it has moved,
> We can get or set the balance plan by the balancer webUI. So many things we
> can do if we have a long balancer service process.
> So, shall we start to plan the new Balancer? Hope this feature can enter the
> next release of hadoop.
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]