[jira] [Comment Edited] (HDFS-13783) Balancer: make balancer to be a long service process for easy to monitor it.

Erik Krogen (JIRA) Wed, 17 Jul 2019 10:20:27 -0700


    [ 
https://issues.apache.org/jira/browse/HDFS-13783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16887263#comment-16887263
 ]


Erik Krogen edited comment on HDFS-13783 at 7/17/19 5:19 PM:
-------------------------------------------------------------

Hey [~zhangchen], the idea is good, but I think we need some more work here 
before we're ready to declare success... Top-level items to handle would be:

* A way to start/stop the service. Maybe we can re-use the existing {{hdfs 
balancer}} command with a {{--service}} flag, and then we should be able to 
just use {{hdfs --daemon start balancer}}.
* It seems like there should be more intelligent failure/exception handling: a 
service which goes down every time it encounters an issue doesn't seem that 
useful, maybe we should at least add some configurable count or time period of 
retry-on-failure.
* Adding a new service seems like it deserves some documentation -- we can 
probably piggyback off of the existing [balancer 
documentation|https://hadoop.apache.org/docs/r3.1.2/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Balancer]
 in the user guide.

A few more specific comments:
* {{BalancerService}} has a constructor, but it doesn't seem that it is used 
anywhere.
* On L27, you shouldn't be creating a {{new HdfsConfiguration()}}, since this 
won't respect config overrides on the CLI. You should use {{ToolRunner}} here 
to initialize the config.
* Why do we need {{SecurityUtil.login()}}? Doesn't the balancer support 
automatically doing login via keytab already (HDFS-9804)?
* We shouldn't be ignoring {{InterruptedException}}, this is how to indicate to 
a service that it should exit. They should be fatal.
* Instead of using {{conf.getLong()}}, we should use 
{{conf.getTimeDuration()}}, allowing administrators to use human-readable time 
values.

By the way, your patch should be named like {{HDFS-13783.002.patch}} (period 
before version number, not hyphen). See the [patch naming 
guide|https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute#HowToContribute-Namingyourpatch].


was (Author: xkrogen):
Hey [~zhangchen], the idea is good, but I think we need some more work here 
before we're ready to declare success... Top-level items to handle would be:

* A way to start/stop the service. Maybe we can re-use the existing {{hdfs 
balancer}} command with a {{--service}} flag, and then we should be able to 
just use {{hdfs --daemon start balancer}}.
* It seems like there should be more intelligent failure/exception handling: a 
service which goes down every time it encounters an issue doesn't seem that 
useful, maybe we should at least add some configurable count or time period of 
retry-on-failure.
* Adding a new service seems like it deserves some documentation -- we can 
probably piggyback off of the existing [balancer 
documentation|https://hadoop.apache.org/docs/r3.1.2/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Balancer]
 in the user guide.

A few more specific comments:
* {{BalancerService}} has a constructor, but it doesn't seem that it is used 
anywhere.
* On L27, you shouldn't be creating a {{new HdfsConfiguration()}}, since this 
won't respect config overrides on the CLI. You should use {{ToolRunner}} here 
to initialize the config.
* Why do we need {{SecurityUtil.login()}}? Doesn't the balancer support 
automatically doing login via keytab already (HDFS-9804)?
* We shouldn't be ignoring {{InterruptedException}}, this is how to indicate to 
a service that it should exit. They should be fatal.
* Instead of using {{conf.getLong()}}, we should use 
{{conf.getTimeDuration()}}, allowing administrators to use human-readable time 
values.

By the way, your patch should be named like "HDFS-13783.002.patch" (period 
before version number, not hyphen). See the [patch naming 
guide|https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute#HowToContribute-Namingyourpatch].

> Balancer: make balancer to be a long service process for easy to monitor it.
> ----------------------------------------------------------------------------
>
>                 Key: HDFS-13783
>                 URL: https://issues.apache.org/jira/browse/HDFS-13783
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: balancer &amp; mover
>            Reporter: maobaolong
>            Assignee: Chen Zhang
>            Priority: Major
>         Attachments: HDFS-13783-001.patch, HDFS-13783-002.patch
>
>
> If we have a long service process of balancer, like namenode, datanode, we 
> can get metrics of balancer, the metrics can tell us the status of balancer, 
> the amount of block it has moved, 
> We can get or set the balance plan by the balancer webUI. So many things we 
> can do if we have a long balancer service process.
> So, shall we start to plan the new Balancer? Hope this feature can enter the 
> next release of hadoop.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Comment Edited] (HDFS-13783) Balancer: make balancer to be a long service process for easy to monitor it.

Reply via email to