[
https://issues.apache.org/jira/browse/HDFS-1312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15088688#comment-15088688
]
Anu Engineer commented on HDFS-1312:
------------------------------------
Hi [~andrew.wang] Thanks for the quick response. I was thinking about what you
said, and I think the disconnect we are having is because you are assuming that
HDFS-1804 is always available on HDFS clusters, but for many customers that is
not true.
bq. Have these customers tried out HDFS-1804? We've had users complain about
imbalance before too, and after enabling HDFS-1804, no further issues.
Generally administrators are wary of enabling a feature like HDFS-1804 in a
production cluster. For new clusters it is easier, but for existing production
clusters assuming the existence of HDFS-1804 is not realistic.
bq. If you're interested, feel free to take HDFS-8538 from me. I really think
it'll fix the majority of imbalance issues outside of hotswap.
Thank you for the offer. I can certainly work on HDFS-8538 after HDFS-1312. I
think of HDFS-1804, HDFS-8538 and HDFS-1312 as parts of the solution to the same
problem, just attacking it from different angles; without all three, some
portion of HDFS users will always be left out.
bq. When I mentioned removing the discover phase, I meant the NN communication.
Here, the DN just probes its own volume information. Does it need to talk to
the NN for anything else?
# I am not able to see any particular advantage in doing planning inside the
datanode. However, with that approach we do lose one of the critical features of
the tool, that is, the ability to report what we did to the machine; capturing
the "before state" becomes much more complex. I agree that users can manually
capture this info before and after, but that is an extra administrative burden.
With the current approach, we record the state of a datanode before we start and
make it easy to compare it with the state once we are done.
# Also, we wanted to merge the mover into this engine later. With the current
approach, what we build inside the datanode is a simple block mover, or to be
precise an RPC interface to the existing mover's block interface. You can feed
it any move commands you like, which provides better composability (see the
sketch after this list). With what you are suggesting we would lose that
flexibility.
# Less complexity inside the datanode: the planner code never needs to run
inside the datanode. It is a piece of code that plans a set of moves, so why
would we want to run it inside the datanode?
# Since none of our tools currently report the disk-level data distribution,
without talking to the NameNode it is not possible to find which nodes are
imbalanced. I know that you are arguing that imbalance will never happen if all
customers use HDFS-1804. There are two issues with that: first, there are lots
of customers without HDFS-1804, and second, HDFS-1804 is just an option that the
user can choose. Since it is configurable, it is arguable that we will always
have customers without HDFS-1804. The current architecture addresses the needs
of both groups of users; in that sense it is a better, more encompassing
architecture.
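To make the second point above concrete, here is a minimal sketch of what the
datanode-side surface could look like if the datanode is only an executor of
externally computed plans. All names here (DiskBalancerDataNodeProtocol,
submitPlan, and so on) are hypothetical and purely illustrative; the real RPC
shape would come from the design doc, not this comment.
{noformat}
import java.io.IOException;

// Hypothetical sketch only -- these interface and method names are assumptions
// for illustration, not the actual HDFS protocol. The point is the split argued
// for above: discovery and planning stay in the client-side tool, and the
// datanode merely accepts and executes a set of move commands.
public interface DiskBalancerDataNodeProtocol {

  /** Submit an externally computed plan (e.g. serialized as JSON) for execution. */
  void submitPlan(String planId, String planJson) throws IOException;

  /** Report the status/progress of the currently running plan, if any. */
  String queryPlanStatus() throws IOException;

  /** Cancel the plan identified by planId. */
  void cancelPlan(String planId) throws IOException;
}
{noformat}
Because the datanode side only understands "execute these moves", any planner
(disk balancer today, a merged mover later) can drive it, which is exactly the
composability argument above.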
bq. Cluster-wide disk information is already handled by monitoring tools, no?
The admin gets the Ganglia alert saying some node is imbalanced, admin triggers
intranode balancer, admin keeps looking at Ganglia to see if it's fixed. I
don't think adding our own monitoring of the same information helps, when
Ganglia etc. are already available, in-use, and understood by admins.
Please correct me if I am missing something here: getting an alert from a
datanode due to low space on a disk is very reactive. With the current disk
balancer design we are assuming that the disk balancer tool can be run to find
and fix any such issue in the cluster. You could argue that admins can write
scripts to monitor this issue using the Ganglia command line, but it is a common
enough problem that I think it should be solved at the HDFS level.
Here are two use cases that the disk balancer addresses: the first is
discovering nodes with potential issues, and the second is automatically fixing
those issues. This is very similar to the current balancer.
1. Scenario 1: Admin can run {noformat} hdfs diskbalancer -top 100 {noformat},
and voila! We print out the top 100 nodes that are having a problem (a rough
sketch of one possible ranking metric follows after these scenarios). Let us say
that the admin now wants to look closely at a node and find the distribution on
individual disks; he can now do that via disk balancer, ssh or Ganglia.
2. Scenario 2: Admin does not want to be bothered with this balancing act at
all; in reality he is thinking, why doesn't HDFS just take care of this? (I know
HDFS-1804 is addressing that, but again we are talking about clusters which do
not have it enabled.) In that case we will let the admin run {noformat}
hdfs diskbalancer -top 10 -balance {noformat}, which allows the admin to run the
disk balancer just like the current balancer, without having to worry about what
is happening or measuring each node. With Ganglia, a bunch of nodes will fire
alerts, and the admin needs to copy the address of each datanode and give it to
the disk balancer. I think the current flow of the disk balancer makes it easier
to use.
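As a purely illustrative sketch of the ranking idea behind a "-top N" style
discovery step: the tool could score each datanode by how unevenly its volumes
are filled and sort by that score. The metric and every name below
(ImbalanceScore, Volume, nodeImbalance) are assumptions for illustration, not
the tool's actual implementation.
{noformat}
import java.util.List;

public final class ImbalanceScore {

  /** Minimal volume view: bytes used and total capacity. Hypothetical type. */
  public static final class Volume {
    final long used;
    final long capacity;

    public Volume(long used, long capacity) {
      this.used = used;
      this.capacity = capacity;
    }
  }

  /**
   * Sum of absolute deviations of each volume's used ratio from the node's
   * overall used ratio; 0 means the volumes are perfectly evenly filled.
   */
  public static double nodeImbalance(List<Volume> volumes) {
    long totalUsed = 0;
    long totalCapacity = 0;
    for (Volume v : volumes) {
      totalUsed += v.used;
      totalCapacity += v.capacity;
    }
    double idealRatio = totalCapacity == 0 ? 0.0 : (double) totalUsed / totalCapacity;
    double score = 0.0;
    for (Volume v : volumes) {
      double ratio = v.capacity == 0 ? 0.0 : (double) v.used / v.capacity;
      score += Math.abs(ratio - idealRatio);
    }
    return score;
  }
}
{noformat}
A higher score means a more imbalanced node, so "-top 100" would simply be the
100 nodes with the highest scores under this kind of metric.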
bq. I don't think this conflicts with the debuggability goal. The DN can dump
the Info object (and even the Plan object) if requested, to the log or
somewhere in a data dir.
Well, it is debuggable, but assuming that I am the one who will be called on to
debug this, I prefer to debug by looking at my local directory instead of
ssh-ing into a datanode. I think of writing to the local directory as a gift to
my future self :). Plus, as mentioned earlier, for the other use case where we
want to report to the user what our tool did, fetching this data out of the
datanode's log directory is hard (maybe another RPC to fetch it?).
bq. Adding a note that says "we use the existing moveBlockAcrossStorage method"
is a great answer.
I will update the design doc with this info. Thanks for your suggestions.
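For completeness, here is a rough sketch of the executor side implied by that
note, under the assumption that each planned step eventually funnels into the
existing block-move path (the moveBlockAcrossStorage machinery quoted above).
Every name in this snippet (PlanStep, BlockMover, PlanExecutor) is hypothetical,
not code from the patch.
{noformat}
import java.io.IOException;
import java.util.List;

public final class PlanExecutor {

  /** One volume-to-volume move, as produced by the external planner. */
  public static final class PlanStep {
    final String sourceVolume;
    final String destVolume;
    final long bytesToMove;

    public PlanStep(String sourceVolume, String destVolume, long bytesToMove) {
      this.sourceVolume = sourceVolume;
      this.destVolume = destVolume;
      this.bytesToMove = bytesToMove;
    }
  }

  /**
   * Hypothetical seam over the datanode's existing block-move code; a real
   * implementation would delegate to the existing move machinery rather than
   * reimplementing block copies.
   */
  public interface BlockMover {
    void move(String sourceVolume, String destVolume, long bytes) throws IOException;
  }

  private final BlockMover mover;

  public PlanExecutor(BlockMover mover) {
    this.mover = mover;
  }

  /** Executes each step in order; the datanode itself never plans anything. */
  public void execute(List<PlanStep> steps) throws IOException {
    for (PlanStep step : steps) {
      mover.move(step.sourceVolume, step.destVolume, step.bytesToMove);
    }
  }
}
{noformat}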
> Re-balance disks within a Datanode
> ----------------------------------
>
> Key: HDFS-1312
> URL: https://issues.apache.org/jira/browse/HDFS-1312
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: datanode
> Reporter: Travis Crawford
> Assignee: Anu Engineer
> Attachments: Architecture_and_testplan.pdf, disk-balancer-proposal.pdf
>
>
> Filing this issue in response to ``full disk woes`` on hdfs-user.
> Datanodes fill their storage directories unevenly, leading to situations
> where certain disks are full while others are significantly less used. Users
> at many different sites have experienced this issue, and HDFS administrators
> are taking steps like:
> - Manually rebalancing blocks in storage directories
> - Decommissioning nodes & later re-adding them
> There's a tradeoff between making use of all available spindles, and filling
> disks at the same-ish rate. Possible solutions include:
> - Weighting less-used disks heavier when placing new blocks on the datanode.
> In write-heavy environments this will still make use of all spindles,
> equalizing disk use over time.
> - Rebalancing blocks locally. This would help equalize disk use as disks are
> added/replaced in older cluster nodes.
> Datanodes should actively manage their local disk so operator intervention is
> not needed.