[ https://issues.apache.org/jira/browse/HDFS-1312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15088012#comment-15088012 ]
Andrew Wang commented on HDFS-1312:
-----------------------------------
Hi Anu, thanks for picking up this JIRA; it's a long-standing issue. It's clear
from the docs that you've put a lot of thought into the problem, and I hope
that some of the same ideas could someday be carried over to the existing
balancer too. I really like separating planning from execution, since it'll
make the unit tests actual unit tests (no minicluster!). This is a real issue
with the existing balancer: its tests take forever to run and don't converge
reliably.
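As a minimal sketch of why that split helps (every name below is invented for
illustration, not taken from the design doc): if the planner is a pure function
over volume-usage snapshots, a plain JUnit test can feed it fabricated reports
with no minicluster at all, and only the IO-heavy executor needs heavier tests.
{noformat}
import java.util.List;

/** Hypothetical planner/executor split; all names are illustrative only. */
interface VolumePlanner {

  /** Pure function: compute move steps from a snapshot of per-volume usage.
   *  No IO happens here, so unit tests can drive it with fake reports. */
  List<MoveStep> plan(List<VolumeReport> volumes);

  /** Immutable snapshot of one volume's usage; trivial to fake in tests. */
  final class VolumeReport {
    final String volumePath;
    final long capacityBytes;
    final long usedBytes;
    VolumeReport(String volumePath, long capacityBytes, long usedBytes) {
      this.volumePath = volumePath;
      this.capacityBytes = capacityBytes;
      this.usedBytes = usedBytes;
    }
  }

  /** One unit of work handed to the (separately tested) IO executor. */
  final class MoveStep {
    final String srcVolume;
    final String dstVolume;
    final long bytesToMove;
    MoveStep(String srcVolume, String dstVolume, long bytesToMove) {
      this.srcVolume = srcVolume;
      this.dstVolume = dstVolume;
      this.bytesToMove = bytesToMove;
    }
  }
}
{noformat}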
I did have a few comments though, hoping we can get clarity on the use cases
and some of the resulting design decisions:
Section 2: HDFS-1804 has been around for a while and has been used successfully
by many of our users, so "lack of real world data or adoption" is not entirely
correct. We've even considered making it the default; see HDFS-8538, where the
consensus was that we could do this if we add some additional throttling.
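For reference, enabling HDFS-1804 today is just a volume-choosing-policy switch
in hdfs-site.xml, roughly as below (key and class names believed current, but
worth double-checking against your release):
{noformat}
<!-- hdfs-site.xml: choose volumes by available space instead of round-robin -->
<property>
  <name>dfs.datanode.fsdataset.volume.choosing.policy</name>
  <value>org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy</value>
</property>
{noformat}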
IMO the use case to focus on is the addition of fresh drives, particularly in
the context of hotswap. I'm unconvinced that intra-node imbalance happens
naturally when HDFS-1804 is enabled, and enabling HDFS-1804 is essential anyway
if a cluster commonly suffers from intra-DN imbalance (e.g. from differently
sized disks on a node). This means we should only see intra-node imbalance
after an admin action like adding a new drive: a singular,
administrator-triggered operation.
With that in mind, I wonder if we can limit the scope of this effort. I like
the idea of an online balancer; the previous hacky scripts required downtime,
which is unacceptable with hotswap. However, I don't understand the need for
cluster-wide reporting and orchestration. With HDFS-1804, intra-node imbalance
should only happen when an admin adds a new drive, so the admin also knows to
trigger the intra-DN balancer at that point. If they forget, Ganglia will
light up and remind them.
What I'm envisioning is a user experience where the admin just points it at a
DN, like:
{noformat}
hdfs balancer -volumes -datanode 1.2.3.4:50070
{noformat}
Notably, this avoids the Discover step, simplifying things. There's also no
global planning step; I think most of this functionality should live in the DN,
since it's better equipped to do IO throttling and mutual exclusion. Basically
we'd send an RPC to tell the DN to balance itself (with parameters), and then
poll another RPC to watch the status and wait until it's done.
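To make that concrete, here's a rough sketch of what that DN-side RPC pair
could look like; the method names, parameters, and status values are all
invented here, not a settled API:
{noformat}
import java.io.IOException;

/** Hypothetical DN protocol for a self-contained intra-node balancer. */
public interface DatanodeVolumeBalancerProtocol {

  /** Kick off balancing of this DN's volumes, with caller-supplied knobs
   *  such as a bandwidth cap; returns immediately. */
  void startVolumeBalance(long maxBandwidthBytesPerSec) throws IOException;

  /** Cheap status poll; the admin CLI loops on this until DONE or FAILED. */
  Status queryVolumeBalance() throws IOException;

  enum Status { IDLE, RUNNING, DONE, FAILED }
}
{noformat}
The CLI above would then be a thin client: one startVolumeBalance() call
followed by a sleep-and-poll loop on queryVolumeBalance().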
On the topic of the actual balancing, how do we atomically move the block in
the presence of failures? Right now the NN expects only one replica per DN, so
if the same replica is on multiple volumes of the DN, we could run into issues.
See related (quite serious) issues like HDFS-7443 and HDFS-7960. I think we can
do some tricks with a temp filename and rename, but this procedure should be
carefully explained.
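One possible shape for that temp-file-and-rename procedure, sketched under the
assumption that the DN updates its in-memory volume map between the rename and
the source delete (that step is only a comment below):
{noformat}
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

final class IntraDnBlockMover {

  /** Copy the block file to a temp name on the destination volume, fsync it,
   *  then atomically rename into place. The source replica is deleted only
   *  after the volume map points at the new copy, so a crash leaves at worst
   *  a stray .tmp file or a duplicate replica to clean up, never a lost
   *  block. */
  static void moveBlockFile(Path src, Path dstDir) throws IOException {
    Path tmp = dstDir.resolve(src.getFileName() + ".tmp");
    Path dst = dstDir.resolve(src.getFileName());
    Files.copy(src, tmp, StandardCopyOption.REPLACE_EXISTING);
    try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.WRITE)) {
      ch.force(true); // make the copied bytes durable before exposing them
    }
    Files.move(tmp, dst, StandardCopyOption.ATOMIC_MOVE);
    // Update the DN's in-memory volume map to the new location here, then
    // drop the old copy:
    Files.delete(src);
  }
}
{noformat}
A real implementation would also have to do the same dance for the block's
.meta file, and handle the duplicate-replica window that HDFS-7443 and
HDFS-7960 describe.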
Thanks,
Andrew
> Re-balance disks within a Datanode
> ----------------------------------
>
> Key: HDFS-1312
> URL: https://issues.apache.org/jira/browse/HDFS-1312
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: datanode
> Reporter: Travis Crawford
> Assignee: Anu Engineer
> Attachments: Architecture_and_testplan.pdf, disk-balancer-proposal.pdf
>
>
> Filing this issue in response to ``full disk woes`` on hdfs-user.
> Datanodes fill their storage directories unevenly, leading to situations
> where certain disks are full while others are significantly less used. Users
> at many different sites have experienced this issue, and HDFS administrators
> are taking steps like:
> - Manually rebalancing blocks in storage directories
> - Decommissioning nodes & later re-adding them
> There's a tradeoff between making use of all available spindles and filling
> disks at the same-ish rate. Possible solutions include:
> - Weighting less-used disks heavier when placing new blocks on the datanode.
> In write-heavy environments this will still make use of all spindles,
> equalizing disk use over time.
> - Rebalancing blocks locally. This would help equalize disk use as disks are
> added/replaced in older cluster nodes.
> Datanodes should actively manage their local disk so operator intervention is
> not needed.