[ https://issues.apache.org/jira/browse/HDFS-1312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15088012#comment-15088012 ]

Andrew Wang commented on HDFS-1312:
-----------------------------------

Hi Anu, thanks for picking up this JIRA; it's a long-standing issue. It's clear 
from the docs that you've put a lot of thought into the problem, and I hope 
that some of the same ideas can someday be carried over to the existing 
balancer too. I really like separating planning from execution, since it'll 
make the unit tests actual unit tests (no minicluster!). This is a real issue 
with the existing balancer: its tests take forever to run and don't converge 
reliably.
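
To illustrate why that separation pays off: a planner that is a pure function 
of reported volume stats can be exercised with plain JUnit or even a main(), 
no minicluster required. Rough sketch only; every name below is hypothetical, 
not real Hadoop code:

{noformat}
import java.util.Arrays;
import java.util.List;

/** Illustration only; none of these names exist in Hadoop. */
class TinyPlannerExample {
  static class Volume {
    final String path;
    final double usedFraction;
    Volume(String path, double usedFraction) {
      this.path = path;
      this.usedFraction = usedFraction;
    }
  }

  /** Pure function of the stats: plan a move from fullest to emptiest. */
  static String planOneMove(List<Volume> vols) {
    Volume src = vols.get(0), dst = vols.get(0);
    for (Volume v : vols) {
      if (v.usedFraction > src.usedFraction) src = v;
      if (v.usedFraction < dst.usedFraction) dst = v;
    }
    return src.path + " -> " + dst.path;
  }

  public static void main(String[] args) {
    List<Volume> vols = Arrays.asList(
        new Volume("/data/0", 0.95),   // nearly full disk
        new Volume("/data/1", 0.05));  // freshly added disk
    System.out.println(planOneMove(vols));  // /data/0 -> /data/1
  }
}
{noformat}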

I did have a few comments though, hoping we can get clarity on the use cases 
and some of the resulting design decisions:

Section 2: HDFS-1804 has been around for a while and has been used 
successfully by many of our users, so "lack of real world data or adoption" is 
not entirely correct. We've even considered making it the default; see 
HDFS-8538, where the consensus was that we could do this if we add some 
additional throttling.
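
For anyone following along, HDFS-1804's policy is enabled per-DN in 
hdfs-site.xml by swapping the volume choosing policy, along these lines:

{noformat}
<!-- hdfs-site.xml: replace round-robin volume choice with the
     available-space policy from HDFS-1804 -->
<property>
  <name>dfs.datanode.fsdataset.volume.choosing.policy</name>
  <value>org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy</value>
</property>
{noformat}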

IMO the use case to focus on is the addition of fresh drives, particularly in 
the context of hotswap. I'm unconvinced that intra-node imbalance happens 
naturally when HDFS-1804 is enabled, and enabling HDFS-1804 is essential if a 
cluster is commonly suffering from intra-DN imbalance (e.g. from differently 
sized disks on a node). This means we should only see intra-node imbalance 
after an admin action like adding a new drive: a singular, 
administrator-triggered operation.

With that in mind, I wonder if we can limit the scope of this effort. I like 
the idea of an online balancer; the previous hacky scripts required downtime, 
which is unacceptable with hotswap. However, I don't understand the need for 
cluster-wide reporting and orchestration. With HDFS-1804, intra-node imbalance 
should only happen when an admin adds a new drive, and the admin then knows to 
trigger the intra-DN balancer as part of that procedure. If they forget, 
Ganglia will light up and remind them.

What I'm envisioning is a user experience where the admin just points it at a 
DN, like:

{noformat}
hdfs balancer -volumes -datanode 1.2.3.4:50070
{noformat}

Notably, this avoids the Discover step, simplifying things. There's also no 
global planning step; I think most of this functionality should live in the 
DN, since it's better equipped to do IO throttling and mutual exclusion. 
Basically, we'd send an RPC to tell the DN to balance itself (with 
parameters), and then poll another RPC to watch the status and wait until 
it's done.
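
Strawman for what that DN-side surface could look like; all of these names 
are made up for discussion, none of them exist today:

{noformat}
import java.io.IOException;

/** Hypothetical DN RPC protocol for intra-node balancing. */
public interface IntraNodeBalancerProtocol {

  /** Tell this DN to balance its own volumes, with throttling knobs. */
  void startVolumeBalancing(long maxBandwidthBytesPerSec,
                            double usedSpaceThresholdPercent)
      throws IOException;

  /** Polled by the CLI until the run finishes (or fails). */
  Status queryVolumeBalancingStatus() throws IOException;

  enum Status { NOT_RUNNING, IN_PROGRESS, DONE, FAILED }
}
{noformat}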

On the topic of the actual balancing, how do we atomically move the block in 
the presence of failures? Right now the NN expects only one replica per DN, so 
if the same replica is on multiple volumes of the DN, we could run into issues. 
See related (quite serious) issues like HDFS-7443 and HDFS-7960. I think we can 
do some tricks with a temp filename and rename, but this procedure should be 
carefully explained.
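
Rough sketch of the temp-and-rename idea, assuming the copy and rename both 
happen on the destination volume (the method and path handling are 
illustrative, not a worked-out design):

{noformat}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Illustration only: move one block file between volumes. */
class BlockMoveSketch {
  static void moveBlockAcrossVolumes(Path src, Path dstFinal)
      throws IOException {
    // 1. Copy under a temp name on the destination volume, so a crash
    //    mid-copy never exposes a second, partial replica.
    Path tmp = dstFinal.resolveSibling(dstFinal.getFileName() + ".tmp");
    Files.copy(src, tmp, StandardCopyOption.REPLACE_EXISTING);
    // (A real implementation would fsync here, and move the .meta file
    // with the same dance.)

    // 2. A same-filesystem rename is atomic, so scanners see either no
    //    block at the destination or the whole block, never a partial.
    Files.move(tmp, dstFinal, StandardCopyOption.ATOMIC_MOVE);

    // 3. Only after the destination is durable do we delete the source
    //    and update the DN's in-memory volume map, to avoid the
    //    one-replica-per-DN confusion mentioned above.
    Files.delete(src);
  }
}
{noformat}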

Thanks,
Andrew

> Re-balance disks within a Datanode
> ----------------------------------
>
>                 Key: HDFS-1312
>                 URL: https://issues.apache.org/jira/browse/HDFS-1312
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode
>            Reporter: Travis Crawford
>            Assignee: Anu Engineer
>         Attachments: Architecture_and_testplan.pdf, disk-balancer-proposal.pdf
>
>
> Filing this issue in response to ``full disk woes`` on hdfs-user.
> Datanodes fill their storage directories unevenly, leading to situations 
> where certain disks are full while others are significantly less used. Users 
> at many different sites have experienced this issue, and HDFS administrators 
> are taking steps like:
> - Manually rebalancing blocks in storage directories
> - Decommissioning nodes & later re-adding them
> There's a tradeoff between making use of all available spindles, and filling 
> disks at the same-ish rate. Possible solutions include:
> - Weighting less-used disks heavier when placing new blocks on the datanode. 
> In write-heavy environments this will still make use of all spindles, 
> equalizing disk use over time.
> - Rebalancing blocks locally. This would help equalize disk use as disks are 
> added/replaced in older cluster nodes.
> Datanodes should actively manage their local disk so operator intervention is 
> not needed.


