[ 
https://issues.apache.org/jira/browse/HDFS-1312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15088505#comment-15088505
 ] 

Andrew Wang commented on HDFS-1312:
-----------------------------------

Hi Anu, thanks for the reply,

bq. Our own experience from the field is that many customers routinely run into 
this issue, with and without new drives being added, so we are tackling both 
use cases.

Have these customers tried out HDFS-1804? We've had users complain about 
imbalance before too, and after enabling HDFS-1804, no further issues. This is 
why I'm trying to separate the two use cases; heterogeneous disks are better 
addressed by HDFS-1804 since it works automatically, leaving hotswap to this 
JIRA (HDFS-1312).

If you're interested, feel free to take HDFS-8538 from me. I really think it'll 
fix the majority of imbalance issues outside of hotswap.
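
In case it helps, enabling HDFS-1804 is essentially a one-property change in hdfs-site.xml. The property names below are from memory and worth double-checking against hdfs-default.xml:

```xml
<!-- Switch the DN from the default round-robin policy to the
     available-space volume choosing policy added by HDFS-1804. -->
<property>
  <name>dfs.datanode.fsdataset.volume.choosing.policy</name>
  <value>org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy</value>
</property>
<!-- Volumes whose free space differs by less than this many bytes
     are treated as balanced (10 GB here). -->
<property>
  <name>dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold</name>
  <value>10737418240</value>
</property>
```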

bq. (self-quote) I think most of this functionality should live in the DN since 
it's better equipped to do IO throttling and mutual exclusion.

I think I was too brief before, let me expand a little. I imagine basically 
everything (discovery, planning, execution) happening in the DN:

# Client sends RPC to DN telling it to balance with some parameters
# DN examines its volumes, constructs some {{Info}} object to hold the data
# DN thread calls Planner passing the {{Info}} object, which outputs a {{Plan}}
# {{Plan}} is queued at DN executor pool, which does the moves
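
In code form, that flow might look like the following sketch. To be clear, every class and method name here is invented for illustration; none of this is actual HDFS code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of the DN-internal flow (Info -> Planner -> Plan).
// All names are made up; this is not a real HDFS API.
public class IntraDnBalancerSketch {

  // Step 2: the "Info" object -- a snapshot of per-volume usage that the
  // DN builds by probing its own volumes, with no NN round trip.
  static final class VolumeStats {
    final String dir;
    final long capacityBytes;
    final long usedBytes;
    VolumeStats(String dir, long capacityBytes, long usedBytes) {
      this.dir = dir;
      this.capacityBytes = capacityBytes;
      this.usedBytes = usedBytes;
    }
    double utilization() { return (double) usedBytes / capacityBytes; }
  }

  // Step 3: the "Plan" is an ordered list of byte moves between volumes.
  static final class Move {
    final String from;
    final String to;
    final long bytes;
    Move(String from, String to, long bytes) {
      this.from = from; this.to = to; this.bytes = bytes;
    }
  }

  // The Planner: a pure function of the Info. Volumes more than
  // `threshold` above the mean utilization shed bytes to volumes more
  // than `threshold` below it.
  static List<Move> plan(List<VolumeStats> volumes, double threshold) {
    List<Move> moves = new ArrayList<>();
    double mean = volumes.stream()
        .mapToDouble(VolumeStats::utilization).average().orElse(0.0);
    for (VolumeStats src : volumes) {
      double excess = src.utilization() - mean;
      if (excess <= threshold) continue;
      for (VolumeStats dst : volumes) {
        double deficit = mean - dst.utilization();
        if (deficit <= threshold) continue;
        long bytes = (long) (Math.min(excess, deficit) * dst.capacityBytes);
        moves.add(new Move(src.dir, dst.dir, bytes));
      }
    }
    return moves;
  }

  // Steps 1 and 4 stand-in: the "start balancing" RPC handler would build
  // the Info, call plan(), and queue the Plan at a throttled executor pool.
  public static void main(String[] args) {
    List<VolumeStats> info = Arrays.asList(
        new VolumeStats("/data/1", 1_000_000L, 900_000L),   // 90% full
        new VolumeStats("/data/2", 1_000_000L, 100_000L));  // 10% full
    for (Move m : plan(info, 0.10)) {
      System.out.println(m.from + " -> " + m.to + " (" + m.bytes + " bytes)");
    }
  }
}
```

The point of the shape is that the RPC boundary only wraps the outermost call; everything between Info and Plan is ordinary in-process code.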

Attempting to address your points one by one:

# When I mentioned removing the discovery phase, I meant the NN communication. 
Here, the DN just probes its own volume information. Does it need to talk to 
the NN for anything else?
# Assuming no need for NN communication, there's very little new code. The new 
RPCs would be one to start balancing and one to monitor ongoing balancing; the 
rest of the communication happens between DN threads.
# Cluster-wide disk information is already handled by monitoring tools, no? The 
admin gets the Ganglia alert saying some node is imbalanced, admin triggers 
intranode balancer, admin keeps looking at Ganglia to see if it's fixed. I 
don't think adding our own monitoring of the same information helps, when 
Ganglia etc. are already available, in-use, and understood by admins.
# I don't think this conflicts with the debuggability goal. The DN can dump the 
{{Info}} object (and even the {{Plan}} object) if requested, to the log or 
somewhere in a data dir. Then we can pass it into a unit test to debug. This 
unit test doesn't need to be a minicluster either, if we write the Planner 
correctly the {{Info}} should encapsulate all the state and we're just running 
an algorithm to make a {{Plan}}. The planner being inside the DN doesn't change 
this.
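
To make point 4 concrete, here's a hypothetical sketch of what such a miniclusterless test could look like: the test replays a dumped {{Info}} (here, a plain-text format I made up) through a pure planning function. Again, all names are invented, not HDFS code:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: debugging a planner entirely from a dumped Info,
// with no DN or minicluster. All names are invented for illustration.
public class PlannerDebugSketch {

  // One dumped line per volume: "<dir> <capacityBytes> <usedBytes>"
  static final class VolumeInfo {
    final String dir;
    final long capacity;
    final long used;
    VolumeInfo(String dir, long capacity, long used) {
      this.dir = dir; this.capacity = capacity; this.used = used;
    }
  }

  // Parse a dump captured from a DN log or data dir back into an Info.
  static List<VolumeInfo> parseDump(String dump) {
    List<VolumeInfo> info = new ArrayList<>();
    for (String line : dump.split("\n")) {
      if (line.isEmpty()) continue;
      String[] f = line.trim().split("\\s+");
      info.add(new VolumeInfo(f[0], Long.parseLong(f[1]), Long.parseLong(f[2])));
    }
    return info;
  }

  static double util(VolumeInfo v) { return (double) v.used / v.capacity; }

  // Pure planning step: pick the fullest and emptiest volumes; if their
  // utilizations differ by more than the threshold, move half the gap.
  static String plan(List<VolumeInfo> info, double threshold) {
    VolumeInfo fullest = info.get(0), emptiest = info.get(0);
    for (VolumeInfo v : info) {
      if (util(v) > util(fullest)) fullest = v;
      if (util(v) < util(emptiest)) emptiest = v;
    }
    double gap = util(fullest) - util(emptiest);
    if (gap <= threshold) return "no-op";
    long bytes = (long) (gap / 2 * emptiest.capacity);
    return "move " + bytes + " bytes " + fullest.dir + " -> " + emptiest.dir;
  }

  public static void main(String[] args) {
    // A dump as it might appear in a DN log, replayed offline.
    String dump = "/data/1 1000000 950000\n/data/2 1000000 50000\n";
    System.out.println(plan(parseDump(dump), 0.10));
  }
}
```

Because plan() depends only on the parsed dump, a bad decision seen in production can be reproduced in a unit test by pasting the logged Info, which is exactly the debuggability property I'm arguing the DN-resident design preserves.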

Thanks for the pointer to the mover logic; I wasn't aware we had that. I asked 
about this since the proposal doc in 4.2 says "copy block from A to B and 
verify". Adding a note that says "we use the existing moveBlockAcrossStorage 
method" is a great answer.

> Re-balance disks within a Datanode
> ----------------------------------
>
>                 Key: HDFS-1312
>                 URL: https://issues.apache.org/jira/browse/HDFS-1312
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode
>            Reporter: Travis Crawford
>            Assignee: Anu Engineer
>         Attachments: Architecture_and_testplan.pdf, disk-balancer-proposal.pdf
>
>
> Filing this issue in response to ``full disk woes`` on hdfs-user.
> Datanodes fill their storage directories unevenly, leading to situations 
> where certain disks are full while others are significantly less used. Users 
> at many different sites have experienced this issue, and HDFS administrators 
> are taking steps like:
> - Manually rebalancing blocks in storage directories
> - Decommissioning nodes & later re-adding them
> There's a tradeoff between making use of all available spindles, and filling 
> disks at the same-ish rate. Possible solutions include:
> - Weighting less-used disks heavier when placing new blocks on the datanode. 
> In write-heavy environments this will still make use of all spindles, 
> equalizing disk use over time.
> - Rebalancing blocks locally. This would help equalize disk use as disks are 
> added/replaced in older cluster nodes.
> Datanodes should actively manage their local disk so operator intervention is 
> not needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
