[ https://issues.apache.org/jira/browse/HDFS-3672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419575#comment-13419575 ]

Todd Lipcon commented on HDFS-3672:
-----------------------------------

Hey Suresh. I'll try to answer a few of your questions above from the 
perspective of HBase and MR.

The information is useful when a client has several tasks to complete that 
involve reading blocks on a given DataNode, but the order of the tasks doesn't 
matter. One example is in HBase: we currently have several compaction threads 
running inside the region server, and those compaction threads do a lot of IO. 
HBase could do a better job of scheduling the compactions if it knew which 
blocks were actually on the same underlying disk. If two blocks are on 
separate disks, you can get 2x the throughput by reading them at the same time, 
whereas if they're on the same disk, it would be better to schedule them one 
after the other.
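To make the scheduling idea concrete, here's a minimal sketch (hypothetical, not HBase code): given pending reads grouped by an opaque disk ID like the one this API would expose, a round-robin pass over the per-disk queues picks at most one task per disk at a time, so concurrent reads land on distinct spindles.

```java
import java.util.*;

// Hypothetical sketch of disk-aware scheduling. The String disk keys
// stand in for the opaque disk IDs the proposed API would return.
public class DiskAwareScheduler {

    // Round-robin across per-disk queues: each pass emits at most one
    // task per disk, so tasks run back-to-back on the same disk only
    // after every other disk has had a turn.
    public static List<String> schedule(Map<String, List<String>> tasksByDisk) {
        Map<String, Deque<String>> queues = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : tasksByDisk.entrySet()) {
            queues.put(e.getKey(), new ArrayDeque<>(e.getValue()));
        }
        List<String> order = new ArrayList<>();
        boolean progress = true;
        while (progress) {
            progress = false;
            for (Deque<String> q : queues.values()) {
                if (!q.isEmpty()) {
                    order.add(q.poll());
                    progress = true;
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> byDisk = new LinkedHashMap<>();
        byDisk.put("disk0", Arrays.asList("blk1", "blk2"));
        byDisk.put("disk1", Arrays.asList("blk3"));
        // Interleaves disks: blk1 and blk3 can proceed in parallel.
        System.out.println(schedule(byDisk));
    }
}
```

Without the disk IDs, a scheduler can only guess, and two compaction threads may end up hammering the same disk.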

You can imagine this feature also being used at some point by MapReduce. 
Consider a map-only job which reads hundreds of blocks located on the same DN. 
When the associated NodeManager asks for a task to run, the application master 
can look at the already-running tasks on that node, understand which disks are 
currently not being read, and schedule a task which accesses an idle disk. 
Another MR use case is to keep track of which local disks the various tasks are 
reading from, and de-prioritize those disks when choosing a local disk to 
spill map output to, avoiding read-write contention.
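The spill-placement half of that could be as simple as the following sketch (illustrative, not MR code): given a count of currently-running readers per local disk, spill to the least-read one.

```java
import java.util.*;

// Hypothetical sketch: pick the local disk with the fewest active
// readers as the spill target, to keep map-output writes off the
// disks that running tasks are reading from.
public class SpillPlacement {

    public static String pickSpillDisk(Map<String, Integer> activeReadersByDisk) {
        String best = null;
        int fewest = Integer.MAX_VALUE;
        for (Map.Entry<String, Integer> e : activeReadersByDisk.entrySet()) {
            if (e.getValue() < fewest) {
                fewest = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Integer> load = new LinkedHashMap<>();
        load.put("/data/1", 3);  // three tasks reading this disk
        load.put("/data/2", 0);  // idle
        load.put("/data/3", 2);
        System.out.println(pickSpillDisk(load));  // the idle disk wins
    }
}
```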

The other motivation is to eventually correlate these disk IDs with 
statistics/metrics within advanced clients. In HBase, for example, we currently 
always read from the local replica if it is available. If, however, one of the 
local disks is going bad, this can really impact latency, and we'd rather read 
a remote replica instead - the network latency is much less than the cost of 
accessing failing media. But we need to be able to look at a block and know 
which disk it's on in order to track these statistics.
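One way such per-disk statistics could be kept, sketched here with assumed names and an assumed threshold (nothing like this exists in HBase or HDFS today): an exponentially weighted moving average of read latency per disk ID, consulted before choosing between the local and a remote replica.

```java
import java.util.*;

// Hypothetical sketch of per-disk latency tracking. ALPHA and
// SLOW_DISK_MS are illustrative values, not from any real config.
public class DiskLatencyTracker {
    private final Map<String, Double> ewmaMillis = new HashMap<>();
    private static final double ALPHA = 0.2;          // EWMA smoothing factor
    private static final double SLOW_DISK_MS = 50.0;  // assumed threshold

    // Fold each observed read latency into the disk's moving average.
    public void recordRead(String diskId, double latencyMillis) {
        ewmaMillis.merge(diskId, latencyMillis,
                (old, cur) -> old + ALPHA * (cur - old));
    }

    // Prefer a remote replica once the local disk looks persistently slow.
    public boolean preferRemote(String localDiskId) {
        return ewmaMillis.getOrDefault(localDiskId, 0.0) > SLOW_DISK_MS;
    }
}
```

The key point is that none of this bookkeeping is possible unless the client can map a block to a disk ID in the first place.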

The overall guiding motivation is that we looked at heavily loaded clusters 
with 12 disks and found that we were suffering from pretty significant 
"hotspotting" of disk access. During any given second, about two thirds of the 
disks tend to be at 100% utilization while the others are basically idle. Using 
lsof to look at the number of open blocks on each data volume showed the same 
hotspotting: some disks had multiple tasks reading data whereas others had 
none. With a bit more client visibility into block<->disk correspondence, we 
can try to improve this.

bq. As regards NN knowing about this information, that is one of the 
motivations of HDFS-2832. If each storage volume that corresponds to a disk on 
Datanode has a separate storage ID, NN gets block reports and other stats per 
disk.

I agree HDFS-2832 will really be useful for this. But it's a larger 
restructuring with much bigger implications. This JIRA is just about adding a 
new API which exposes some information that's already available. We explicitly 
chose to make the "disk ID" opaque in the proposed API -- that way when 
HDFS-2832 arrives, we can really easily switch over the _implementation_ to be 
based on the storage IDs without breaking users of the API.
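As a sketch of what "opaque" means here (assumed shape, not the actual patch): clients would only be able to compare disk IDs for equality and use them as map keys, so the bytes inside could later become HDFS-2832 storage IDs with no client-visible change.

```java
import java.util.Arrays;

// Hypothetical sketch of an opaque disk ID. Clients get equals()
// and hashCode() only; the meaning of the bytes is private to HDFS
// and free to change (e.g. to HDFS-2832 storage IDs) later.
public final class VolumeId {
    private final byte[] id;  // interpretation is an HDFS-internal detail

    public VolumeId(byte[] id) {
        this.id = id.clone();
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof VolumeId && Arrays.equals(id, ((VolumeId) o).id);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(id);
    }
}
```

All the scheduling uses above only need "same disk or not", which is exactly what an equality-only opaque ID provides.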
                
> Expose disk-location information for blocks to enable better scheduling
> -----------------------------------------------------------------------
>
>                 Key: HDFS-3672
>                 URL: https://issues.apache.org/jira/browse/HDFS-3672
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 2.0.0-alpha
>            Reporter: Andrew Wang
>            Assignee: Andrew Wang
>         Attachments: hdfs-3672-1.patch
>
>
> Currently, HDFS exposes on which datanodes a block resides, which allows 
> clients to make scheduling decisions for locality and load balancing. 
> Extending this to also expose on which disk on a datanode a block resides 
> would enable even better scheduling, on a per-disk rather than coarse 
> per-datanode basis.
> This API would likely look similar to Filesystem#getFileBlockLocations, but 
> also involve a series of RPCs to the responsible datanodes to determine disk 
> ids.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
