[ https://issues.apache.org/jira/browse/HDFS-3672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419575#comment-13419575 ]
Todd Lipcon commented on HDFS-3672:
-----------------------------------

Hey Suresh. I'll try to answer a few of your questions above from the perspective of HBase and MR.

The information is useful for clients when they have several tasks to complete that involve reading blocks on a given DataNode, but the order of the tasks doesn't matter. One example is in HBase: we currently have several compaction threads running inside the region server, and those compaction threads do a lot of IO. HBase could do a better job of scheduling the compactions if it knew which blocks were actually on the same underlying disk. If two blocks are on separate disks, you can get 2x the throughput by reading them at the same time, whereas if they're on the same disk, it would be better to schedule the reads one after the other.

You can imagine this feature also being used at some point by MapReduce. Consider a map-only job which reads hundreds of blocks located on the same DN. When the associated NodeManager asks for a task to run, the application master can look at the already-running tasks on that node, see which disks are not currently being read from, and schedule a task that accesses an idle disk. Another MR use case is to keep track of which local disks the various tasks are reading from, and de-prioritize those disks when choosing where to spill map output, to avoid read-write contention.

The other motivation is to eventually correlate these disk IDs with statistics/metrics within advanced clients. In HBase, for example, we currently always read from the local replica if it is available. If, however, one of the local disks is going bad, this can really impact latency, and we'd rather read a remote replica instead - the network latency is much less than the cost of accessing failing media. But we need to be able to look at a block and know which disk it's on in order to track these statistics.

The overall guiding motivation is that we looked at heavily loaded clusters with 12 disks and found that we were suffering from pretty significant "hotspotting" of disk access. During any given second, about two thirds of the disks tend to be at 100% utilization while the others are basically idle. Using lsof to look at the number of open blocks on each data volume showed the same hotspotting: some disks had multiple tasks reading data whereas others had none. With a bit more client visibility into the block<->disk correspondence, we can try to improve this.

bq. As regards NN knowing about this information, that is one of the motivations of HDFS-2832. If each storage volume that corresponds to a disk on Datanode has a separate storage ID, NN gets block reports and other stats per disk.

I agree HDFS-2832 will really be useful for this, but it's a larger restructuring with much bigger implications. This JIRA is just about adding a new API that exposes some information that's already available. We explicitly chose to make the "disk ID" opaque in the proposed API -- that way, when HDFS-2832 arrives, we can easily switch the _implementation_ over to be based on the storage IDs without breaking users of the API.
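To make the opaque-ID idea concrete, here is a minimal client-side sketch of how a scheduler such as HBase's compaction threads might consume such IDs. The VolumeId, BlockDiskLocation, and groupByDisk names are illustrative inventions, not the API from hdfs-3672-1.patch; the only property assumed is that the identifier returned for a block replica compares equal for replicas stored on the same disk.

{code:java}
import java.util.*;

/**
 * Toy sketch (not from the patch) of a disk-aware client scheduler.
 * VolumeId and BlockDiskLocation are hypothetical stand-ins for whatever
 * the final API exposes; clients only compare the IDs for equality.
 */
public class DiskAwareScheduler {

  /** Opaque disk identifier; meaningful only for equality comparison. */
  static final class VolumeId {
    private final byte[] id;
    VolumeId(byte[] id) { this.id = id.clone(); }
    @Override public boolean equals(Object o) {
      return o instanceof VolumeId && Arrays.equals(id, ((VolumeId) o).id);
    }
    @Override public int hashCode() { return Arrays.hashCode(id); }
  }

  /** A block replica on the local datanode, tagged with its disk. */
  static final class BlockDiskLocation {
    final String blockName;
    final VolumeId volume;
    BlockDiskLocation(String blockName, VolumeId volume) {
      this.blockName = blockName;
      this.volume = volume;
    }
  }

  /**
   * Group pending reads by disk. Blocks in different groups can be read
   * concurrently (separate spindles); blocks within one group should be
   * read back to back to avoid seek contention on a single disk.
   */
  static Map<VolumeId, List<BlockDiskLocation>> groupByDisk(
      List<BlockDiskLocation> pending) {
    Map<VolumeId, List<BlockDiskLocation>> byDisk = new HashMap<>();
    for (BlockDiskLocation b : pending) {
      byDisk.computeIfAbsent(b.volume, k -> new ArrayList<>()).add(b);
    }
    return byDisk;
  }

  public static void main(String[] args) {
    VolumeId disk0 = new VolumeId(new byte[] {0});
    VolumeId disk1 = new VolumeId(new byte[] {1});
    List<BlockDiskLocation> pending = Arrays.asList(
        new BlockDiskLocation("blk_1", disk0),
        new BlockDiskLocation("blk_2", disk1),
        new BlockDiskLocation("blk_3", disk0));
    // blk_1 and blk_3 land in the same group (same disk) and would be
    // scheduled one after the other; blk_2 can run in parallel with either.
    groupByDisk(pending).forEach((disk, blocks) ->
        System.out.println("disk " + disk.hashCode() + " -> "
            + blocks.size() + " block(s)"));
  }
}
{code}

With that grouping, reads within one group are drained serially while different groups proceed in parallel, which is the 2x-throughput-versus-seek-contention trade-off described above.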
> Expose disk-location information for blocks to enable better scheduling
> -------------------------------------------------------------------------
>
>                 Key: HDFS-3672
>                 URL: https://issues.apache.org/jira/browse/HDFS-3672
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 2.0.0-alpha
>            Reporter: Andrew Wang
>            Assignee: Andrew Wang
>         Attachments: hdfs-3672-1.patch
>
>
> Currently, HDFS exposes on which datanodes a block resides, which allows
> clients to make scheduling decisions for locality and load balancing.
> Extending this to also expose on which disk on a datanode a block resides
> would enable even better scheduling, on a per-disk rather than coarse
> per-datanode basis.
> This API would likely look similar to FileSystem#getFileBlockLocations, but
> also involve a series of RPCs to the responsible datanodes to determine disk
> ids.
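As a rough illustration of the shape described in the issue above (not the actual patch), a client-facing call might mirror FileSystem#getFileBlockLocations while attaching an opaque disk identifier to each replica. All names in the sketch below are hypothetical.

{code:java}
import java.io.IOException;
import java.util.List;

/**
 * Hypothetical client-facing shape for the proposed call (illustrative names,
 * not taken from hdfs-3672-1.patch). It mirrors FileSystem#getFileBlockLocations
 * but adds an opaque per-replica disk identifier; populating that field would
 * require an RPC to each datanode hosting a replica of the requested range.
 */
interface DiskLocatedFileSystem {

  /** One replica of one block, plus the opaque ID of the disk holding it. */
  final class ReplicaDiskLocation {
    public final String datanodeHost;   // datanode that holds the replica
    public final byte[] opaqueVolumeId; // compare-only disk identifier
    public ReplicaDiskLocation(String datanodeHost, byte[] opaqueVolumeId) {
      this.datanodeHost = datanodeHost;
      this.opaqueVolumeId = opaqueVolumeId;
    }
  }

  /**
   * Analogue of getFileBlockLocations(path, start, len): the outer list has
   * one entry per block in the byte range, the inner list one per replica.
   */
  List<List<ReplicaDiskLocation>> getFileBlockDiskLocations(
      String path, long start, long len) throws IOException;
}
{code}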