[jira] [Commented] (HDFS-3672) Expose disk-location information for blocks to enable better scheduling

Arun C Murthy (JIRA) Wed, 01 Aug 2012 06:19:09 -0700

    [ https://issues.apache.org/jira/browse/HDFS-3672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13426598#comment-13426598 ]


Arun C Murthy commented on HDFS-3672:
-------------------------------------

Todd - first of all, no one is *blocking* anything. 

bq. Hey Suresh. I'll try to answer a few of your questions above from the 
perspective of HBase and MR.

This jira was started with the premise that this new *feature* was useful to 
MapReduce and HBase (http://s.apache.org/NJY). So, I assumed there would be 
some work in that direction. 

If that were the case, I don't see how the suggestion to do the work in a 
dev-branch before merging to mainline is *blocking* anything. It is something 
we have done many times over: for YARN, HDFS HA, etc.

Personally, if anyone was doing this work on MR, I'd be very interested in 
collaborating, heck - *learning*. 

However, given my experience on MR, I'd classify it as high-risk, but very, 
very interesting, research, since on mid-sized clusters (a few hundred nodes) 
and beyond the scheduling overhead might more than negate the I/O gains. Hence, 
again, doing that in a dev-branch is absolutely the right thing to do from a 
project and risk management perspective.

bq. This isn't the first time an API has been added to the trunk code before 
downstream users exist. 

Yes, this wouldn't be the first time we made *that* mistake. 

Clearly, we have been dealing with the consequences of our previous mistakes 
for a while now. Arguing that *that* is a good reason to do the same, again, is 
not cogent.

bq.  As I mentioned above, we have at least one customer who would like to use 
this feature in their code to get better disk efficiency. They need to run 
against an actual release, not a dev branch build. This is the primary use case 
we're targeting right now. I want to be perfectly honest: the HBase/MR examples 
I gave above are not on our immediate roadmap; they just serve as proof that 
this isn't a one-off/niche improvement.

Now, clearly, you don't plan to do any work on either HBase or MR anytime soon 
and you have a different roadmap for a client.

If you had made that clear sooner, the conversation would be different.

Essentially, for the foreseeable future this will be *dead* code which is not 
going to be beneficial to anyone in the community... yet, the burden of 
maintenance etc. will remain.

No, that is not a big deal, since this particular change has a fairly small 
cross-section, but it might be harder to make the argument for a future, more 
extensive change of this *kind*. Clearly, if it's a plugin etc., it's easier to 
digest.

IAC, I don't wish to debate this further. 

----

Importantly, we should switch this *feature* off by default so that people who 
use it understand that it isn't necessarily supported, at least until we 
have a real use-case for it in the community.
                
> Expose disk-location information for blocks to enable better scheduling
> -----------------------------------------------------------------------
>
>                 Key: HDFS-3672
>                 URL: https://issues.apache.org/jira/browse/HDFS-3672
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 2.0.0-alpha
>            Reporter: Andrew Wang
>            Assignee: Andrew Wang
>         Attachments: hdfs-3672-1.patch, hdfs-3672-2.patch, hdfs-3672-3.patch, 
> hdfs-3672-4.patch
>
>
> Currently, HDFS exposes on which datanodes a block resides, which allows 
> clients to make scheduling decisions for locality and load balancing. 
> Extending this to also expose on which disk on a datanode a block resides 
> would enable even better scheduling, on a per-disk rather than coarse 
> per-datanode basis.
> This API would likely look similar to FileSystem#getFileBlockLocations, but 
> also involve a series of RPCs to the responsible datanodes to determine disk 
> ids.
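
To make the per-disk idea in the description above concrete, here is a rough
sketch of what a client-side scheduler might do once it knows which disk holds
each replica. This is not the API from the attached patches: the Replica class,
the integer disk id, and the "datanode/diskN" key format are all assumptions
made up for illustration, and the load map stands in for whatever bookkeeping a
real scheduler would keep.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DiskAwareScheduling {

    /** Hypothetical replica descriptor; NOT an HDFS type. */
    static final class Replica {
        final long blockId;
        final String datanode;
        final int diskId; // assumed: opaque per-datanode disk identifier

        Replica(long blockId, String datanode, int diskId) {
            this.blockId = blockId;
            this.datanode = datanode;
            this.diskId = diskId;
        }

        /** Key identifying one physical disk on one datanode (made-up format). */
        String diskKey() {
            return datanode + "/disk" + diskId;
        }
    }

    /** Groups block ids by the disk that stores them, in first-seen order. */
    static Map<String, List<Long>> groupByDisk(List<Replica> replicas) {
        Map<String, List<Long>> byDisk = new LinkedHashMap<>();
        for (Replica r : replicas) {
            byDisk.computeIfAbsent(r.diskKey(), k -> new ArrayList<>()).add(r.blockId);
        }
        return byDisk;
    }

    /**
     * Picks the candidate replica whose disk has the fewest pending reads.
     * Disks missing from the map count as idle. Returns null if there are
     * no candidates; ties go to the earliest candidate.
     */
    static Replica pickLeastLoaded(List<Replica> candidates,
                                   Map<String, Integer> pendingReads) {
        Replica best = null;
        int bestLoad = Integer.MAX_VALUE;
        for (Replica r : candidates) {
            int load = pendingReads.getOrDefault(r.diskKey(), 0);
            if (load < bestLoad) {
                best = r;
                bestLoad = load;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Replica> replicas = new ArrayList<>();
        replicas.add(new Replica(1L, "dn1", 0));
        replicas.add(new Replica(2L, "dn1", 0));
        replicas.add(new Replica(3L, "dn1", 1));
        replicas.add(new Replica(3L, "dn2", 0));

        // Which blocks share a spindle?
        System.out.println(groupByDisk(replicas));

        // Block 3 has two replicas; prefer the one on the less busy disk.
        Map<String, Integer> pendingReads = new LinkedHashMap<>();
        pendingReads.put("dn1/disk1", 4);
        Replica choice = pickLeastLoaded(replicas.subList(2, 4), pendingReads);
        System.out.println("read block 3 from " + choice.diskKey());
    }
}
```

With only datanode-level locations, blocks 1, 2 and 3 on dn1 look equivalent;
with disk ids, a scheduler could see that blocks 1 and 2 contend for the same
disk. Whether that bookkeeping pays for itself at cluster scale is exactly the
open question raised in the comment above.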

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
