[ https://issues.apache.org/jira/browse/HDFS-5499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13825591#comment-13825591 ]

Andrew Wang commented on HDFS-5499:
-----------------------------------

Cross-posting my comments from the [mailing-list 
thread|http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-dev/201311.mbox/%3CCA%2B01ahhyyv6K4sQBBfQqFYPnq25_ZYUr2LaCxmC-%3DpcetUxH8A%40mail.gmail.com%3E]
 here just in case. This was in reply to Steve's above comment.

========

My research project (Cake, published at SoCC '12) was trying to provide SLAs 
for mixed workloads of latency-sensitive and throughput-bound applications, 
e.g. HBase running alongside MR. This was challenging because seeks are a real 
killer. Basically, we had to strongly limit MR I/O to keep worst-case seek 
latency down, and we did so by putting schedulers on the RPC queues in HBase 
and HDFS to restrict queuing in the OS and disk, where we lacked preemption.
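
For concreteness, here's a toy sketch of what that RPC-level scheduling looks 
like (my names, nothing from Cake's actual code): latency-sensitive calls 
always go first, and a semaphore caps how many throughput-bound calls are in 
flight at once, which is our stand-in for the preemption we lack once requests 
hit the OS and disk.

{code:java}
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Semaphore;

// Toy sketch, illustrative names only.
public class TwoClassRpcScheduler {
  private final Queue<Runnable> latencyQueue = new ConcurrentLinkedQueue<>();
  private final Queue<Runnable> throughputQueue = new ConcurrentLinkedQueue<>();
  private final Semaphore throughputSlots; // bounds outstanding bulk I/O

  public TwoClassRpcScheduler(int maxOutstandingThroughput) {
    throughputSlots = new Semaphore(maxOutstandingThroughput);
  }

  public void submitLatencySensitive(Runnable call) { latencyQueue.add(call); }
  public void submitThroughputBound(Runnable call) { throughputQueue.add(call); }

  /** Run by each handler thread; all handlers share the two queues. */
  public void handlerLoop() throws InterruptedException {
    while (!Thread.currentThread().isInterrupted()) {
      Runnable call = latencyQueue.poll();
      if (call != null) {
        call.run(); // latency-sensitive work always wins
      } else if (!throughputQueue.isEmpty() && throughputSlots.tryAcquire()) {
        try {
          Runnable bulk = throughputQueue.poll();
          if (bulk != null) bulk.run(); // admitted under the cap
        } finally {
          throughputSlots.release();
        }
      } else {
        Thread.sleep(1); // idle, or the throughput cap is exhausted
      }
    }
  }
}
{code}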

Regarding citations of note, most academics consider throughput-sharing a 
solved problem. It's not dissimilar from normal time slicing: you try to ensure 
fairness over some coarse timescale. I think cgroups and ioprio_set essentially 
provide this.

Mixing throughput and latency, though, is difficult, and my conclusion is that 
there isn't a really great solution for spinning disks besides physical 
isolation. As we all know, you can get either IOPS or bandwidth, but not both, 
and it's not a linear tradeoff between the two. If you're interested in this 
though, I can dig up some related work from my Cake paper.
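
To put rough numbers on it (ballpark figures, not from the paper): a 7200rpm 
disk does on the order of 100 random IOPS, so 100 random 4KB reads move about 
0.4MB/s, while the same spindle streams sequentially at 100MB/s or more. Every 
seek you grant a latency-sensitive client costs roughly a megabyte of forgone 
sequential bandwidth, which is why the tradeoff curve bends so sharply.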

However, since it seems that we're more concerned with throughput-bound apps, 
we might be okay just using cgroups and ioprio_set to do time-slicing. I 
actually hacked up some code a while ago which passed a client-provided 
priority byte to the DN, which used it to set the I/O priority of the handling 
DataXceiver accordingly. This isn't the most outlandish idea: we've already put 
QoS fields in our RPC protocol, for instance, and this would just be another 
byte. Short-circuit reads are outside this paradigm, but for those you can use 
cgroup controls instead.
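
Here's a minimal sketch of the DN side of that hack, assuming Linux x86-64 and 
the JNA library; the constants come from linux/ioprio.h, but the method names 
and how the byte arrives are made up for illustration.

{code:java}
import com.sun.jna.Library;
import com.sun.jna.Native;

public class XceiverIoPriority {
  private interface LibC extends Library {
    LibC INSTANCE = Native.load("c", LibC.class);
    int syscall(int number, Object... args); // variadic libc syscall(2)
  }

  private static final int SYS_IOPRIO_SET = 251;   // x86-64 syscall number
  private static final int IOPRIO_WHO_PROCESS = 1; // target a single thread
  private static final int IOPRIO_CLASS_BE = 2;    // best-effort class
  private static final int IOPRIO_CLASS_SHIFT = 13;

  /**
   * Hypothetical hook: called by the handling DataXceiver thread after it
   * reads the client-provided priority byte from the request header.
   */
  static void applyClientPriority(byte priority) {
    int level = Math.max(0, Math.min(priority, 7)); // BE levels are 0-7
    int ioprio = (IOPRIO_CLASS_BE << IOPRIO_CLASS_SHIFT) | level;
    // who == 0 means "the calling thread"; Java threads are 1:1 with native
    // threads, so this tags exactly this xceiver's subsequent I/O.
    LibC.INSTANCE.syscall(SYS_IOPRIO_SET, IOPRIO_WHO_PROCESS, 0, ioprio);
  }
}
{code}

One caveat worth remembering: ioprio only bites under the CFQ scheduler, and 
buffered writes lose the tag once kernel writeback threads take over, which is 
part of why cgroup controls keep coming up as the complement.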

My casual conversations with Googlers indicate that there isn't any special 
Borg/Omega sauce either, just that they heavily prioritize DFS I/O over 
non-DFS. Maybe that's another approach: if we can separate block management in 
HDFS, MR tasks could just write their output to a raw HDFS block, thus bringing 
a lot of I/O back into the fold of "datanode as I/O manager" for a machine.

Overall, I strongly agree with you that it's important to first define what our 
goals are regarding I/O QoS. The general case is a tarpit, so it'd be good to 
carve off useful things that can be done now (like Lohit's direction of 
per-stream/FS throughput throttling with trusted clients) and then carefully 
grow the scope as we find more use cases we can confidently solve.
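
To make that last direction concrete, a per-stream throttle is basically a 
token bucket over byte counts. A minimal sketch (my names, not what's in 
HDFS-5499.1.patch): each read/write reports its byte count, and the caller 
blocks once the current period's allowance is spent.

{code:java}
public class BandwidthThrottler {
  private final long periodMillis = 500; // accounting granularity
  private final long bytesPerPeriod;
  private long periodStart = System.currentTimeMillis();
  private long bytesUsed;

  public BandwidthThrottler(long bytesPerSecond) {
    this.bytesPerPeriod = Math.max(1, bytesPerSecond * periodMillis / 1000);
  }

  /** Call with the byte count of each read/write; blocks when over budget. */
  public synchronized void throttle(long numBytes) throws InterruptedException {
    bytesUsed += numBytes;
    while (bytesUsed >= bytesPerPeriod) {
      long elapsed = System.currentTimeMillis() - periodStart;
      if (elapsed >= periodMillis) {
        bytesUsed -= bytesPerPeriod; // open a new period, keep the overshoot
        periodStart += periodMillis;
      } else {
        wait(periodMillis - elapsed); // sleep out the rest of the period
      }
    }
  }
}
{code}

A stream wrapper would call throttle(n) after each n-byte transfer; the open 
question in the issue description (hidden transfers like jar localization 
getting throttled too) is exactly a question of where those calls get wired in.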



> Provide way to throttle per FileSystem read/write bandwidth
> -----------------------------------------------------------
>
>                 Key: HDFS-5499
>                 URL: https://issues.apache.org/jira/browse/HDFS-5499
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Lohit Vijayarenu
>         Attachments: HDFS-5499.1.patch
>
>
> In some cases it might be worthwhile to throttle read/write bandwidth on a 
> per-JVM basis so that clients do not spawn too many threads and start data 
> movement that causes other JVMs to starve. The ability to throttle read/write 
> bandwidth per FileSystem would help avoid such issues. 
> The challenge seems to be how well this can fit into the FileSystem code. If 
> one enables throttling around the FileSystem APIs, then any hidden data 
> transfers within the cluster that use them might also be affected, e.g. 
> copying the job jar during job submission, localizing resources for the 
> distributed cache, and so on.



