https://issues.apache.org/jira/browse/HDFS-5499
On Mon, Nov 18, 2013 at 10:46 AM, Jay Vyas <jayunit...@gmail.com> wrote:

Where is the jira for this?

Sent from my iPhone

On Nov 18, 2013, at 1:25 PM, Andrew Wang <andrew.w...@cloudera.com> wrote:

Thanks for asking, here's a link:

http://www.umbrant.com/papers/socc12-cake.pdf

I don't think there's a recording of my talk, unfortunately.

I'll also copy my comments over to the JIRA, though I'd like not to distract too much from what Lohit's trying to do.

On Wed, Nov 13, 2013 at 2:54 AM, Steve Loughran <ste...@hortonworks.com> wrote:

This is interesting; I've moved my comments over to the JIRA, and it would be good for yours to go there too.

Is there a URL for your paper?

On 13 November 2013 06:27, Andrew Wang <andrew.w...@cloudera.com> wrote:

Hey Steve,

My research project (Cake, published at SoCC '12) was trying to provide SLAs for mixed workloads of latency-sensitive and throughput-bound applications, e.g. HBase running alongside MR. This was challenging because seeks are a real killer. Basically, we had to strongly limit MR I/O to keep worst-case seek latency down, and did so by putting schedulers on the RPC queues in HBase and HDFS to restrict queuing in the OS and disk, where we lacked preemption.

Regarding citations of note, most academics consider throughput sharing to be a solved problem. It's not dissimilar from normal time slicing: you try to ensure fairness over some coarse timescale. I think cgroups [1] and ioprio_set [2] essentially provide this.

Mixing throughput and latency, though, is difficult, and my conclusion is that there isn't a really great solution for spinning disks besides physical isolation. As we all know, you can get either IOPS or bandwidth, but not both, and it's not a linear tradeoff between the two. If you're interested in this, I can dig up some related work from my Cake paper.

However, since it seems that we're more concerned with throughput-bound apps, we might be okay just using cgroups and ioprio_set to do time slicing. I actually hacked up some code a while ago which passed a client-provided priority byte to the DN, which used it to set the I/O priority of the handling DataXceiver accordingly. This isn't the most outlandish idea, since we've put QoS fields in our RPC protocol, for instance; this would just be another byte. Short-circuit reads are outside this paradigm, but then you can use cgroup controls instead.

My casual conversations with Googlers indicate that there isn't any special Borg/Omega sauce either, just that they heavily prioritize DFS I/O over non-DFS. Maybe that's another approach: if we can separate block management in HDFS, MR tasks could just write their output to a raw HDFS block, thus bringing a lot of I/O back into the fold of "datanode as I/O manager" for a machine.

Overall, I strongly agree with you that it's important to first define what our goals are regarding I/O QoS. The general case is a tarpit, so it'd be good to carve off useful things that can be done now (like Lohit's direction of per-stream/FS throughput throttling with trusted clients) and then carefully grow the scope as we find more use cases we can confidently solve.

Best,
Andrew

[1] cgroups blkio controller: https://www.kernel.org/doc/Documentation/cgroups/blkio-controller.txt
[2] ioprio_set: http://man7.org/linux/man-pages/man2/ioprio_set.2.html
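As a concrete illustration of the cgroup controls Andrew refers to ([1] above), here is a minimal sketch of capping a process's read bandwidth with the cgroup v1 blkio controller. The mount point, group name, and device major:minor pair are illustrative assumptions; it needs root and a Linux kernel with the blkio controller enabled, and on cgroup v2 the equivalent knob is io.max instead.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/**
 * Sketch only: cap read bandwidth for a process via the cgroup v1 blkio
 * controller. Paths and device numbers are assumptions for illustration.
 */
public class BlkioThrottleSketch {

    // Assumes a cgroup v1 hierarchy mounted at the conventional location.
    private static final Path BLKIO_ROOT = Paths.get("/sys/fs/cgroup/blkio");

    public static void capReadBandwidth(String groupName, String majorMinor,
                                        long bytesPerSec, long pid) throws IOException {
        // Creating a directory under the blkio mount creates a child cgroup.
        Path group = BLKIO_ROOT.resolve(groupName);
        Files.createDirectories(group);

        // Format is "MAJOR:MINOR BYTES_PER_SEC", e.g. "8:0 20971520" for 20 MB/s.
        write(group.resolve("blkio.throttle.read_bps_device"),
              majorMinor + " " + bytesPerSec);

        // Move the target process into the group; cgroups only limit members.
        write(group.resolve("cgroup.procs"), Long.toString(pid));
    }

    private static void write(Path file, String value) throws IOException {
        Files.write(file, (value + "\n").getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Example: cap this JVM's reads from device 8:0 at 20 MB/s.
        capReadBandwidth("hdfs-throttle-demo", "8:0", 20L * 1024 * 1024,
                         ProcessHandle.current().pid());
    }
}

Note that this caps a whole process against a specific block device. It can police short-circuit reads or MR spill I/O, but as Steve points out below, it cannot tell one HDFS client's traffic from another's once the reads are served out of the DataNode process.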
On Tue, Nov 12, 2013 at 1:38 AM, Steve Loughran <ste...@hortonworks.com> wrote:

I've looked at it a bit within the context of YARN.

YARN containers are where this would be ideal, as then you'd be able to request I/O capacity as well as CPU and RAM. For that to work, the throttling would have to be outside the app, as you are trying to limit code whether or not it wants to be limited, and because you probably (*) want to give it more bandwidth if the system is otherwise idle. Self-throttling doesn't pick up spare I/O.

1. You can use cgroups in YARN to throttle local disk I/O through the file:// URLs or the Java filesystem APIs, such as for MR temp data.
2. You can't cgroup-throttle HDFS per YARN container, which would be the ideal use case for it. The I/O is taking place in the DN, and cgroups only limit I/O in the throttled process group.
3. Implementing it in the DN would require a lot more complex code there to prioritise work based on block ID (the sole identifier that goes around everywhere) or input source (local sockets for HBase I/O vs. the TCP stack).
4. Once you go to a heterogeneous filesystem you need to think about I/O load per storage layer as well as/alongside per-volume.
5. There's also a generic RPC request throttle to prevent DoS against the NN and other HDFS services. That would need to be server side, but once implemented in the RPC code it would be universal.

You also need to define what load you are trying to throttle: pure RPCs/second, read bandwidth, write bandwidth, seeks, or IOPS. Once a file is lined up for sequential reading, you'd almost want it to stream through the next blocks until a high-priority request came through, but operations like a seek which would involve a backwards disk head movement would be something to throttle (hence you need to be storage-type aware, as SSD seeks cost less). You also need to consider that although the cost of writes is high, it's usually being done with the goal of preserving data, and you don't want to impact durability.

(*) Probably, because that's one of the issues that causes debates in other datacentre platforms, such as Google Omega: do you want max cluster utilisation or max determinism of workload?

If someone were to do IOP throttling in the 3.x+ timeline:

1. It needs clear use cases, YARN containers being #1 for me.
2. We'd have to look at all the research done on this in the past to see what works and what doesn't.

Andrew, what citations of relevance do you have?

-steve
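The ioprio_set mechanism in [2] can be exercised without JNI by shelling out to the util-linux ionice tool; a rough sketch, assuming ionice is on the PATH and the disks sit behind an I/O scheduler that honours priorities (CFQ/BFQ):

import java.io.IOException;

/**
 * Sketch only: demote a process's I/O priority via the ionice CLI, which wraps
 * the ioprio_set syscall referenced in [2].
 */
public class IoPrioritySketch {

    /** ionice classes: 1 = realtime, 2 = best-effort, 3 = idle; level is 0-7. */
    public static void setIoClass(long pid, int ioClass, int level)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(
                "ionice", "-c", Integer.toString(ioClass),
                "-n", Integer.toString(level),
                "-p", Long.toString(pid))
                .inheritIO()
                .start();
        if (p.waitFor() != 0) {
            throw new IOException("ionice exited with " + p.exitValue());
        }
    }

    public static void main(String[] args) throws Exception {
        // Example: drop this JVM (say, an MR task) to best-effort, lowest priority.
        setIoClass(ProcessHandle.current().pid(), 2, 7);
    }
}

This only demotes a whole process, e.g. an entire MR task JVM; per-thread priorities of the kind Andrew describes above (setting the priority of the handling DataXceiver) would need the raw syscall, or an equivalent native call, issued from inside the DataNode.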
On 12 November 2013 04:24, lohit <lohit.vijayar...@gmail.com> wrote:

2013/11/11 Andrew Wang <andrew.w...@cloudera.com>

> Hey Lohit,
>
> This is an interesting topic, and something I actually worked on in grad school before coming to Cloudera. It'd help if you could outline some of your use cases and how per-FileSystem throttling would help. For what I was doing, it made more sense to throttle on the DN side, since you have a better view over all the I/O happening on the system, and you have knowledge of the different volumes, so you can set limits per disk. This still isn't 100% reliable, though, since normally a portion of each disk is used for MR scratch space, which the DN doesn't have control over. I tried playing with thread I/O priorities here, but didn't see much improvement. Maybe the newer cgroups stuff can help out.

Thanks. Yes, we also thought about having something on the DataNode. That would also mean one could easily throttle clients who access the cluster from outside, for example distcp or hftp copies. Clients would not need to worry about throttle configs, and each cluster could control how much throughput can be achieved. We do want to have something like this.

> I'm sure per-FileSystem throttling will have some benefits (and probably be easier than some DN-side implementation), but again, it'd help to better understand the problem you are trying to solve.

One idea was flexibility for clients to override and set their own value. On a trusted cluster we could allow clients to go beyond the default value for some use cases. Alternatively, we also thought about having a default value and a max value, where clients could change the default but not go beyond the max. Another issue with a DN-side config is how to have different values for different clients and easily change them for selected clients.

As Haosong also suggested, we could wrap FSDataOutputStream/FSDataInputStream with a ThrottledInputStream. But we might have to be careful of any code which uses the FileSystem APIs and accidentally throttles itself (like reducer copy, DistributedCache and such...).

> Best,
> Andrew

On Mon, Nov 11, 2013 at 6:16 PM, Haosong Huang <haosd...@gmail.com> wrote:

Hi, lohit. There is a class named ThrottledInputStream (http://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/util/ThrottledInputStream.java) in hadoop-distcp; you could check it out and find more details.

In addition to this, I am working on achieving resource control (including CPU, network, and disk I/O) in the JVM, but my implementation depends on cgroups, which only runs on Linux. I will push my library (java-cgroup) to GitHub in the next several months. If you are interested in it, please give me advice and help me improve it. :-)
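Both the hadoop-distcp ThrottledInputStream Haosong points to and the stream wrapping lohit describes come down to the same idea: meter bytes as they pass through the stream and sleep whenever the observed average rate exceeds a cap. A minimal, self-contained sketch (this is not the distcp class; the name and details are illustrative only):

import java.io.IOException;
import java.io.InputStream;

/**
 * Sketch only: cap the read rate of any InputStream (for example the stream
 * returned by FileSystem.open()) at a fixed number of bytes per second.
 */
public class RateLimitedInputStream extends InputStream {

    private final InputStream in;
    private final long maxBytesPerSec;
    private final long startMs = System.currentTimeMillis();
    private long bytesRead = 0;

    public RateLimitedInputStream(InputStream in, long maxBytesPerSec) {
        this.in = in;
        this.maxBytesPerSec = maxBytesPerSec;
    }

    /** Sleep until the average rate since the stream was opened is under the cap. */
    private void throttle() throws IOException {
        long elapsedMs = Math.max(1, System.currentTimeMillis() - startMs);
        while ((bytesRead * 1000.0) / elapsedMs > maxBytesPerSec) {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IOException("interrupted while throttling", e);
            }
            elapsedMs = Math.max(1, System.currentTimeMillis() - startMs);
        }
    }

    @Override
    public int read() throws IOException {
        throttle();
        int b = in.read();
        if (b >= 0) {
            bytesRead++;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        throttle();
        int n = in.read(buf, off, len);
        if (n > 0) {
            bytesRead += n;
        }
        return n;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}

Wrapping the stream returned by fs.open() in something like this gives the per-stream behaviour lohit sketches below; it also makes his caveat concrete, since any framework code that opens files through the same FileSystem handle (reducer copies, DistributedCache) would pick up the same cap unless explicitly exempted.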
On Tue, Nov 12, 2013 at 3:47 AM, lohit <lohit.vijayar...@gmail.com> wrote:

Hi Adam,

Thanks for the reply. The changes I was referring to are in the FileSystem.java layer, which should not affect HDFS replication or NameNode operations. To give a better idea, this would affect clients something like this:

Configuration conf = new Configuration();
conf.setInt("read.bandwidth.mbpersec", 20);  // 20 MB/s
FileSystem fs = FileSystem.get(conf);

FSDataInputStream fis = fs.open(new Path("/path/to/file.txt"));
fis.read();  // <-- reads capped at a maximum of 20 MB/s

2013/11/11 Adam Muise <amu...@hortonworks.com>

> See https://issues.apache.org/jira/browse/HDFS-3475
>
> Please note that this has met with many unexpected impacts on workload. Be careful and be mindful of your DataNode memory and network capacity.

On Mon, Nov 11, 2013 at 1:59 PM, lohit <lohit.vijayar...@gmail.com> wrote:

Hello Devs,

I wanted to reach out and see if anyone has thought about the ability to throttle data transfer within HDFS. One option we have been thinking about is to throttle on a per-FileSystem basis, similar to Statistics in FileSystem. This would mean anyone with a handle to HDFS/Hftp would be throttled globally within the JVM. The right value for this would depend on the type of hardware we use and how many tasks/clients we allow.

On the other hand, doing something like this at the FileSystem layer would mean many other tasks such as job jar copies, DistributedCache copies, and any hidden data movement would also be throttled. We wanted to know if anyone has had such a requirement on their clusters in the past and what the thinking around it was. Appreciate your inputs/comments.

--
Have a Nice Day!
Lohit
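To make the "per FileSystem, global within the JVM" idea above concrete: one possible shape, by analogy with FileSystem.Statistics, is a shared per-scheme byte budget that every throttled stream in the process charges its reads against. Everything here (the names, and the idea of registering the budget from a hook in FileSystem.get()) is hypothetical and not existing Hadoop code:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/**
 * Sketch only: a JVM-global read budget per FileSystem scheme ("hdfs",
 * "hftp", ...), loosely analogous to FileSystem.Statistics.
 */
public class PerSchemeThrottle {

    /** Shared state per scheme: a cap plus the bytes charged so far. */
    static final class Budget {
        final long maxBytesPerSec;
        final long startMs = System.currentTimeMillis();
        long bytes;  // guarded by the Budget monitor

        Budget(long maxBytesPerSec) {
            this.maxBytesPerSec = maxBytesPerSec;
        }

        /** Charge n bytes, then block while the average rate exceeds the cap. */
        synchronized void acquire(int n) throws InterruptedException {
            bytes += n;
            long elapsedMs = Math.max(1, System.currentTimeMillis() - startMs);
            while ((bytes * 1000.0) / elapsedMs > maxBytesPerSec) {
                wait(50);  // releases the monitor so other streams can charge too
                elapsedMs = Math.max(1, System.currentTimeMillis() - startMs);
            }
        }
    }

    private static final ConcurrentMap<String, Budget> BUDGETS = new ConcurrentHashMap<>();

    /** Called once per scheme, e.g. from a (hypothetical) hook in FileSystem.get(). */
    public static void register(String scheme, long maxBytesPerSec) {
        BUDGETS.putIfAbsent(scheme, new Budget(maxBytesPerSec));
    }

    /** Read paths call this after each read; unregistered schemes stay unthrottled. */
    public static void charge(String scheme, int bytesJustRead) throws InterruptedException {
        Budget b = BUDGETS.get(scheme);
        if (b != null && bytesJustRead > 0) {
            b.acquire(bytesJustRead);
        }
    }
}

A stream for an HDFS path would call PerSchemeThrottle.charge("hdfs", n) after each read, so distcp copies, job jar and DistributedCache transfers, and user reads would all share the one budget; that is exactly the trade-off raised in this thread, since the cap is easy to enforce globally but framework I/O gets throttled along with everything else unless it is deliberately exempted.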