https://issues.apache.org/jira/browse/HDFS-5499
On Mon, Nov 18, 2013 at 10:46 AM, Jay Vyas <jayunit...@gmail.com> wrote:

Where is the jira for this?

Sent from my iPhone

On Nov 18, 2013, at 1:25 PM, Andrew Wang <andrew.w...@cloudera.com> wrote:

Thanks for asking, here's a link:

http://www.umbrant.com/papers/socc12-cake.pdf

I don't think there's a recording of my talk, unfortunately.

I'll also copy my comments over to the JIRA, though I'd like not to distract too much from what Lohit's trying to do.

On Wed, Nov 13, 2013 at 2:54 AM, Steve Loughran <ste...@hortonworks.com> wrote:

This is interesting; I've moved my comments over to the JIRA, and it would be good for yours to go there too.

Is there a URL for your paper?

On 13 November 2013 06:27, Andrew Wang <andrew.w...@cloudera.com> wrote:

Hey Steve,

My research project (Cake, published at SoCC '12) was trying to provide SLAs for mixed workloads of latency-sensitive and throughput-bound applications, e.g. HBase running alongside MR. This was challenging because seeks are a real killer. Basically, we had to strongly limit MR I/O to keep worst-case seek latency down, and did so by putting schedulers on the RPC queues in HBase and HDFS to restrict queuing in the OS and disk, where we lacked preemption.

Regarding citations of note, most academics consider throughput sharing to be a solved problem. It's not dissimilar from normal time slicing: you try to ensure fairness over some coarse timescale. I think cgroups [1] and ioprio_set [2] essentially provide this.

Mixing throughput and latency, though, is difficult, and my conclusion is that there isn't a really great solution for spinning disks besides physical isolation. As we all know, you can get either IOPS or bandwidth, but not both, and it's not a linear tradeoff between the two. If you're interested in this, I can dig up some related work from my Cake paper.

However, since it seems that we're more concerned with throughput-bound apps, we might be okay just using cgroups and ioprio_set to do time slicing. I actually hacked up some code a while ago which passed a client-provided priority byte to the DN, which used it to set the I/O priority of the handling DataXceiver accordingly. This isn't the most outlandish idea, since we've put QoS fields in our RPC protocol, for instance; this would just be another byte. Short-circuit reads are outside this paradigm, but then you can use cgroup controls instead.

My casual conversations with Googlers indicate that there isn't any special Borg/Omega sauce either, just that they heavily prioritize DFS I/O over non-DFS. Maybe that's another approach: if we can separate block management in HDFS, MR tasks could just write their output to a raw HDFS block, thus bringing a lot of I/O back into the fold of "datanode as I/O manager" for a machine.

Overall, I strongly agree with you that it's important to first define what our goals are regarding I/O QoS. The general case is a tarpit, so it'd be good to carve off useful things that can be done now (like Lohit's direction of per-stream/FS throughput throttling with trusted clients) and then carefully grow the scope as we find more use cases we can confidently solve.

Best,
Andrew

[1] cgroups blkio controller: https://www.kernel.org/doc/Documentation/cgroups/blkio-controller.txt
[2] ioprio_set: http://man7.org/linux/man-pages/man2/ioprio_set.2.html
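As a concrete illustration of the cgroup controls Andrew refers to ([1] above), here is a minimal sketch of capping a process's read bandwidth with the cgroup v1 blkio controller. The mount point, group name, and device major:minor pair are illustrative assumptions; it needs root and a Linux kernel with the blkio controller enabled, and on cgroup v2 the equivalent knob is io.max instead.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/**
 * Sketch only: cap read bandwidth for a process via the cgroup v1 blkio
 * controller. Paths and device numbers are assumptions for illustration.
 */
public class BlkioThrottleSketch {

    // Assumes a cgroup v1 hierarchy mounted at the conventional location.
    private static final Path BLKIO_ROOT = Paths.get("/sys/fs/cgroup/blkio");

    public static void capReadBandwidth(String groupName, String majorMinor,
                                        long bytesPerSec, long pid) throws IOException {
        // Creating a directory under the blkio mount creates a child cgroup.
        Path group = BLKIO_ROOT.resolve(groupName);
        Files.createDirectories(group);

        // Format is "MAJOR:MINOR BYTES_PER_SEC", e.g. "8:0 20971520" for 20 MB/s.
        write(group.resolve("blkio.throttle.read_bps_device"),
              majorMinor + " " + bytesPerSec);

        // Move the target process into the group; cgroups only limit members.
        write(group.resolve("cgroup.procs"), Long.toString(pid));
    }

    private static void write(Path file, String value) throws IOException {
        Files.write(file, (value + "\n").getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Example: cap this JVM's reads from device 8:0 at 20 MB/s.
        capReadBandwidth("hdfs-throttle-demo", "8:0", 20L * 1024 * 1024,
                         ProcessHandle.current().pid());
    }
}

Note that this caps a whole process against a specific block device. It can police short-circuit reads or MR spill I/O, but as Steve points out below, it cannot tell one HDFS client's traffic from another's once the reads are served out of the DataNode process.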
On Tue, Nov 12, 2013 at 1:38 AM, Steve Loughran <ste...@hortonworks.com> wrote:

I've looked at it a bit within the context of YARN.

YARN containers are where this would be ideal, as then you'd be able to request I/O capacity as well as CPU and RAM. For that to work, the throttling would have to be outside the app, as you are trying to limit code whether or not it wants to be limited, and because you probably (*) want to give it more bandwidth if the system is otherwise idle. Self-throttling doesn't pick up spare I/O.

1. You can use cgroups in YARN to throttle local disk I/O through the file:// URLs or the Java filesystem APIs, such as for MR temp data.
2. You can't cgroup-throttle HDFS per YARN container, which would be the ideal use case for it. The I/O is taking place in the DN, and cgroups only limit I/O in the throttled process group.
3. Implementing it in the DN would require a lot more complex code there to prioritise work based on block ID (the sole identifier that goes around everywhere) or input source (local sockets for HBase I/O vs. the TCP stack).
4. Once you go to a heterogeneous filesystem you need to think about I/O load per storage layer as well as/alongside per-volume.
5. There's also a generic RPC request throttle to prevent DoS against the NN and other HDFS services. That would need to be server side, but once implemented in the RPC code it would be universal.

You also need to define what load you are trying to throttle: pure RPCs/second, read bandwidth, write bandwidth, seeks, or IOPS. Once a file is lined up for sequential reading, you'd almost want it to stream through the next blocks until a high-priority request came through, but operations like a seek which would involve a backwards disk head movement would be something to throttle (hence you need to be storage-type aware, as SSD seeks cost less). You also need to consider that although the cost of writes is high, it's usually being done with the goal of preserving data, and you don't want to impact durability.

(*) Probably, because that's one of the issues that causes debates in other datacentre platforms, such as Google Omega: do you want max cluster utilisation or max determinism of workload?

If someone were to do IOP throttling in the 3.x+ timeline:

1. It needs clear use cases, YARN containers being #1 for me.
2. We'd have to look at all the research done on this in the past to see what works and what doesn't.

Andrew, what citations of relevance do you have?

-steve
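The ioprio_set mechanism in [2] can be exercised without JNI by shelling out to the util-linux ionice tool; a rough sketch, assuming ionice is on the PATH and the disks sit behind an I/O scheduler that honours priorities (CFQ/BFQ):

import java.io.IOException;

/**
 * Sketch only: demote a process's I/O priority via the ionice CLI, which wraps
 * the ioprio_set syscall referenced in [2].
 */
public class IoPrioritySketch {

    /** ionice classes: 1 = realtime, 2 = best-effort, 3 = idle; level is 0-7. */
    public static void setIoClass(long pid, int ioClass, int level)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(
                "ionice", "-c", Integer.toString(ioClass),
                "-n", Integer.toString(level),
                "-p", Long.toString(pid))
                .inheritIO()
                .start();
        if (p.waitFor() != 0) {
            throw new IOException("ionice exited with " + p.exitValue());
        }
    }

    public static void main(String[] args) throws Exception {
        // Example: drop this JVM (say, an MR task) to best-effort, lowest priority.
        setIoClass(ProcessHandle.current().pid(), 2, 7);
    }
}

This only demotes a whole process, e.g. an entire MR task JVM; per-thread priorities of the kind Andrew describes above (setting the priority of the handling DataXceiver) would need the raw syscall, or an equivalent native call, issued from inside the DataNode.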
On 12 November 2013 04:24, lohit <lohit.vijayar...@gmail.com> wrote:

2013/11/11 Andrew Wang <andrew.w...@cloudera.com>

> Hey Lohit,
>
> This is an interesting topic, and something I actually worked on in grad school before coming to Cloudera. It'd help if you could outline some of your use cases and how per-FileSystem throttling would help. For what I was doing, it made more sense to throttle on the DN side, since you have a better view over all the I/O happening on the system, and you have knowledge of the different volumes, so you can set limits per disk. This still isn't 100% reliable, though, since normally a portion of each disk is used for MR scratch space, which the DN doesn't have control over. I tried playing with thread I/O priorities here, but didn't see much improvement. Maybe the newer cgroups stuff can help out.

Thanks. Yes, we also thought about having something on the DataNode. That would also mean one could easily throttle clients who access the cluster from outside, for example distcp or hftp copies. Clients would not need to worry about throttle configs, and each cluster could control how much throughput can be achieved. We do want to have something like this.

> I'm sure per-FileSystem throttling will have some benefits (and probably be easier than some DN-side implementation), but again, it'd help to better understand the problem you are trying to solve.

One idea was flexibility for clients to override and set their own value. On a trusted cluster we could allow clients to go beyond the default value for some use cases. Alternatively, we also thought about having a default value and a max value, where clients could change the default but not go beyond the max. Another issue with a DN-side config is how to have different values for different clients and easily change them for selected clients.

As Haosong also suggested, we could wrap FSDataOutputStream/FSDataInputStream with a ThrottledInputStream. But we might have to be careful of any code which uses the FileSystem APIs and accidentally throttles itself (like reducer copy, DistributedCache and such...).

> Best,
> Andrew

On Mon, Nov 11, 2013 at 6:16 PM, Haosong Huang <haosd...@gmail.com> wrote:

Hi, lohit. There is a class named ThrottledInputStream (http://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/util/ThrottledInputStream.java) in hadoop-distcp; you could check it out and find more details.

In addition to this, I am working on achieving resource control (including CPU, network, and disk I/O) in the JVM, but my implementation depends on cgroups, which only runs on Linux. I will push my library (java-cgroup) to GitHub in the next several months. If you are interested in it, please give me advice and help me improve it. :-)
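Both the hadoop-distcp ThrottledInputStream Haosong points to and the stream wrapping lohit describes come down to the same idea: meter bytes as they pass through the stream and sleep whenever the observed average rate exceeds a cap. A minimal, self-contained sketch (this is not the distcp class; the name and details are illustrative only):

import java.io.IOException;
import java.io.InputStream;

/**
 * Sketch only: cap the read rate of any InputStream (for example the stream
 * returned by FileSystem.open()) at a fixed number of bytes per second.
 */
public class RateLimitedInputStream extends InputStream {

    private final InputStream in;
    private final long maxBytesPerSec;
    private final long startMs = System.currentTimeMillis();
    private long bytesRead = 0;

    public RateLimitedInputStream(InputStream in, long maxBytesPerSec) {
        this.in = in;
        this.maxBytesPerSec = maxBytesPerSec;
    }

    /** Sleep until the average rate since the stream was opened is under the cap. */
    private void throttle() throws IOException {
        long elapsedMs = Math.max(1, System.currentTimeMillis() - startMs);
        while ((bytesRead * 1000.0) / elapsedMs > maxBytesPerSec) {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IOException("interrupted while throttling", e);
            }
            elapsedMs = Math.max(1, System.currentTimeMillis() - startMs);
        }
    }

    @Override
    public int read() throws IOException {
        throttle();
        int b = in.read();
        if (b >= 0) {
            bytesRead++;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        throttle();
        int n = in.read(buf, off, len);
        if (n > 0) {
            bytesRead += n;
        }
        return n;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}

Wrapping the stream returned by fs.open() in something like this gives the per-stream behaviour lohit sketches below; it also makes his caveat concrete, since any framework code that opens files through the same FileSystem handle (reducer copies, DistributedCache) would pick up the same cap unless explicitly exempted.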
On Tue, Nov 12, 2013 at 3:47 AM, lohit <lohit.vijayar...@gmail.com> wrote:

Hi Adam,

Thanks for the reply. The changes I was referring to are in the FileSystem.java layer, which should not affect HDFS replication or NameNode operations. To give a better idea, this would affect clients something like this:

Configuration conf = new Configuration();
conf.setInt("read.bandwidth.mbpersec", 20);  // 20 MB/s
FileSystem fs = FileSystem.get(conf);

FSDataInputStream fis = fs.open(new Path("/path/to/file.txt"));
fis.read();  // <-- reads capped at a maximum of 20 MB/s

2013/11/11 Adam Muise <amu...@hortonworks.com>

> See https://issues.apache.org/jira/browse/HDFS-3475
>
> Please note that this has met with many unexpected impacts on workload. Be careful and be mindful of your DataNode memory and network capacity.

On Mon, Nov 11, 2013 at 1:59 PM, lohit <lohit.vijayar...@gmail.com> wrote:

Hello Devs,

I wanted to reach out and see if anyone has thought about the ability to throttle data transfer within HDFS. One option we have been thinking about is to throttle on a per-FileSystem basis, similar to Statistics in FileSystem. This would mean anyone with a handle to HDFS/Hftp would be throttled globally within the JVM. The right value for this would depend on the type of hardware we use and how many tasks/clients we allow.

On the other hand, doing something like this at the FileSystem layer would mean many other tasks such as job jar copies, DistributedCache copies, and any hidden data movement would also be throttled. We wanted to know if anyone has had such a requirement on their clusters in the past and what the thinking around it was. Appreciate your inputs/comments.

--
Have a Nice Day!
Lohit
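To make the "per FileSystem, global within the JVM" idea above concrete: one possible shape, by analogy with FileSystem.Statistics, is a shared per-scheme byte budget that every throttled stream in the process charges its reads against. Everything here (the names, and the idea of registering the budget from a hook in FileSystem.get()) is hypothetical and not existing Hadoop code:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/**
 * Sketch only: a JVM-global read budget per FileSystem scheme ("hdfs",
 * "hftp", ...), loosely analogous to FileSystem.Statistics.
 */
public class PerSchemeThrottle {

    /** Shared state per scheme: a cap plus the bytes charged so far. */
    static final class Budget {
        final long maxBytesPerSec;
        final long startMs = System.currentTimeMillis();
        long bytes;  // guarded by the Budget monitor

        Budget(long maxBytesPerSec) {
            this.maxBytesPerSec = maxBytesPerSec;
        }

        /** Charge n bytes, then block while the average rate exceeds the cap. */
        synchronized void acquire(int n) throws InterruptedException {
            bytes += n;
            long elapsedMs = Math.max(1, System.currentTimeMillis() - startMs);
            while ((bytes * 1000.0) / elapsedMs > maxBytesPerSec) {
                wait(50);  // releases the monitor so other streams can charge too
                elapsedMs = Math.max(1, System.currentTimeMillis() - startMs);
            }
        }
    }

    private static final ConcurrentMap<String, Budget> BUDGETS = new ConcurrentHashMap<>();

    /** Called once per scheme, e.g. from a (hypothetical) hook in FileSystem.get(). */
    public static void register(String scheme, long maxBytesPerSec) {
        BUDGETS.putIfAbsent(scheme, new Budget(maxBytesPerSec));
    }

    /** Read paths call this after each read; unregistered schemes stay unthrottled. */
    public static void charge(String scheme, int bytesJustRead) throws InterruptedException {
        Budget b = BUDGETS.get(scheme);
        if (b != null && bytesJustRead > 0) {
            b.acquire(bytesJustRead);
        }
    }
}

A stream for an HDFS path would call PerSchemeThrottle.charge("hdfs", n) after each read, so distcp copies, job jar and DistributedCache transfers, and user reads would all share the one budget; that is exactly the trade-off raised in this thread, since the cap is easy to enforce globally but framework I/O gets throttled along with everything else unless it is deliberately exempted.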