Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

Colin McCabe Wed, 01 Feb 2017 10:30:09 -0800

Hmm.  Maybe I misinterpreted, but I got the impression that Grant was
suggesting that we avoid introducing this concept of "offline replicas"
for now.  Is that feasible?


What is the strategy for declaring a log directory bad?  Is it an
administrative action?  Or is the broker itself going to be responsible
for this?  How do we handle cases where a few disks on a broker are
full, but the others have space?

Are we going to have a disk scanner that will periodically check for
error conditions (similar to the background checks that RAID controllers
do)?  Or will we wait for a failure to happen before declaring a disk
bad?

It seems to me that if we want this to work well we will need to fix
cases in the code where we are suppressing disk errors or ignoring their
root cause.  For example, any place where we are using the old Java APIs
that just return a boolean on failure will need to be fixed, since the
failure could now be disk full, permission denied, or IOE, and we will
need to handle those cases differently.  Also, we will need to harden
the code against disk errors.  Formerly it was OK to just crash on a
disk error; now it is not.  It would be nice to see more in the test
plan about injecting IOExceptions into disk handling code and verifying
that we can handle it correctly.

regards,
Colin


On Wed, Feb 1, 2017, at 10:02, Dong Lin wrote:
> Hey Grant,
> 
> Yes, this KIP does exactly what you described:)
> 
> Thanks,
> Dong
> 
> On Wed, Feb 1, 2017 at 9:45 AM, Grant Henke <ghe...@cloudera.com> wrote:
> 
> > Hi Dong,
> >
> > Thanks for putting this together.
> >
> > Since we are discussing alternative/simplified options. Have you considered
> > handling the disk failures broker side to prevent a crash, marking the disk
> > as "bad" to that individual broker, and continuing as normal? I imagine the
> > broker would then fall out of sync for the replicas hosted on the bad disk
> > and the ISR would shrink. This would allow people using min.isr to keep
> > their data safe and the cluster operators would see a shrink in many ISRs
> > and hopefully an obvious log message leading to a quick fix. I haven't
> > thought through this idea in depth though. So there could be some
> > shortfalls.
> >
> > Thanks,
> > Grant
> >
> >
> >
> > On Wed, Feb 1, 2017 at 11:21 AM, Dong Lin <lindon...@gmail.com> wrote:
> >
> > > Hey Eno,
> > >
> > > Thanks much for the review.
> > >
> > > I think your suggestion is to split disks of a machine into multiple disk
> > > sets and run one broker per disk set. Yeah this is similar to Colin's
> > > suggestion of one-broker-per-disk, which we have evaluated at LinkedIn
> > and
> > > considered it to be a good short term approach.
> > >
> > > As of now I don't think any of these approach is a better alternative in
> > > the long term. I will summarize these here. I have put these reasons in
> > the
> > > KIP's motivation section and rejected alternative section. I am happy to
> > > discuss more and I would certainly like to use an alternative solution
> > that
> > > is easier to do with better performance.
> > >
> > > - JBOD vs. RAID-10: if we switch from RAID-10 with replication-factoer=2
> > to
> > > JBOD with replicatio-factor=3, we get 25% reduction in disk usage and
> > > doubles the tolerance of broker failure before data unavailability from 1
> > > to 2. This is pretty huge gain for any company that uses Kafka at large
> > > scale.
> > >
> > > - JBOD vs. one-broker-per-disk: The benefit of one-broker-per-disk is
> > that
> > > no major code change is needed in Kafka. Among the disadvantage of
> > > one-broker-per-disk summarized in the KIP and previous email with Colin,
> > > the biggest one is the 15% throughput loss compared to JBOD and less
> > > flexibility to balance across disks. Further, it probably requires change
> > > to internal deployment tools at various companies to deal with
> > > one-broker-per-disk setup.
> > >
> > > - JBOD vs. RAID-0: This is the setup that used at Microsoft. The problem
> > is
> > > that a broker becomes unavailable if any disk fail. Suppose
> > > replication-factor=2 and there are 10 disks per machine. Then the
> > > probability of of any message becomes unavailable due to disk failure
> > with
> > > RAID-0 is 100X higher than that with JBOD.
> > >
> > > - JBOD vs. one-broker-per-few-disks: one-broker-per-few-disk is somewhere
> > > between one-broker-per-disk and RAID-0. So it carries an averaged
> > > disadvantages of these two approaches.
> > >
> > > To answer your question regarding, I think it is reasonable to mange disk
> > > in Kafka. By "managing disks" we mean the management of assignment of
> > > replicas across disks. Here are my reasons in more detail:
> > >
> > > - I don't think this KIP is a big step change. By allowing user to
> > > configure Kafka to run multiple log directories or disks as of now, it is
> > > implicit that Kafka manages disks. It is just not a complete feature.
> > > Microsoft and probably other companies are using this feature under the
> > > undesirable effect that a broker will fail any if any disk fail. It is
> > good
> > > to complete this feature.
> > >
> > > - I think it is reasonable to manage disk in Kafka. One of the most
> > > important work that Kafka is doing is to determine the replica assignment
> > > across brokers and make sure enough copies of a given replica is
> > available.
> > > I would argue that it is not much different than determining the replica
> > > assignment across disk conceptually.
> > >
> > > - I would agree that this KIP is improve performance of Kafka at the cost
> > > of more complexity inside Kafka, by switching from RAID-10 to JBOD. I
> > would
> > > argue that this is a right direction. If we can gain 20%+ performance by
> > > managing NIC in Kafka as compared to existing approach and other
> > > alternatives, I would say we should just do it. Such a gain in
> > performance,
> > > or equivalently reduction in cost, can save millions of dollars per year
> > > for any company running Kafka at large scale.
> > >
> > > Thanks,
> > > Dong
> > >
> > >
> > > On Wed, Feb 1, 2017 at 5:41 AM, Eno Thereska <eno.there...@gmail.com>
> > > wrote:
> > >
> > > > I'm coming somewhat late to the discussion, apologies for that.
> > > >
> > > > I'm worried about this proposal. It's moving Kafka to a world where it
> > > > manages disks. So in a sense, the scope of the KIP is limited, but the
> > > > direction it sets for Kafka is quite a big step change. Fundamentally
> > > this
> > > > is about balancing resources for a Kafka broker. This can be done by a
> > > > tool, rather than by changing Kafka. E.g., the tool would take a bunch
> > of
> > > > disks together, create a volume over them and export that to a Kafka
> > > broker
> > > > (in addition to setting the memory limits for that broker or limiting
> > > other
> > > > resources). A different bunch of disks can then make up a second
> > volume,
> > > > and be used by another Kafka broker. This is aligned with what Colin is
> > > > saying (as I understand it).
> > > >
> > > > Disks are not the only resource on a machine, there are several
> > instances
> > > > where multiple NICs are used for example. Do we want fine grained
> > > > management of all these resources? I'd argue that opens us the system
> > to
> > > a
> > > > lot of complexity.
> > > >
> > > > Thanks
> > > > Eno
> > > >
> > > >
> > > > > On 1 Feb 2017, at 01:53, Dong Lin <lindon...@gmail.com> wrote:
> > > > >
> > > > > Hi all,
> > > > >
> > > > > I am going to initiate the vote If there is no further concern with
> > the
> > > > KIP.
> > > > >
> > > > > Thanks,
> > > > > Dong
> > > > >
> > > > >
> > > > > On Fri, Jan 27, 2017 at 8:08 PM, radai <radai.rosenbl...@gmail.com>
> > > > wrote:
> > > > >
> > > > >> a few extra points:
> > > > >>
> > > > >> 1. broker per disk might also incur more client <--> broker sockets:
> > > > >> suppose every producer / consumer "talks" to >1 partition, there's a
> > > > very
> > > > >> good chance that partitions that were co-located on a single 10-disk
> > > > broker
> > > > >> would now be split between several single-disk broker processes on
> > the
> > > > same
> > > > >> machine. hard to put a multiplier on this, but likely >x1. sockets
> > > are a
> > > > >> limited resource at the OS level and incur some memory cost (kernel
> > > > >> buffers)
> > > > >>
> > > > >> 2. there's a memory overhead to spinning up a JVM (compiled code and
> > > > byte
> > > > >> code objects etc). if we assume this overhead is ~300 MB (order of
> > > > >> magnitude, specifics vary) than spinning up 10 JVMs would lose you 3
> > > GB
> > > > of
> > > > >> RAM. not a ton, but non negligible.
> > > > >>
> > > > >> 3. there would also be some overhead downstream of kafka in any
> > > > management
> > > > >> / monitoring / log aggregation system. likely less than x10 though.
> > > > >>
> > > > >> 4. (related to above) - added complexity of administration with more
> > > > >> running instances.
> > > > >>
> > > > >> is anyone running kafka with anywhere near 100GB heaps? i thought
> > the
> > > > point
> > > > >> was to rely on kernel page cache to do the disk buffering ....
> > > > >>
> > > > >> On Thu, Jan 26, 2017 at 11:00 AM, Dong Lin <lindon...@gmail.com>
> > > wrote:
> > > > >>
> > > > >>> Hey Colin,
> > > > >>>
> > > > >>> Thanks much for the comment. Please see me comment inline.
> > > > >>>
> > > > >>> On Thu, Jan 26, 2017 at 10:15 AM, Colin McCabe <cmcc...@apache.org
> > >
> > > > >> wrote:
> > > > >>>
> > > > >>>> On Wed, Jan 25, 2017, at 13:50, Dong Lin wrote:
> > > > >>>>> Hey Colin,
> > > > >>>>>
> > > > >>>>> Good point! Yeah we have actually considered and tested this
> > > > >> solution,
> > > > >>>>> which we call one-broker-per-disk. It would work and should
> > require
> > > > >> no
> > > > >>>>> major change in Kafka as compared to this JBOD KIP. So it would
> > be
> > > a
> > > > >>> good
> > > > >>>>> short term solution.
> > > > >>>>>
> > > > >>>>> But it has a few drawbacks which makes it less desirable in the
> > > long
> > > > >>>>> term.
> > > > >>>>> Assume we have 10 disks on a machine. Here are the problems:
> > > > >>>>
> > > > >>>> Hi Dong,
> > > > >>>>
> > > > >>>> Thanks for the thoughtful reply.
> > > > >>>>
> > > > >>>>>
> > > > >>>>> 1) Our stress test result shows that one-broker-per-disk has 15%
> > > > >> lower
> > > > >>>>> throughput
> > > > >>>>>
> > > > >>>>> 2) Controller would need to send 10X as many LeaderAndIsrRequest,
> > > > >>>>> MetadataUpdateRequest and StopReplicaRequest. This increases the
> > > > >> burden
> > > > >>>>> on
> > > > >>>>> controller which can be the performance bottleneck.
> > > > >>>>
> > > > >>>> Maybe I'm misunderstanding something, but there would not be 10x
> > as
> > > > >> many
> > > > >>>> StopReplicaRequest RPCs, would there?  The other requests would
> > > > >> increase
> > > > >>>> 10x, but from a pretty low base, right?  We are not reassigning
> > > > >>>> partitions all the time, I hope (or else we have bigger
> > problems...)
> > > > >>>>
> > > > >>>
> > > > >>> I think the controller will group StopReplicaRequest per broker and
> > > > send
> > > > >>> only one StopReplicaRequest to a broker during controlled shutdown.
> > > > >> Anyway,
> > > > >>> we don't have to worry about this if we agree that other requests
> > > will
> > > > >>> increase by 10X. One MetadataRequest to send to each broker in the
> > > > >> cluster
> > > > >>> every time there is leadership change. I am not sure this is a real
> > > > >>> problem. But in theory this makes the overhead complexity O(number
> > of
> > > > >>> broker) and may be a concern in the future. Ideally we should avoid
> > > it.
> > > > >>>
> > > > >>>
> > > > >>>>
> > > > >>>>>
> > > > >>>>> 3) Less efficient use of physical resource on the machine. The
> > > number
> > > > >>> of
> > > > >>>>> socket on each machine will increase by 10X. The number of
> > > connection
> > > > >>>>> between any two machine will increase by 100X.
> > > > >>>>>
> > > > >>>>> 4) Less efficient way to management memory and quota.
> > > > >>>>>
> > > > >>>>> 5) Rebalance between disks/brokers on the same machine will less
> > > > >>>>> efficient
> > > > >>>>> and less flexible. Broker has to read data from another broker on
> > > the
> > > > >>>>> same
> > > > >>>>> machine via socket. It is also harder to do automatic load
> > balance
> > > > >>>>> between
> > > > >>>>> disks on the same machine in the future.
> > > > >>>>>
> > > > >>>>> I will put this and the explanation in the rejected alternative
> > > > >>> section.
> > > > >>>>> I
> > > > >>>>> have a few questions:
> > > > >>>>>
> > > > >>>>> - Can you explain why this solution can help avoid scalability
> > > > >>>>> bottleneck?
> > > > >>>>> I actually think it will exacerbate the scalability problem due
> > the
> > > > >> 2)
> > > > >>>>> above.
> > > > >>>>> - Why can we push more RPC with this solution?
> > > > >>>>
> > > > >>>> To really answer this question we'd have to take a deep dive into
> > > the
> > > > >>>> locking of the broker and figure out how effectively it can
> > > > parallelize
> > > > >>>> truly independent requests.  Almost every multithreaded process is
> > > > >> going
> > > > >>>> to have shared state, like shared queues or shared sockets, that
> > is
> > > > >>>> going to make scaling less than linear when you add disks or
> > > > >> processors.
> > > > >>>> (And clearly, another option is to improve that scalability,
> > rather
> > > > >>>> than going multi-process!)
> > > > >>>>
> > > > >>>
> > > > >>> Yeah I also think it is better to improve scalability inside kafka
> > > code
> > > > >> if
> > > > >>> possible. I am not sure we currently have any scalability issue
> > > inside
> > > > >>> Kafka that can not be removed without using multi-process.
> > > > >>>
> > > > >>>
> > > > >>>>
> > > > >>>>> - It is true that a garbage collection in one broker would not
> > > affect
> > > > >>>>> others. But that is after every broker only uses 1/10 of the
> > > memory.
> > > > >>> Can
> > > > >>>>> we be sure that this will actually help performance?
> > > > >>>>
> > > > >>>> The big question is, how much memory do Kafka brokers use now, and
> > > how
> > > > >>>> much will they use in the future?  Our experience in HDFS was that
> > > > once
> > > > >>>> you start getting more than 100-200GB Java heap sizes, full GCs
> > > start
> > > > >>>> taking minutes to finish when using the standard JVMs.  That alone
> > > is
> > > > a
> > > > >>>> good reason to go multi-process or consider storing more things
> > off
> > > > the
> > > > >>>> Java heap.
> > > > >>>>
> > > > >>>
> > > > >>> I see. Now I agree one-broker-per-disk should be more efficient in
> > > > terms
> > > > >> of
> > > > >>> GC since each broker probably needs less than 1/10 of the memory
> > > > >> available
> > > > >>> on a typical machine nowadays. I will remove this from the reason
> > of
> > > > >>> rejection.
> > > > >>>
> > > > >>>
> > > > >>>>
> > > > >>>> Disk failure is the "easy" case.  The "hard" case, which is
> > > > >>>> unfortunately also the much more common case, is disk misbehavior.
> > > > >>>> Towards the end of their lives, disks tend to start slowing down
> > > > >>>> unpredictably.  Requests that would have completed immediately
> > > before
> > > > >>>> start taking 20, 100 500 milliseconds.  Some files may be readable
> > > and
> > > > >>>> other files may not be.  System calls hang, sometimes forever, and
> > > the
> > > > >>>> Java process can't abort them, because the hang is in the kernel.
> > > It
> > > > >> is
> > > > >>>> not fun when threads are stuck in "D state"
> > > > >>>> http://stackoverflow.com/questions/20423521/process-perminan
> > > > >>>> tly-stuck-on-d-state
> > > > >>>> .  Even kill -9 cannot abort the thread then.  Fortunately, this
> > is
> > > > >>>> rare.
> > > > >>>>
> > > > >>>
> > > > >>> I agree it is a harder problem and it is rare. We probably don't
> > have
> > > > to
> > > > >>> worry about it in this KIP since this issue is orthogonal to
> > whether
> > > or
> > > > >> not
> > > > >>> we use JBOD.
> > > > >>>
> > > > >>>
> > > > >>>>
> > > > >>>> Another approach we should consider is for Kafka to implement its
> > > own
> > > > >>>> storage layer that would stripe across multiple disks.  This
> > > wouldn't
> > > > >>>> have to be done at the block level, but could be done at the file
> > > > >> level.
> > > > >>>> We could use consistent hashing to determine which disks a file
> > > should
> > > > >>>> end up on, for example.
> > > > >>>>
> > > > >>>
> > > > >>> Are you suggesting that we should distribute log, or log segment,
> > > > across
> > > > >>> disks of brokers? I am not sure if I fully understand this
> > approach.
> > > My
> > > > >> gut
> > > > >>> feel is that this would be a drastic solution that would require
> > > > >>> non-trivial design. While this may be useful to Kafka, I would
> > prefer
> > > > not
> > > > >>> to discuss this in detail in this thread unless you believe it is
> > > > >> strictly
> > > > >>> superior to the design in this KIP in terms of solving our
> > use-case.
> > > > >>>
> > > > >>>
> > > > >>>> best,
> > > > >>>> Colin
> > > > >>>>
> > > > >>>>>
> > > > >>>>> Thanks,
> > > > >>>>> Dong
> > > > >>>>>
> > > > >>>>> On Wed, Jan 25, 2017 at 11:34 AM, Colin McCabe <
> > cmcc...@apache.org
> > > >
> > > > >>>>> wrote:
> > > > >>>>>
> > > > >>>>>> Hi Dong,
> > > > >>>>>>
> > > > >>>>>> Thanks for the writeup!  It's very interesting.
> > > > >>>>>>
> > > > >>>>>> I apologize in advance if this has been discussed somewhere
> > else.
> > > > >>> But
> > > > >>>> I
> > > > >>>>>> am curious if you have considered the solution of running
> > multiple
> > > > >>>>>> brokers per node.  Clearly there is a memory overhead with this
> > > > >>>> solution
> > > > >>>>>> because of the fixed cost of starting multiple JVMs.  However,
> > > > >>> running
> > > > >>>>>> multiple JVMs would help avoid scalability bottlenecks.  You
> > could
> > > > >>>>>> probably push more RPCs per second, for example.  A garbage
> > > > >>> collection
> > > > >>>>>> in one broker would not affect the others.  It would be
> > > interesting
> > > > >>> to
> > > > >>>>>> see this considered in the "alternate designs" design, even if
> > you
> > > > >>> end
> > > > >>>>>> up deciding it's not the way to go.
> > > > >>>>>>
> > > > >>>>>> best,
> > > > >>>>>> Colin
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>> On Thu, Jan 12, 2017, at 10:46, Dong Lin wrote:
> > > > >>>>>>> Hi all,
> > > > >>>>>>>
> > > > >>>>>>> We created KIP-112: Handle disk failure for JBOD. Please find
> > the
> > > > >>> KIP
> > > > >>>>>>> wiki
> > > > >>>>>>> in the link https://cwiki.apache.org/confl
> > > > >> uence/display/KAFKA/KIP-
> > > > >>>>>>> 112%3A+Handle+disk+failure+for+JBOD.
> > > > >>>>>>>
> > > > >>>>>>> This KIP is related to KIP-113
> > > > >>>>>>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > > >>>>>> 113%3A+Support+replicas+movement+between+log+directories>:
> > > > >>>>>>> Support replicas movement between log directories. They are
> > > > >> needed
> > > > >>> in
> > > > >>>>>>> order
> > > > >>>>>>> to support JBOD in Kafka. Please help review the KIP. You
> > > > >> feedback
> > > > >>> is
> > > > >>>>>>> appreciated!
> > > > >>>>>>>
> > > > >>>>>>> Thanks,
> > > > >>>>>>> Dong
> > > > >>>>>>
> > > > >>>>
> > > > >>>
> > > > >>
> > > >
> > > >
> > >
> >
> >
> >
> > --
> > Grant Henke
> > Software Engineer | Cloudera
> > gr...@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke
> >

Re: [DISCUSS] KIP-112: Handle disk failure for JBOD

Reply via email to