On Thu, Feb 9, 2017 at 3:37 PM, Colin McCabe <cmcc...@apache.org> wrote:

> On Thu, Feb 9, 2017, at 11:40, Dong Lin wrote:
> > Thanks for all the comments Colin!
> >
> > To answer your questions:
> > - Yes, a broker will shutdown if all its log directories are bad.
>
> That makes sense.  Can you add this to the writeup?
>

Sure. This has already been added. You can find it here
<https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=67638402&selectedPageVersions=9&selectedPageVersions=10>.


>
> > - I updated the KIP to explicitly state that a log directory will be
> > assumed to be good until the broker sees an IOException when it tries
> > to access the log directory.
>
> Thanks.
>
> > - The controller doesn't explicitly know whether there is a new log
> > directory. All the controller knows is whether replicas are online or
> > offline, based on the LeaderAndIsrResponse. In the existing Kafka
> > implementation, the controller always sends a LeaderAndIsrRequest to a
> > broker after it bounces.
>
> I thought so.  It's good to clarify, though.  Do you think it's worth
> adding a quick discussion of this on the wiki?
>

Personally I don't think it is needed. If a broker starts with no bad log
directories, everything works as it does today and we should not need to
clarify it. The KIP already covers the scenario where a broker starts with a
bad log directory. Also, the KIP doesn't claim or hint that we support
dynamic addition of new log directories. I think we are good.


> best,
> Colin
>
> >
> > Please see this
> > <https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=67638402&selectedPageVersions=9&selectedPageVersions=10>
> > for the change of the KIP.
> >
> > On Thu, Feb 9, 2017 at 11:04 AM, Colin McCabe <cmcc...@apache.org> wrote:
> >
> > > On Thu, Feb 9, 2017, at 11:03, Colin McCabe wrote:
> > > > Thanks, Dong L.
> > > >
> > > > Do we plan on bringing down the broker process when all log
> > > > directories are offline?
> > > >
> > > > Can you explicitly state on the KIP that the log dirs are all
> > > > considered good after the broker process is bounced?  It seems like
> > > > an important thing to be clear about.  Also, perhaps discuss how the
> > > > controller becomes aware of the newly good log directories after a
> > > > broker bounce (and whether this triggers re-election).
> > >
> > > I meant to write: all the log dirs where the broker can still read
> > > the index and some other files.  Clearly, log dirs that are completely
> > > inaccessible will still be considered bad after a broker process
> > > bounce.
> > >
> > > best,
> > > Colin
> > >
> > > >
> > > > +1 (non-binding) aside from that
> > > >
> > > >
> > > >
> > > > On Wed, Feb 8, 2017, at 00:47, Dong Lin wrote:
> > > > > Hi all,
> > > > >
> > > > > Thank you all for the helpful suggestions. I have updated the KIP
> > > > > to address the comments received so far. See here
> > > > > <https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=67638402&selectedPageVersions=8&selectedPageVersions=9>
> > > > > to read the changes of the KIP. Here is a summary of the changes:
> > > > >
> > > > > - Updated the Proposed Change section to change the recovery
> > > > > steps. After this change, the broker will create a replica as long
> > > > > as all log directories are working.
> > > > > - Removed kafka-log-dirs.sh from this KIP since the user no
> > > > > longer needs it for recovery from bad disks.
> > > > > - Explained how the znode controller_managed_state is managed in
> > > > > the Public Interface section.
> > > > > - Explained what happens during controller failover, partition
> > > > > reassignment
> > > > > and topic deletion in the Proposed Change section.
> > > > > - Updated the Future Work section to include the following
> > > > > potential improvements:
> > > > >   - Let the broker notify the controller of ISR changes and disk
> > > > >     state changes via RPC instead of using zookeeper.
> > > > >   - Handle various failure scenarios (e.g. slow disk) on a
> > > > >     case-by-case basis. For example, we may want to detect a slow
> > > > >     disk and consider it offline.
> > > > >   - Allow the admin to mark a directory as bad so that it will not
> > > > >     be used.
> > > > >
> > > > > Thanks,
> > > > > Dong
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Feb 7, 2017 at 5:23 PM, Dong Lin <lindon...@gmail.com> wrote:
> > > > >
> > > > > > Hey Eno,
> > > > > >
> > > > > > Thanks much for the comment!
> > > > > >
> > > > > > I still think the complexity added to Kafka is justified by its
> > > > > > benefit. Let me provide my reasons below.
> > > > > >
> > > > > > 1) The additional logic is easy to understand and thus its
> > > > > > complexity should be reasonable.
> > > > > >
> > > > > > On the broker side, it needs to catch exceptions when accessing
> > > > > > a log directory, mark the log directory and all its replicas as
> > > > > > offline, notify the controller by writing the zookeeper
> > > > > > notification path, and specify the error in the
> > > > > > LeaderAndIsrResponse. On the controller side, it will listen to
> > > > > > zookeeper for disk failure notifications, learn about offline
> > > > > > replicas in the LeaderAndIsrResponse, and take offline replicas
> > > > > > into consideration when electing leaders. It also marks replicas
> > > > > > as created in zookeeper and uses that to determine whether a
> > > > > > replica has been created.
> > > > > >
> > > > > > That is all the logic we need to add in Kafka. I personally feel
> > > > > > this is easy to reason about.
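> > > > > >
> > > > > > For illustration, here is a minimal sketch of the broker-side
> > > > > > handling (hypothetical names and a made-up notification hook,
> > > > > > not the actual patch):
> > > > > >
> > > > > >   import java.io.IOException
> > > > > >   import scala.collection.mutable
> > > > > >
> > > > > >   // Sketch only: wrap log-directory access, mark the dir (and
> > > > > >   // thus its replicas) offline on IOException, and fire a
> > > > > >   // notification the controller can pick up, e.g. by writing a
> > > > > >   // sequential znode under the zookeeper notification path.
> > > > > >   class LogDirFailureHandler(brokerId: Int, notifyController: String => Unit) {
> > > > > >     val offlineLogDirs = mutable.Set[String]()
> > > > > >
> > > > > >     def protect[T](logDir: String)(op: => T): Option[T] =
> > > > > >       try Some(op)
> > > > > >       catch {
> > > > > >         case _: IOException =>
> > > > > >           offlineLogDirs += logDir
> > > > > >           notifyController(brokerId.toString)
> > > > > >           None
> > > > > >       }
> > > > > >   }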
> > > > > >
> > > > > > 2) The additional code is not much.
> > > > > >
> > > > > > I expect KIP-112 to require around 1100 lines of new code.
> > > > > > Previously I implemented a prototype of a slightly different
> > > > > > design (see here
> > > > > > <https://docs.google.com/document/d/1Izza0SBmZMVUBUt9s_-Dqi3D8e0KGJQYW8xgEdRsgAI/edit>)
> > > > > > and uploaded it to github (see here
> > > > > > <https://github.com/lindong28/kafka/tree/JBOD>). The patch
> > > > > > changed 33 files, added 1185 lines and deleted 183 lines. The
> > > > > > prototype patch is actually smaller than the patch for KIP-107
> > > > > > (see here <https://github.com/apache/kafka/pull/2476>), which has
> > > > > > already been accepted. The KIP-107 patch changed 49 files, added
> > > > > > 1349 lines and deleted 141 lines.
> > > > > >
> > > > > > 3) Comparison with one-broker-per-multiple-volumes
> > > > > >
> > > > > > This KIP improves the availability of Kafka in this case: one
> > > > > > failed volume doesn't bring down the entire broker.
> > > > > >
> > > > > > 4) Comparison with one-broker-per-volume
> > > > > >
> > > > > > If each volume maps to multiple disks, then we still have a
> > > > > > similar problem: the broker will fail if any disk in the volume
> > > > > > fails.
> > > > > >
> > > > > > If each volume maps to one disk, it means that we need to deploy
> > > > > > 10 brokers on a machine if the machine has 10 disks. I will
> > > > > > explain the concerns with this approach in order of importance.
> > > > > >
> > > > > > - It is awkward to tell Kafka users to deploy 50 brokers on a
> > > > > > machine with 50 disks.
> > > > > >
> > > > > > - Whether the user deploys Kafka on a commercial cloud platform
> > > > > > or in their own cluster, the size of the largest disk is usually
> > > > > > limited. There will be scenarios where users want to increase
> > > > > > broker capacity by having multiple disks per broker. This JBOD
> > > > > > KIP makes that feasible without hurting availability on a single
> > > > > > disk failure.
> > > > > >
> > > > > > - Automatic load rebalancing across disks will be easier and
> > > > > > more flexible if one broker has multiple disks. This can be
> > > > > > future work.
> > > > > >
> > > > > > - There is a performance concern when you deploy 10 brokers vs.
> > > > > > 1 broker on one machine. The metadata traffic in the cluster,
> > > > > > including FetchRequest, ProduceResponse, MetadataRequest and so
> > > > > > on, will all be 10X higher. The packets-per-second will be 10X
> > > > > > higher, which may limit performance if pps is the bottleneck.
> > > > > > The number of sockets on the machine will be 10X higher. And the
> > > > > > number of replication threads will be 100X higher, since every
> > > > > > broker on one machine runs fetchers to every broker on another.
> > > > > > The impact will be more significant with an increasing number of
> > > > > > disks per machine. Thus it will limit Kafka's scalability in the
> > > > > > long term.
> > > > > >
> > > > > > Thanks,
> > > > > > Dong
> > > > > >
> > > > > >
> > > > > > On Tue, Feb 7, 2017 at 1:51 AM, Eno Thereska <eno.there...@gmail.com> wrote:
> > > > > >
> > > > > >> Hi Dong,
> > > > > >>
> > > > > >> To simplify the discussion today, on my part I'll zoom into one
> > > > > >> thing only:
> > > > > >>
> > > > > >> - I'll discuss the options called below: "one-broker-per-disk"
> > > > > >> or "one-broker-per-few-disks".
> > > > > >>
> > > > > >> - I completely buy the JBOD vs RAID arguments, so there is no
> > > > > >> need to discuss that part for me. I buy it that JBODs are good.
> > > > > >>
> > > > > >> I find the terminology can be improved a bit. Ideally we'd be
> > > > > >> talking about volumes, not disks, just to make it clear that
> > > > > >> Kafka understands volumes/directories, not individual raw
> > > > > >> disks. So by "one-broker-per-few-disks" what I mean is that the
> > > > > >> admin can pool a few disks together to create a
> > > > > >> volume/directory and give that to Kafka.
> > > > > >>
> > > > > >>
> > > > > >> The kernel of my question is that the admin already has tools
> > > > > >> to 1) create volumes/directories from a JBOD, 2) start a broker
> > > > > >> on a desired machine, and 3) assign a broker resources like a
> > > > > >> directory. I claim that those tools are sufficient to optimise
> > > > > >> resource allocation.  I understand that a broker could manage
> > > > > >> point 3) itself, i.e. juggle the directories. My question is
> > > > > >> whether the complexity added to Kafka is justified.
> > > > > >> Operationally it seems to me an admin will still have to do all
> > > > > >> three items above.
> > > > > >>
> > > > > >> Looking forward to the discussion
> > > > > >> Thanks
> > > > > >> Eno
> > > > > >>
> > > > > >>
> > > > > >> > On 1 Feb 2017, at 17:21, Dong Lin <lindon...@gmail.com> wrote:
> > > > > >> >
> > > > > >> > Hey Eno,
> > > > > >> >
> > > > > >> > Thanks much for the review.
> > > > > >> >
> > > > > >> > I think your suggestion is to split the disks of a machine
> > > > > >> > into multiple disk sets and run one broker per disk set.
> > > > > >> > Yeah, this is similar to Colin's suggestion of
> > > > > >> > one-broker-per-disk, which we have evaluated at LinkedIn and
> > > > > >> > consider to be a good short-term approach.
> > > > > >> >
> > > > > >> > As of now I don't think any of these approaches is a better
> > > > > >> > alternative in the long term. I will summarize them here. I
> > > > > >> > have put these reasons in the KIP's motivation section and
> > > > > >> > rejected alternatives section. I am happy to discuss more,
> > > > > >> > and I would certainly like to use an alternative solution
> > > > > >> > that is easier to implement with better performance.
> > > > > >> >
> > > > > >> > - JBOD vs. RAID-10: if we switch from RAID-10 with
> > > > > >> > replication-factor=2 to JBOD with replication-factor=3, we
> > > > > >> > get a 25% reduction in disk usage and double the number of
> > > > > >> > broker failures tolerated before data unavailability, from 1
> > > > > >> > to 2. This is a pretty huge gain for any company that uses
> > > > > >> > Kafka at large scale.
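> > > > > >> >
> > > > > >> > To spell out the arithmetic (my own sketch, not from the KIP):
> > > > > >> >
> > > > > >> >   // RAID-10 mirrors every block, so RF=2 stores 2 * 2 = 4
> > > > > >> >   // copies of each message; JBOD with RF=3 stores 3 copies
> > > > > >> >   // and survives two broker failures instead of one.
> > > > > >> >   val raid10Copies = 2 * 2
> > > > > >> >   val jbodCopies   = 3
> > > > > >> >   val diskSavings  = 1.0 - jbodCopies.toDouble / raid10Copies  // 0.25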
> > > > > >> >
> > > > > >> > - JBOD vs. one-broker-per-disk: The benefit of
> > > > > >> > one-broker-per-disk is that no major code change is needed in
> > > > > >> > Kafka. Among the disadvantages of one-broker-per-disk
> > > > > >> > summarized in the KIP and the previous email with Colin, the
> > > > > >> > biggest are the 15% throughput loss compared to JBOD and the
> > > > > >> > reduced flexibility to balance load across disks. Further, it
> > > > > >> > probably requires changes to internal deployment tools at
> > > > > >> > various companies to deal with the one-broker-per-disk setup.
> > > > > >> >
> > > > > >> > - JBOD vs. RAID-0: This is the setup used at Microsoft. The
> > > > > >> > problem is that a broker becomes unavailable if any disk
> > > > > >> > fails. Suppose replication-factor=2 and there are 10 disks
> > > > > >> > per machine. Then the probability of any message becoming
> > > > > >> > unavailable due to disk failure is 100X higher with RAID-0
> > > > > >> > than with JBOD.
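> > > > > >> >
> > > > > >> > A quick sketch of where the 100X comes from (my own numbers,
> > > > > >> > assuming independent disk failures with probability p):
> > > > > >> >
> > > > > >> >   val p = 1e-3                      // assumed per-disk failure probability
> > > > > >> >   val pRaid0 = math.pow(10 * p, 2)  // RAID-0: any of the 10 disks downs a
> > > > > >> >                                     // broker, and both replicas' brokers
> > > > > >> >                                     // must fail
> > > > > >> >   val pJbod  = math.pow(p, 2)       // JBOD: only the 2 disks holding the
> > > > > >> >                                     // replicas matter
> > > > > >> >   println(pRaid0 / pJbod)           // ~100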
> > > > > >> >
> > > > > >> > - JBOD vs. one-broker-per-few-disks:
> > > > > >> > one-broker-per-few-disks is somewhere between
> > > > > >> > one-broker-per-disk and RAID-0, so it carries an average of
> > > > > >> > the disadvantages of those two approaches.
> > > > > >> >
> > > > > >> > To answer your question: I think it is reasonable to manage
> > > > > >> > disks in Kafka. By "managing disks" we mean the management of
> > > > > >> > the assignment of replicas across disks. Here are my reasons
> > > > > >> > in more detail:
> > > > > >> >
> > > > > >> > - I don't think this KIP is a big step change. By allowing
> > > > > >> > users to configure Kafka with multiple log directories or
> > > > > >> > disks as of now, it is already implicit that Kafka manages
> > > > > >> > disks. It is just not a complete feature. Microsoft and
> > > > > >> > probably other companies are using this feature with the
> > > > > >> > undesirable effect that a broker will fail if any disk fails.
> > > > > >> > It is good to complete this feature.
> > > > > >> >
> > > > > >> > - I think it is reasonable to manage disks in Kafka. One of
> > > > > >> > the most important things Kafka does is determine the replica
> > > > > >> > assignment across brokers and make sure enough copies of a
> > > > > >> > given partition are available. I would argue that determining
> > > > > >> > the replica assignment across disks is conceptually not much
> > > > > >> > different.
> > > > > >> >
> > > > > >> > - I would agree that this KIP improves the performance of
> > > > > >> > Kafka at the cost of more complexity inside Kafka, by
> > > > > >> > switching from RAID-10 to JBOD. I would argue that this is
> > > > > >> > the right direction. If we could gain 20%+ performance by
> > > > > >> > managing NICs in Kafka as compared to the existing approach
> > > > > >> > and other alternatives, I would say we should just do it.
> > > > > >> > Such a gain in performance, or equivalently a reduction in
> > > > > >> > cost, can save millions of dollars per year for any company
> > > > > >> > running Kafka at large scale.
> > > > > >> >
> > > > > >> > Thanks,
> > > > > >> > Dong
> > > > > >> >
> > > > > >> >
> > > > > >> > On Wed, Feb 1, 2017 at 5:41 AM, Eno Thereska <eno.there...@gmail.com> wrote:
> > > > > >> >
> > > > > >> >> I'm coming somewhat late to the discussion, apologies for
> > > > > >> >> that.
> > > > > >> >>
> > > > > >> >> I'm worried about this proposal. It's moving Kafka to a
> > > > > >> >> world where it manages disks. So in a sense, the scope of
> > > > > >> >> the KIP is limited, but the direction it sets for Kafka is
> > > > > >> >> quite a big step change. Fundamentally this is about
> > > > > >> >> balancing resources for a Kafka broker. This can be done by
> > > > > >> >> a tool, rather than by changing Kafka. E.g., the tool would
> > > > > >> >> take a bunch of disks together, create a volume over them
> > > > > >> >> and export that to a Kafka broker (in addition to setting
> > > > > >> >> the memory limits for that broker or limiting other
> > > > > >> >> resources). A different bunch of disks can then make up a
> > > > > >> >> second volume, and be used by another Kafka broker. This is
> > > > > >> >> aligned with what Colin is saying (as I understand it).
> > > > > >> >>
> > > > > >> >> Disks are not the only resource on a machine; there are
> > > > > >> >> several instances where multiple NICs are used, for example.
> > > > > >> >> Do we want fine-grained management of all these resources?
> > > > > >> >> I'd argue that opens the system up to a lot of complexity.
> > > > > >> >>
> > > > > >> >> Thanks
> > > > > >> >> Eno
> > > > > >> >>
> > > > > >> >>
> > > > > >> >>> On 1 Feb 2017, at 01:53, Dong Lin <lindon...@gmail.com> wrote:
> > > > > >> >>>
> > > > > >> >>> Hi all,
> > > > > >> >>>
> > > > > >> >>> I am going to initiate the vote if there is no further
> > > > > >> >>> concern with the KIP.
> > > > > >> >>>
> > > > > >> >>> Thanks,
> > > > > >> >>> Dong
> > > > > >> >>>
> > > > > >> >>>
> > > > > >> >>> On Fri, Jan 27, 2017 at 8:08 PM, radai <radai.rosenbl...@gmail.com> wrote:
> > > > > >> >>>
> > > > > >> >>>> a few extra points:
> > > > > >> >>>>
> > > > > >> >>>> 1. broker per disk might also incur more client <-->
> > > > > >> >>>> broker sockets: suppose every producer / consumer "talks"
> > > > > >> >>>> to >1 partition, there's a very good chance that
> > > > > >> >>>> partitions that were co-located on a single 10-disk broker
> > > > > >> >>>> would now be split between several single-disk broker
> > > > > >> >>>> processes on the same machine. hard to put a multiplier on
> > > > > >> >>>> this, but likely >x1. sockets are a limited resource at
> > > > > >> >>>> the OS level and incur some memory cost (kernel buffers)
> > > > > >> >>>>
> > > > > >> >>>> 2. there's a memory overhead to spinning up a JVM
> > > > > >> >>>> (compiled code and byte code objects etc). if we assume
> > > > > >> >>>> this overhead is ~300 MB (order of magnitude, specifics
> > > > > >> >>>> vary) then spinning up 10 JVMs would lose you 3 GB of RAM.
> > > > > >> >>>> not a ton, but non negligible.
> > > > > >> >>>>
> > > > > >> >>>> 3. there would also be some overhead downstream of kafka
> > > > > >> >>>> in any management / monitoring / log aggregation system.
> > > > > >> >>>> likely less than x10 though.
> > > > > >> >>>>
> > > > > >> >>>> 4. (related to above) - added complexity of administration
> > > > > >> >>>> with more running instances.
> > > > > >> >>>>
> > > > > >> >>>> is anyone running kafka with anywhere near 100GB heaps? i
> > > > > >> >>>> thought the point was to rely on kernel page cache to do
> > > > > >> >>>> the disk buffering ....
> > > > > >> >>>>
> > > > > >> >>>> On Thu, Jan 26, 2017 at 11:00 AM, Dong Lin <lindon...@gmail.com> wrote:
> > > > > >> >>>>
> > > > > >> >>>>> Hey Colin,
> > > > > >> >>>>>
> > > > > >> >>>>> Thanks much for the comment. Please see my comments inline.
> > > > > >> >>>>>
> > > > > >> >>>>> On Thu, Jan 26, 2017 at 10:15 AM, Colin McCabe <cmcc...@apache.org> wrote:
> > > > > >> >>>>>
> > > > > >> >>>>>> On Wed, Jan 25, 2017, at 13:50, Dong Lin wrote:
> > > > > >> >>>>>>> Hey Colin,
> > > > > >> >>>>>>>
> > > > > >> >>>>>>> Good point! Yeah, we have actually considered and
> > > > > >> >>>>>>> tested this solution, which we call
> > > > > >> >>>>>>> one-broker-per-disk. It would work and should require
> > > > > >> >>>>>>> no major change in Kafka as compared to this JBOD KIP.
> > > > > >> >>>>>>> So it would be a good short-term solution.
> > > > > >> >>>>>>>
> > > > > >> >>>>>>> But it has a few drawbacks which make it less desirable
> > > > > >> >>>>>>> in the long term. Assume we have 10 disks on a machine.
> > > > > >> >>>>>>> Here are the problems:
> > > > > >> >>>>>>
> > > > > >> >>>>>> Hi Dong,
> > > > > >> >>>>>>
> > > > > >> >>>>>> Thanks for the thoughtful reply.
> > > > > >> >>>>>>
> > > > > >> >>>>>>>
> > > > > >> >>>>>>> 1) Our stress test results show that
> > > > > >> >>>>>>> one-broker-per-disk has 15% lower throughput.
> > > > > >> >>>>>>>
> > > > > >> >>>>>>> 2) The controller would need to send 10X as many
> > > > > >> >>>>>>> LeaderAndIsrRequests, MetadataUpdateRequests and
> > > > > >> >>>>>>> StopReplicaRequests. This increases the burden on the
> > > > > >> >>>>>>> controller, which can be the performance bottleneck.
> > > > > >> >>>>>>
> > > > > >> >>>>>> Maybe I'm misunderstanding something, but there would
> > > > > >> >>>>>> not be 10x as many StopReplicaRequest RPCs, would there?
> > > > > >> >>>>>> The other requests would increase 10x, but from a pretty
> > > > > >> >>>>>> low base, right?  We are not reassigning partitions all
> > > > > >> >>>>>> the time, I hope (or else we have bigger problems...)
> > > > > >> >>>>>>
> > > > > >> >>>>>
> > > > > >> >>>>> I think the controller will group StopReplicaRequests per
> > > > > >> >>>>> broker and send only one StopReplicaRequest to a broker
> > > > > >> >>>>> during controlled shutdown. Anyway, we don't have to
> > > > > >> >>>>> worry about this if we agree that the other requests will
> > > > > >> >>>>> increase by 10X. One MetadataRequest is sent to each
> > > > > >> >>>>> broker in the cluster every time there is a leadership
> > > > > >> >>>>> change. I am not sure this is a real problem, but in
> > > > > >> >>>>> theory it makes the overhead complexity O(number of
> > > > > >> >>>>> brokers) and may be a concern in the future. Ideally we
> > > > > >> >>>>> should avoid it.
> > > > > >> >>>>>
> > > > > >> >>>>>
> > > > > >> >>>>>>
> > > > > >> >>>>>>>
> > > > > >> >>>>>>> 3) Less efficient use of physical resources on the
> > > > > >> >>>>>>> machine. The number of sockets on each machine will
> > > > > >> >>>>>>> increase by 10X. The number of connections between any
> > > > > >> >>>>>>> two machines will increase by 100X, since each of the
> > > > > >> >>>>>>> 10 brokers on one machine connects to each of the 10
> > > > > >> >>>>>>> brokers on the other.
> > > > > >> >>>>>>>
> > > > > >> >>>>>>> 4) Less efficient management of memory and quotas.
> > > > > >> >>>>>>>
> > > > > >> >>>>>>> 5) Rebalancing between disks/brokers on the same
> > > > > >> >>>>>>> machine will be less efficient and less flexible. A
> > > > > >> >>>>>>> broker has to read data from another broker on the same
> > > > > >> >>>>>>> machine via a socket. It is also harder to do automatic
> > > > > >> >>>>>>> load balancing between disks on the same machine in the
> > > > > >> >>>>>>> future.
> > > > > >> >>>>>>>
> > > > > >> >>>>>>> I will put this and the explanation in the rejected
> > > > > >> >>>>>>> alternatives section. I have a few questions:
> > > > > >> >>>>>>>
> > > > > >> >>>>>>> - Can you explain why this solution can help avoid
> > > > > >> >>>>>>> scalability bottlenecks? I actually think it will
> > > > > >> >>>>>>> exacerbate the scalability problem due to 2) above.
> > > > > >> >>>>>>> - Why can we push more RPCs with this solution?
> > > > > >> >>>>>>
> > > > > >> >>>>>> To really answer this question we'd have to take a deep
> > > > > >> >>>>>> dive into the locking of the broker and figure out how
> > > > > >> >>>>>> effectively it can parallelize truly independent
> > > > > >> >>>>>> requests.  Almost every multithreaded process is going
> > > > > >> >>>>>> to have shared state, like shared queues or shared
> > > > > >> >>>>>> sockets, that is going to make scaling less than linear
> > > > > >> >>>>>> when you add disks or processors.  (And clearly, another
> > > > > >> >>>>>> option is to improve that scalability, rather than going
> > > > > >> >>>>>> multi-process!)
> > > > > >> >>>>>>
> > > > > >> >>>>>
> > > > > >> >>>>> Yeah, I also think it is better to improve scalability
> > > > > >> >>>>> inside the Kafka code if possible. I am not sure we
> > > > > >> >>>>> currently have any scalability issue inside Kafka that
> > > > > >> >>>>> cannot be fixed without going multi-process.
> > > > > >> >>>>>
> > > > > >> >>>>>
> > > > > >> >>>>>>
> > > > > >> >>>>>>> - It is true that a garbage collection in one broker
> > > > > >> >>>>>>> would not affect others. But that is only after every
> > > > > >> >>>>>>> broker uses 1/10 of the memory. Can we be sure that
> > > > > >> >>>>>>> this will actually help performance?
> > > > > >> >>>>>>
> > > > > >> >>>>>> The big question is, how much memory do Kafka brokers
> > > > > >> >>>>>> use now, and how much will they use in the future?  Our
> > > > > >> >>>>>> experience in HDFS was that once you start getting more
> > > > > >> >>>>>> than 100-200GB Java heap sizes, full GCs start taking
> > > > > >> >>>>>> minutes to finish when using the standard JVMs.  That
> > > > > >> >>>>>> alone is a good reason to go multi-process or consider
> > > > > >> >>>>>> storing more things off the Java heap.
> > > > > >> >>>>>>
> > > > > >> >>>>>
> > > > > >> >>>>> I see. Now I agree one-broker-per-disk should be more
> > > > > >> >>>>> efficient in terms of GC, since each broker probably
> > > > > >> >>>>> needs less than 1/10 of the memory available on a typical
> > > > > >> >>>>> machine nowadays. I will remove this from the reasons for
> > > > > >> >>>>> rejection.
> > > > > >> >>>>>
> > > > > >> >>>>>
> > > > > >> >>>>>>
> > > > > >> >>>>>> Disk failure is the "easy" case.  The "hard" case, which
> > > > > >> >>>>>> is unfortunately also the much more common case, is disk
> > > > > >> >>>>>> misbehavior.  Towards the end of their lives, disks tend
> > > > > >> >>>>>> to start slowing down unpredictably.  Requests that
> > > > > >> >>>>>> would have completed immediately before start taking 20,
> > > > > >> >>>>>> 100, 500 milliseconds.  Some files may be readable and
> > > > > >> >>>>>> other files may not be.  System calls hang, sometimes
> > > > > >> >>>>>> forever, and the Java process can't abort them, because
> > > > > >> >>>>>> the hang is in the kernel.  It is not fun when threads
> > > > > >> >>>>>> are stuck in "D state"
> > > > > >> >>>>>> http://stackoverflow.com/questions/20423521/process-perminantly-stuck-on-d-state
> > > > > >> >>>>>> .  Even kill -9 cannot abort the thread then.
> > > > > >> >>>>>> Fortunately, this is rare.
> > > > > >> >>>>>>
> > > > > >> >>>>>
> > > > > >> >>>>> I agree it is a harder problem and it is rare. We
> > > > > >> >>>>> probably don't have to worry about it in this KIP since
> > > > > >> >>>>> this issue is orthogonal to whether or not we use JBOD.
> > > > > >> >>>>>
> > > > > >> >>>>>
> > > > > >> >>>>>>
> > > > > >> >>>>>> Another approach we should consider is for Kafka to
> > > > > >> >>>>>> implement its own storage layer that would stripe across
> > > > > >> >>>>>> multiple disks.  This wouldn't have to be done at the
> > > > > >> >>>>>> block level, but could be done at the file level.  We
> > > > > >> >>>>>> could use consistent hashing to determine which disks a
> > > > > >> >>>>>> file should end up on, for example.
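> > > > > >> >>>>>>
> > > > > >> >>>>>> A rough sketch of the idea (hypothetical, not a concrete
> > > > > >> >>>>>> proposal):
> > > > > >> >>>>>>
> > > > > >> >>>>>>   import java.util.TreeMap
> > > > > >> >>>>>>
> > > > > >> >>>>>>   // Consistent-hash ring with virtual nodes: each disk
> > > > > >> >>>>>>   // appears many times on the ring so files spread evenly,
> > > > > >> >>>>>>   // and only ~1/N of the files move when a disk is added
> > > > > >> >>>>>>   // or removed.
> > > > > >> >>>>>>   class DiskRing(disks: Seq[String], vnodes: Int = 100) {
> > > > > >> >>>>>>     private val ring = new TreeMap[Int, String]()
> > > > > >> >>>>>>     for (d <- disks; i <- 0 until vnodes)
> > > > > >> >>>>>>       ring.put(s"$d#$i".hashCode, d)
> > > > > >> >>>>>>
> > > > > >> >>>>>>     def diskFor(file: String): String = {
> > > > > >> >>>>>>       val e = ring.ceilingEntry(file.hashCode)
> > > > > >> >>>>>>       (if (e != null) e else ring.firstEntry).getValue
> > > > > >> >>>>>>     }
> > > > > >> >>>>>>   }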
> > > > > >> >>>>>>
> > > > > >> >>>>>
> > > > > >> >>>>> Are you suggesting that we should distribute logs, or log
> > > > > >> >>>>> segments, across the disks of brokers? I am not sure I
> > > > > >> >>>>> fully understand this approach. My gut feel is that this
> > > > > >> >>>>> would be a drastic solution that would require
> > > > > >> >>>>> non-trivial design. While this may be useful to Kafka, I
> > > > > >> >>>>> would prefer not to discuss it in detail in this thread
> > > > > >> >>>>> unless you believe it is strictly superior to the design
> > > > > >> >>>>> in this KIP in terms of solving our use case.
> > > > > >> >>>>>
> > > > > >> >>>>>
> > > > > >> >>>>>> best,
> > > > > >> >>>>>> Colin
> > > > > >> >>>>>>
> > > > > >> >>>>>>>
> > > > > >> >>>>>>> Thanks,
> > > > > >> >>>>>>> Dong
> > > > > >> >>>>>>>
> > > > > >> >>>>>>> On Wed, Jan 25, 2017 at 11:34 AM, Colin McCabe <cmcc...@apache.org> wrote:
> > > > > >> >>>>>>>
> > > > > >> >>>>>>>> Hi Dong,
> > > > > >> >>>>>>>>
> > > > > >> >>>>>>>> Thanks for the writeup!  It's very interesting.
> > > > > >> >>>>>>>>
> > > > > >> >>>>>>>> I apologize in advance if this has been discussed
> > > > > >> >>>>>>>> somewhere else.  But I am curious if you have
> > > > > >> >>>>>>>> considered the solution of running multiple brokers
> > > > > >> >>>>>>>> per node.  Clearly there is a memory overhead with
> > > > > >> >>>>>>>> this solution because of the fixed cost of starting
> > > > > >> >>>>>>>> multiple JVMs.  However, running multiple JVMs would
> > > > > >> >>>>>>>> help avoid scalability bottlenecks.  You could
> > > > > >> >>>>>>>> probably push more RPCs per second, for example.  A
> > > > > >> >>>>>>>> garbage collection in one broker would not affect the
> > > > > >> >>>>>>>> others.  It would be interesting to see this
> > > > > >> >>>>>>>> considered in the "alternate designs" section, even if
> > > > > >> >>>>>>>> you end up deciding it's not the way to go.
> > > > > >> >>>>>>>>
> > > > > >> >>>>>>>> best,
> > > > > >> >>>>>>>> Colin
> > > > > >> >>>>>>>>
> > > > > >> >>>>>>>>
> > > > > >> >>>>>>>> On Thu, Jan 12, 2017, at 10:46, Dong Lin wrote:
> > > > > >> >>>>>>>>> Hi all,
> > > > > >> >>>>>>>>>
> > > > > >> >>>>>>>>> We created KIP-112: Handle disk failure for JBOD.
> > > > > >> >>>>>>>>> Please find the KIP wiki at
> > > > > >> >>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD.
> > > > > >> >>>>>>>>>
> > > > > >> >>>>>>>>> This KIP is related to KIP-113
> > > > > >> >>>>>>>>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-113%3A+Support+replicas+movement+between+log+directories>:
> > > > > >> >>>>>>>>> Support replicas movement between log directories.
> > > > > >> >>>>>>>>> Both are needed in order to support JBOD in Kafka.
> > > > > >> >>>>>>>>> Please help review the KIP. Your feedback is
> > > > > >> >>>>>>>>> appreciated!
> > > > > >> >>>>>>>>>
> > > > > >> >>>>>>>>> Thanks,
> > > > > >> >>>>>>>>> Dong
> > > > > >> >>>>>>>>
> > > > > >> >>>>>>
> > > > > >> >>>>>
> > > > > >> >>>>
> > > > > >> >>
> > > > > >> >>
> > > > > >>
> > > > > >>
> > > > > >
> > >
>
