Hey Tom,

Good question. We have actually considered having DescribeDirsResponse
provide the capacity of each disk as well. This was not included because we
believe Kafka cluster admins will always configure all brokers with the same
number of disks of the same size, since it is generally easier to manage a
homogeneous cluster. If this is not the case, then I think we should include
this information in the response.
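
For illustration only -- this is not part of the current KIP-113 proposal, and
the field names below are made up -- adding capacity could look roughly like:

    // Hypothetical sketch of a DescribeDirsResponse entry that also carries
    // capacity information. The per-replica size is what the thread above
    // describes; totalBytes and usableBytes would be the hypothetical additions.
    import java.util.Map;
    import org.apache.kafka.common.TopicPartition;

    public class LogDirInfoSketch {
        public String path;                                // log directory path
        public long totalBytes;                            // capacity of the backing disk
        public long usableBytes;                           // free space on the backing disk
        public Map<TopicPartition, Long> replicaSizeBytes; // per-replica size on this disk
    }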

Thanks,
Dong


On Mon, Aug 7, 2017 at 3:44 AM, Tom Bentley <t.j.bent...@gmail.com> wrote:

> Hi Dong,
>
> Your comments on KIP-179 prompted me to look at KIP-113, and I have a
> question:
>
> AFAICS the DescribeDirsResponse (via ReplicaInfo) can be used to get the
> size of a partition on a disk, but I don't see a mechanism for knowing the
> total capacity of a disk (and/or the free capacity of a disk). That would
> be very useful information to have to help figure out that certain
> assignments are impossible, for instance. Is there a reason you've left
> this out?
>
> Cheers,
>
> Tom
>
> On 4 August 2017 at 18:47, Dong Lin <lindon...@gmail.com> wrote:
>
> > Hey Ismael,
> >
> > Thanks for the comments! Here are my answers:
> >
> > 1. Yes, it has been considered. Here are the reasons why we don't do it
> > through the controller.
> >
> > - There can be use-cases where we only want to rebalance the load across the
> > log directories on a given broker. It seems unnecessary to go through the
> > controller in this case.
> >
> > - If the controller is responsible for sending ChangeReplicaDirRequest, and
> > if the user-specified log directory is either invalid or offline, then the
> > controller probably needs a way to tell the user that the partition
> > reassignment has failed. We currently don't have a way to do this, since
> > kafka-reassign-partitions.sh simply creates the reassignment znode without
> > waiting for a response. I am not sure there is a good solution to this.
> >
> > - If the controller is responsible for sending ChangeReplicaDirRequest, the
> > controller logic would be more complicated, because the controller needs to
> > first send ChangeReplicaRequest so that the broker memorizes the partition
> > -> log directory mapping, then send LeaderAndIsrRequest, and keep sending
> > ChangeReplicaDirRequest (just in case the broker restarted) until the
> > replica is created. Note that the last step needs to be repeated with a
> > timeout, as proposed in KIP-113.
> >
> > Overall I think this adds quite a bit of complexity to the controller, and
> > we probably want to do this only if there is a strong, clear reason for
> > doing so. Currently in KIP-113, kafka-reassign-partitions.sh is responsible
> > for sending ChangeReplicaDirRequest with retries and reports an error to the
> > user if it either fails or times out. This seems much simpler, and the user
> > shouldn't care whether it is done through the controller.
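> >
> > For illustration, the retry logic could look roughly like the following (the
> > helper callbacks are placeholders for this sketch, not actual tool code):
> >
> >     import java.util.function.BooleanSupplier;
> >
> >     // Hypothetical sketch of the retry loop in kafka-reassign-partitions.sh.
> >     // "send" issues ChangeReplicaDirRequest; "isDone" checks (e.g. via
> >     // DescribeDirsRequest) whether the replica is now in the target log dir.
> >     static void moveReplicaWithRetry(Runnable send, BooleanSupplier isDone,
> >                                      long timeoutMs, long backoffMs)
> >             throws InterruptedException {
> >         long deadline = System.currentTimeMillis() + timeoutMs;
> >         while (System.currentTimeMillis() < deadline) {
> >             send.run();                 // resend in case the broker restarted
> >             if (isDone.getAsBoolean())
> >                 return;                 // replica confirmed in the target dir
> >             Thread.sleep(backoffMs);
> >         }
> >         throw new RuntimeException("Timed out waiting for the replica movement");
> >     }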
> >
> > And thanks for the suggestion. I will add this to the Rejected Alternatives
> > section in KIP-113.
> >
> > 2) I think the user needs to be able to specify different log directories
> > for the replicas of the same partition in order to rebalance load across the
> > log directories of all brokers. I am not sure I understand the question. Can
> > you explain a bit more what you mean by "the log directory has to be the
> > same for all replicas of a given partition"?
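> >
> > To make this concrete, the extended reassignment JSON could look roughly
> > like the following (the field name "log_dirs" and the exact layout are only
> > an illustration here; please see the KIP wiki for the actual format), where
> > each log directory corresponds to the replica at the same index in
> > "replicas":
> >
> >     {"version":1,
> >      "partitions":[
> >        {"topic":"foo",
> >         "partition":0,
> >         "replicas":[1,2],
> >         "log_dirs":["/data/disk1","/data/disk5"]}
> >      ]}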
> >
> > 3) Good point. I think alterReplicaDir is better than changeReplicaDir for
> > the reason you provided. I will also update the names of the
> > request/response in the KIP.
> >
> >
> > Thanks,
> > Dong
> >
> > On Fri, Aug 4, 2017 at 9:49 AM, Ismael Juma <ism...@juma.me.uk> wrote:
> >
> > > Thanks Dong. I have a few initial questions; sorry if this has been
> > > discussed and I missed it.
> > >
> > > 1. The KIP suggests that the reassignment tool is responsible for
> sending
> > > the ChangeReplicaDirRequests to the relevant brokers. I had imagined
> that
> > > this would be done by the Controller, like the rest of the reassignment
> > > process. Was this considered? If so, it would be good to include the
> > > details of why it was rejected in the "Rejected Alternatives" section.
> > >
> > > 2. The reassignment JSON format was extended so that one can choose the
> > log
> > > directory for a partition. This means that the log directory has to be
> > the
> > > same for all replicas of a given partition. The alternative would be
> for
> > > the log dir to be assignable for each replica. Similar to the other
> > > question, it would be good to have a section in "Rejected Alternatives"
> > for
> > > this approach. It's generally very helpful to have more information on
> > the
> > > rationale for the design choices that were made and rejected.
> > >
> > > 3. Should changeReplicaDir be alterReplicaDir? We have used `alter` for
> > > other methods.
> > >
> > > Thanks,
> > > Ismael
> > >
> > >
> > >
> > >
> > > On Fri, Aug 4, 2017 at 5:37 AM, Dong Lin <lindon...@gmail.com> wrote:
> > >
> > > > Hi all,
> > > >
> > > > I realized that we need a new API in AdminClient in order to use the new
> > > > request/response added in KIP-113. Since this is required by KIP-113, I
> > > > chose to add the new interface in this KIP instead of creating a new KIP.
> > > >
> > > > The documentation of the new API in AdminClient can be found here:
> > > > <https://cwiki.apache.org/confluence/display/KAFKA/KIP-113%3A+Support+replicas+movement+between+log+directories#KIP-113:Supportreplicasmovementbetweenlogdirectories-AdminClient>.
> > > > Can you please review and comment if you have any concerns?
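> > > >
> > > > As a usage illustration only (the method name comes from the discussion
> > > > earlier in this thread; the exact classes and signature are whatever the
> > > > wiki page above documents and may differ from this sketch):
> > > >
> > > >     import java.util.HashMap;
> > > >     import java.util.Map;
> > > >     import java.util.Properties;
> > > >     import org.apache.kafka.clients.admin.AdminClient;
> > > >     import org.apache.kafka.common.TopicPartitionReplica;
> > > >
> > > >     // Hedged sketch: ask broker 1 to move the replica of foo-0 to
> > > >     // /data/disk2. Treat alterReplicaDir and TopicPartitionReplica
> > > >     // as placeholders taken from the KIP draft.
> > > >     public class AlterReplicaDirExample {
> > > >       public static void main(String[] args) throws Exception {
> > > >         Properties props = new Properties();
> > > >         props.put("bootstrap.servers", "localhost:9092");
> > > >         try (AdminClient admin = AdminClient.create(props)) {
> > > >           Map<TopicPartitionReplica, String> moves = new HashMap<>();
> > > >           moves.put(
> > > >               new TopicPartitionReplica("foo", 0, 1), "/data/disk2");
> > > >           admin.alterReplicaDir(moves); // name/signature per the KIP draft
> > > >         }
> > > >       }
> > > >     }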
> > > >
> > > > Thanks!
> > > > Dong
> > > >
> > > >
> > > >
> > > > On Wed, Jul 12, 2017 at 10:44 AM, Dong Lin <lindon...@gmail.com>
> > wrote:
> > > >
> > > > > The protocol change has been updated in KIP-113
> > > > > <https://cwiki.apache.org/confluence/display/KAFKA/KIP-113%3A+Support+replicas+movement+between+log+directories>.
> > > > >
> > > > > On Wed, Jul 12, 2017 at 10:44 AM, Dong Lin <lindon...@gmail.com>
> > > wrote:
> > > > >
> > > > >> Hi all,
> > > > >>
> > > > >> I have made a minor change to the DescribeDirsRequest so that the user can
> > > > >> choose to query the status of a specific list of partitions. This is a bit
> > > > >> more fine-grained than the previous format, which allows the user to query
> > > > >> the status of a specific list of topics. I realized that querying the status
> > > > >> of selected partitions can be useful to check whether the reassignment of
> > > > >> the replicas to the specified log directories has been completed.
> > > > >>
> > > > >> I will assume this minor change is OK if there are no concerns about it in
> > > > >> the community :)
> > > > >>
> > > > >> Thanks,
> > > > >> Dong
> > > > >>
> > > > >>
> > > > >>
> > > > >> On Mon, Jun 12, 2017 at 10:46 AM, Dong Lin <lindon...@gmail.com>
> > > wrote:
> > > > >>
> > > > >>> Hey Colin,
> > > > >>>
> > > > >>> Thanks for the suggestion. We have actually considered this and listed
> > > > >>> it as the first item of future work in KIP-112
> > > > >>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD>.
> > > > >>> The two advantages that you mentioned are exactly the motivation for this
> > > > >>> feature. Also, as you have mentioned, this involves a tradeoff between
> > > > >>> disk performance and availability -- the more you distribute topics across
> > > > >>> disks, the more topics will be offline due to a single disk failure.
> > > > >>>
> > > > >>> Despite its complexity, it is not clear to me that the reduced rebalance
> > > > >>> overhead is worth the reduction in availability. I am optimistic that the
> > > > >>> rebalance overhead will not be that big a problem, since we are not too
> > > > >>> bothered by cross-broker rebalance as of now.
> > > > >>>
> > > > >>> Thanks,
> > > > >>> Dong
> > > > >>>
> > > > >>> On Mon, Jun 12, 2017 at 10:36 AM, Colin McCabe <
> cmcc...@apache.org
> > >
> > > > >>> wrote:
> > > > >>>
> > > > >>>> Has anyone considered a scheme for sharding topic data across
> > > multiple
> > > > >>>> disks?
> > > > >>>>
> > > > >>>> For example, if you sharded topics across 3 disks, and you had
> 10
> > > > disks,
> > > > >>>> you could pick a different set of 3 disks for each topic.  If
> you
> > > > >>>> distribute them randomly then you have 10 choose 3 = 120
> different
> > > > >>>> combinations.  You would probably never need rebalancing if you
> > had
> > > a
> > > > >>>> reasonable distribution of topic sizes (could probably prove
> this
> > > > with a
> > > > >>>> Monte Carlo or something).
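> > > > >>>>
> > > > >>>> For instance, a rough Monte Carlo along these lines could measure how
> > > > >>>> uneven the disks end up (exponentially distributed topic sizes are just an
> > > > >>>> assumption for this sketch):
> > > > >>>>
> > > > >>>>     import java.util.Random;
> > > > >>>>
> > > > >>>>     // Sketch: assign each topic to a random 3-of-10 disk subset,
> > > > >>>>     // spread its size evenly over those disks, and report the worst
> > > > >>>>     // max/average disk load seen across trials.
> > > > >>>>     public class ShardingMonteCarlo {
> > > > >>>>       public static void main(String[] args) {
> > > > >>>>         int disks = 10, topics = 200, trials = 1000;
> > > > >>>>         Random rnd = new Random(42);
> > > > >>>>         double worst = 0;
> > > > >>>>         for (int t = 0; t < trials; t++) {
> > > > >>>>           double[] load = new double[disks];
> > > > >>>>           for (int i = 0; i < topics; i++) {
> > > > >>>>             double size = -Math.log(rnd.nextDouble()); // exponential
> > > > >>>>             int a = rnd.nextInt(disks), b, c;
> > > > >>>>             do { b = rnd.nextInt(disks); } while (b == a);
> > > > >>>>             do { c = rnd.nextInt(disks); } while (c == a || c == b);
> > > > >>>>             load[a] += size / 3;
> > > > >>>>             load[b] += size / 3;
> > > > >>>>             load[c] += size / 3;
> > > > >>>>           }
> > > > >>>>           double max = 0, sum = 0;
> > > > >>>>           for (double l : load) { max = Math.max(max, l); sum += l; }
> > > > >>>>           worst = Math.max(worst, max / (sum / disks));
> > > > >>>>         }
> > > > >>>>         System.out.println("worst max/avg disk load: " + worst);
> > > > >>>>       }
> > > > >>>>     }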
> > > > >>>>
> > > > >>>> The disadvantage is that if one of the 3 disks fails, then you
> > have
> > > to
> > > > >>>> take the topic offline.  But if we assume independent disk
> failure
> > > > >>>> probabilities, probability of failure with RAID 0 is: 1 -
> > > > >>>> Psuccess^(num_disks) whereas the probability of failure with
> this
> > > > scheme
> > > > >>>> is 1 - Psuccess ^ 3.
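> > > > >>>>
> > > > >>>> (As a worked example, if each disk independently has a 2% chance of
> > > > >>>> failing in a given year, a 10-disk RAID 0 volume is affected with
> > > > >>>> probability 1 - 0.98^10, about 18%, while a topic spread over 3 disks is
> > > > >>>> affected with probability 1 - 0.98^3, about 6%.)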
> > > > >>>>
> > > > >>>> This addresses the biggest downsides of JBOD now:
> > > > >>>> * limiting a topic to the size of a single disk limits
> scalability
> > > > >>>> * the topic movement process is tricky to get right and involves
> > > > "racing
> > > > >>>> against producers" and wasted double I/Os
> > > > >>>>
> > > > >>>> Of course, one other question is how frequently we add new disk
> > > drives
> > > > >>>> to an existing broker.  In this case, you might reasonably want
> > disk
> > > > >>>> rebalancing to avoid overloading the new disk(s) with writes.
> > > > >>>>
> > > > >>>> cheers,
> > > > >>>> Colin
> > > > >>>>
> > > > >>>>
> > > > >>>> On Fri, Jun 9, 2017, at 18:46, Jun Rao wrote:
> > > > >>>> > Just a few comments on this.
> > > > >>>> >
> > > > >>>> > 1. One of the issues with using RAID 0 is that a single disk
> > > failure
> > > > >>>> > causes
> > > > >>>> > a hard failure of the broker. Hard failure increases the
> > > > >>>> unavailability
> > > > >>>> > window for all the partitions on the failed broker, which
> > includes
> > > > the
> > > > >>>> > failure detection time (tied to ZK session timeout right now)
> > and
> > > > >>>> leader
> > > > >>>> > election time by the controller. If we support JBOD natively,
> > > when a
> > > > >>>> > single
> > > > >>>> > disk fails, only partitions on the failed disk will
> experience a
> > > > hard
> > > > >>>> > failure. The availability of partitions on the rest of the disks
> > > > >>>> > is not affected.
> > > > >>>> >
> > > > >>>> > 2. For running things in the cloud such as AWS: currently, each EBS
> > > > >>>> > volume has a throughput limit of about 300MB/sec. If you get an
> > enhanced
> > > > EC2
> > > > >>>> > instance, you can get 20Gb/sec network. To saturate the
> network,
> > > you
> > > > >>>> may
> > > > >>>> > need about 7 EBS volumes. So, being able to support JBOD in
> the
> > > > Cloud
> > > > >>>> is
> > > > >>>> > still potentially useful.
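> > > > >>>> >
> > > > >>>> > (For reference, 20 Gb/sec is roughly 2500 MB/sec, and 2500 / 300 is
> > > > >>>> > about 8, i.e. on the order of 7-8 volumes.)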
> > > > >>>> >
> > > > >>>> > 3. On the benefit of balancing data across disks within the
> same
> > > > >>>> broker.
> > > > >>>> > Data imbalance can happen across brokers as well as across
> disks
> > > > >>>> within
> > > > >>>> > the
> > > > >>>> > same broker. Balancing the data across disks within the broker
> > has
> > > > the
> > > > >>>> > benefit of saving network bandwidth as Dong mentioned. So, if
> > > intra
> > > > >>>> > broker
> > > > >>>> > load balancing is possible, it's probably better to avoid the
> > more
> > > > >>>> > expensive inter broker load balancing. One of the reasons for
> > disk
> > > > >>>> > imbalance right now is that partitions within a broker are
> > > assigned
> > > > to
> > > > >>>> > disks just based on the partition count. So, it does seem
> > possible
> > > > for
> > > > >>>> > disks to get imbalanced from time to time. If someone can
> share
> > > some
> > > > >>>> > stats
> > > > >>>> > for that in practice, that will be very helpful.
> > > > >>>> >
> > > > >>>> > Thanks,
> > > > >>>> >
> > > > >>>> > Jun
> > > > >>>> >
> > > > >>>> >
> > > > >>>> > On Wed, Jun 7, 2017 at 2:30 PM, Dong Lin <lindon...@gmail.com
> >
> > > > wrote:
> > > > >>>> >
> > > > >>>> > > Hey Sriram,
> > > > >>>> > >
> > > > >>>> > > I think there is one way to explain why the ability to move
> > > > >>>> > > replicas between disks can save space. Let's say the load is
> > > > >>>> > > distributed to disks independent of the broker. Sooner or later,
> > > > >>>> > > the load imbalance will exceed a threshold and we will need to
> > > > >>>> > > rebalance load across disks. Now our question is whether our
> > > > >>>> > > rebalancing algorithm will be able to take advantage of locality
> > > > >>>> > > by moving replicas between disks on the same broker.
> > > > >>>> > >
> > > > >>>> > > Say for a given disk there is a 20% probability it is overloaded,
> > > > >>>> > > a 20% probability it is underloaded, and a 60% probability its
> > > > >>>> > > load is around the expected average load if the cluster is well
> > > > >>>> > > balanced. Then for a broker of 10 disks, we would expect 2 disks
> > > > >>>> > > to need in-bound replica movement, 2 disks to need out-bound
> > > > >>>> > > replica movement, and 6 disks to need no replica movement. Thus we
> > > > >>>> > > would expect KIP-113 to be useful, since we will be able to move
> > > > >>>> > > replicas from the two over-loaded disks to the two under-loaded
> > > > >>>> > > disks on the same broker. Does this make sense?
> > > > >>>> > >
> > > > >>>> > > Thanks,
> > > > >>>> > > Dong
> > > > >>>> > >
> > > > >>>> > >
> > > > >>>> > >
> > > > >>>> > >
> > > > >>>> > >
> > > > >>>> > >
> > > > >>>> > > On Wed, Jun 7, 2017 at 2:12 PM, Dong Lin <
> lindon...@gmail.com
> > >
> > > > >>>> wrote:
> > > > >>>> > >
> > > > >>>> > > > Hey Sriram,
> > > > >>>> > > >
> > > > >>>> > > > Thanks for raising these concerns. Let me answer these
> > > questions
> > > > >>>> below:
> > > > >>>> > > >
> > > > >>>> > > > - The benefit of the additional complexity to move the data
> > > > >>>> > > > stored on a disk within the broker is to avoid network bandwidth
> > > > >>>> > > > usage. Creating a replica on another broker is less efficient
> > > > >>>> > > > than creating a replica on another disk in the same broker IF
> > > > >>>> > > > there is actually a lightly-loaded disk on the same broker.
> > > > >>>> > > >
> > > > >>>> > > > - In my opinion the rebalance algorithm would work like this: 1)
> > > > >>>> > > > we balance the load across brokers using the same algorithm we
> > > > >>>> > > > are using today. 2) we balance load across the disks of a given
> > > > >>>> > > > broker using a greedy algorithm, i.e. move replicas from the
> > > > >>>> > > > overloaded disks to the lightly loaded disks (see the sketch
> > > > >>>> > > > after these bullet points). The greedy algorithm would only
> > > > >>>> > > > consider the capacity and replica size. We can improve it to
> > > > >>>> > > > consider throughput in the future.
> > > > >>>> > > >
> > > > >>>> > > > - With 30 brokers with each having 10 disks, using the
> > > > rebalancing
> > > > >>>> > > algorithm,
> > > > >>>> > > > the chances of choosing disks within the broker can be
> high.
> > > > >>>> There will
> > > > >>>> > > > always be load imbalance across disks of the same broker
> for
> > > the
> > > > >>>> same
> > > > >>>> > > > reason that there will always be load imbalance across
> > > brokers.
> > > > >>>> The
> > > > >>>> > > > algorithm specified above will take advantage of the
> > locality,
> > > > >>>> i.e. first
> > > > >>>> > > > balance load across disks of the same broker, and only
> > balance
> > > > >>>> across
> > > > >>>> > > > brokers if some brokers are much more loaded than others.
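> > > > >>>> > > >
> > > > >>>> > > > As an illustration of the greedy step in 2) above (sizes only;
> > > > >>>> > > > this sketch is not the proposed implementation):
> > > > >>>> > > >
> > > > >>>> > > >     import java.util.Collections;
> > > > >>>> > > >     import java.util.List;
> > > > >>>> > > >
> > > > >>>> > > >     // Move the largest replica on the fullest disk to the
> > > > >>>> > > >     // emptiest disk while that still narrows the gap.
> > > > >>>> > > >     static void rebalanceDisks(List<List<Long>> disks) {
> > > > >>>> > > >       while (true) {
> > > > >>>> > > >         long[] load = new long[disks.size()];
> > > > >>>> > > >         int hi = 0, lo = 0;
> > > > >>>> > > >         for (int d = 0; d < disks.size(); d++) {
> > > > >>>> > > >           for (long s : disks.get(d)) load[d] += s;
> > > > >>>> > > >           if (load[d] > load[hi]) hi = d;
> > > > >>>> > > >           if (load[d] < load[lo]) lo = d;
> > > > >>>> > > >         }
> > > > >>>> > > >         if (disks.get(hi).isEmpty()) return;
> > > > >>>> > > >         long big = Collections.max(disks.get(hi));
> > > > >>>> > > >         if (load[hi] - load[lo] <= big) return; // would overshoot
> > > > >>>> > > >         disks.get(hi).remove(Long.valueOf(big));
> > > > >>>> > > >         disks.get(lo).add(big);
> > > > >>>> > > >       }
> > > > >>>> > > >     }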
> > > > >>>> > > >
> > > > >>>> > > > I think it is useful to note that the load imbalance
> across
> > > > disks
> > > > >>>> of the
> > > > >>>> > > > same broker is independent of the load imbalance across
> > > brokers.
> > > > >>>> Both are
> > > > >>>> > > > guaranteed to happen in any Kafka cluster for the same
> > reason,
> > > > >>>> i.e.
> > > > >>>> > > > variation in partition sizes. Say broker 1 has two disks that are
> > > > >>>> > > > 80% loaded and 20% loaded, and broker 2 has two disks that are
> > > > >>>> > > > also 80% loaded and 20% loaded. We can balance them without
> > > > >>>> > > > inter-broker traffic with KIP-113. This is why I think KIP-113
> > > > >>>> > > > can be very useful.
> > > > >>>> > > >
> > > > >>>> > > > Do these explanations sound reasonable?
> > > > >>>> > > >
> > > > >>>> > > > Thanks,
> > > > >>>> > > > Dong
> > > > >>>> > > >
> > > > >>>> > > >
> > > > >>>> > > > On Wed, Jun 7, 2017 at 1:33 PM, Sriram Subramanian <
> > > > >>>> r...@confluent.io>
> > > > >>>> > > > wrote:
> > > > >>>> > > >
> > > > >>>> > > >> Hey Dong,
> > > > >>>> > > >>
> > > > >>>> > > >> Thanks for the explanation. I don't think anyone is
> denying
> > > > that
> > > > >>>> we
> > > > >>>> > > should
> > > > >>>> > > >> rebalance at the disk level. I think it is important to
> > > restore
> > > > >>>> the disk
> > > > >>>> > > >> and not wait for disk replacement. Another benefit of doing
> > > > >>>> > > >> that is that you don't need to opt for hot swap
> racks
> > > > that
> > > > >>>> can
> > > > >>>> > > save
> > > > >>>> > > >> cost.
> > > > >>>> > > >>
> > > > >>>> > > >> The question here is what do you save by trying to add
> > > > >>>> complexity to
> > > > >>>> > > move
> > > > >>>> > > >> the data stored on a disk within the broker? Why would
> you
> > > not
> > > > >>>> simply
> > > > >>>> > > >> create another replica on the disk that results in a
> > balanced
> > > > >>>> load
> > > > >>>> > > across
> > > > >>>> > > >> brokers and have it catch up? We are missing a few things
> > > here
> > > > -
> > > > >>>> > > >> 1. What would your data balancing algorithm be? Would it
> > > > include
> > > > >>>> just
> > > > >>>> > > >> capacity or will it also consider throughput on disk to
> > > decide
> > > > >>>> on the
> > > > >>>> > > >> final
> > > > >>>> > > >> location of a partition?
> > > > >>>> > > >> 2. With 30 brokers with each having 10 disks, using the
> > > > >>>> rebalancing
> > > > >>>> > > >> algorithm, the chances of choosing disks within the
> broker
> > is
> > > > >>>> going to
> > > > >>>> > > be
> > > > >>>> > > >> low. This probability further decreases with more brokers
> > and
> > > > >>>> disks.
> > > > >>>> > > Given
> > > > >>>> > > >> that, why are we trying to save network cost? How much
> > would
> > > > >>>> that saving
> > > > >>>> > > >> be
> > > > >>>> > > >> if you go that route?
> > > > >>>> > > >>
> > > > >>>> > > >> These questions are hard to answer without having to
> verify
> > > > >>>> empirically.
> > > > >>>> > > >> My
> > > > >>>> > > >> suggestion is to avoid doing premature optimization
> that
> > > > >>>> brings in the
> > > > >>>> > > >> added complexity to the code and treat inter and intra
> > broker
> > > > >>>> movements
> > > > >>>> > > of
> > > > >>>> > > >> partition the same. Deploy the code, use it and see if it
> > is
> > > an
> > > > >>>> actual
> > > > >>>> > > >> problem and you get great savings by avoiding the network
> > > route
> > > > >>>> to move
> > > > >>>> > > >> partitions within the same broker. If so, add this
> > > > optimization.
> > > > >>>> > > >>
> > > > >>>> > > >> On Wed, Jun 7, 2017 at 1:03 PM, Dong Lin <
> > > lindon...@gmail.com>
> > > > >>>> wrote:
> > > > >>>> > > >>
> > > > >>>> > > >> > Hey Jay, Sriram,
> > > > >>>> > > >> >
> > > > >>>> > > >> > Great point. If I understand you right, you are
> > suggesting
> > > > >>>> that we can
> > > > >>>> > > >> > simply use RAID-0 so that the load can be evenly
> > > distributed
> > > > >>>> across
> > > > >>>> > > >> disks.
> > > > >>>> > > >> > And even though a disk failure will bring down the entire
> > > > >>>> > > >> > broker, the reduced availability as compared to using KIP-112
> > > > >>>> > > >> > and KIP-113 may be
> > > > >>>> > > >> > negligible. And it may be better to just accept the
> > > slightly
> > > > >>>> reduced
> > > > >>>> > > >> > availability instead of introducing the complexity from
> > > > >>>> KIP-112 and
> > > > >>>> > > >> > KIP-113.
> > > > >>>> > > >> >
> > > > >>>> > > >> > Let's assume the following:
> > > > >>>> > > >> >
> > > > >>>> > > >> > - There are 30 brokers in a cluster and each broker has
> > 10
> > > > >>>> disks
> > > > >>>> > > >> > - The replication factor is 3 and min.isr = 2.
> > > > >>>> > > >> > - The probability of annual disk failure rate is 2%
> > > according
> > > > >>>> to this
> > > > >>>> > > >> > <https://www.backblaze.com/blog/hard-drive-failure-rates-q1-2017/> blog.
> > > > >>>> > > >> > - It takes 3 days to replace a disk.
> > > > >>>> > > >> >
> > > > >>>> > > >> > Here is my calculation for probability of data loss due
> > to
> > > > disk
> > > > >>>> > > failure:
> > > > >>>> > > >> > probability of a given disk fails in a given year: 2%
> > > > >>>> > > >> > probability of a given disk stays offline for one day
> in
> > a
> > > > >>>> given day:
> > > > >>>> > > >> 2% /
> > > > >>>> > > >> > 365 * 3
> > > > >>>> > > >> > probability of a given broker stays offline for one day
> > in
> > > a
> > > > >>>> given day
> > > > >>>> > > >> due
> > > > >>>> > > >> > to disk failure: 2% / 365 * 3 * 10
> > > > >>>> > > >> > probability of any broker stays offline for one day in
> a
> > > > given
> > > > >>>> day due
> > > > >>>> > > >> to
> > > > >>>> > > >> > disk failure: 2% / 365 * 3 * 10 * 30 = 5%
> > > > >>>> > > >> > probability of any three broker stays offline for one
> day
> > > in
> > > > a
> > > > >>>> given
> > > > >>>> > > day
> > > > >>>> > > >> > due to disk failure: 5% * 5% * 5% = 0.0125%
> > > > >>>> > > >> > probability of data loss due to disk failure: 0.0125%
> > > > >>>> > > >> >
> > > > >>>> > > >> > Here is my calculation for probability of service
> > > > >>>> unavailability due
> > > > >>>> > > to
> > > > >>>> > > >> > disk failure:
> > > > >>>> > > >> > probability of a given disk fails in a given year: 2%
> > > > >>>> > > >> > probability of a given disk stays offline for one day
> in
> > a
> > > > >>>> given day:
> > > > >>>> > > >> 2% /
> > > > >>>> > > >> > 365 * 3
> > > > >>>> > > >> > probability of a given broker stays offline for one day
> > in
> > > a
> > > > >>>> given day
> > > > >>>> > > >> due
> > > > >>>> > > >> > to disk failure: 2% / 365 * 3 * 10
> > > > >>>> > > >> > probability of any broker stays offline for one day in
> a
> > > > given
> > > > >>>> day due
> > > > >>>> > > >> to
> > > > >>>> > > >> > disk failure: 2% / 365 * 3 * 10 * 30 = 5%
> > > > >>>> > > >> > probability of any two broker stays offline for one day
> > in
> > > a
> > > > >>>> given day
> > > > >>>> > > >> due
> > > > >>>> > > >> > to disk failure: 5% * 5% = 0.25%
> > > > >>>> > > >> > probability of unavailability due to disk failure:
> 0.25%
> > > > >>>> > > >> >
> > > > >>>> > > >> > Note that the unavailability due to disk failure will
> be
> > > > >>>> unacceptably
> > > > >>>> > > >> high
> > > > >>>> > > >> > in this case. And the probability of data loss due to
> > disk
> > > > >>>> failure
> > > > >>>> > > will
> > > > >>>> > > >> be
> > > > >>>> > > >> > higher than 0.01%. Neither is acceptable if Kafka is
> > > intended
> > > > >>>> to
> > > > >>>> > > achieve
> > > > >>>> > > >> > four nines availability.
> > > > >>>> > > >> >
> > > > >>>> > > >> > Thanks,
> > > > >>>> > > >> > Dong
> > > > >>>> > > >> >
> > > > >>>> > > >> >
> > > > >>>> > > >> > On Tue, Jun 6, 2017 at 11:26 PM, Jay Kreps <
> > > j...@confluent.io
> > > > >
> > > > >>>> wrote:
> > > > >>>> > > >> >
> > > > >>>> > > >> > > I think Ram's point is that in-place failure handling is
> > > > >>>> > > >> > > pretty complicated, and since this is meant to be a cost
> > > > >>>> > > >> > > saving feature, we should construct an argument for it
> > > > >>>> > > >> > > grounded in data.
> > > > >>>> > > >> > >
> > > > >>>> > > >> > > Assume an annual failure rate of 1% (reasonable, but
> > data
> > > > is
> > > > >>>> > > available
> > > > >>>> > > >> > > online), and assume it takes 3 days to get the drive
> > > > >>>> replaced. Say
> > > > >>>> > > you
> > > > >>>> > > >> > have
> > > > >>>> > > >> > > 10 drives per server. Then the expected downtime for
> > each
> > > > >>>> server is
> > > > >>>> > > >> > roughly
> > > > >>>> > > >> > > 1% * 3 days * 10 = 0.3 days/year (this is slightly
> off
> > > > since
> > > > >>>> I'm
> > > > >>>> > > >> ignoring
> > > > >>>> > > >> > > the case of multiple failures, but I don't know that
> > > > changes
> > > > >>>> it
> > > > >>>> > > >> much). So
> > > > >>>> > > >> > > the savings from this feature is 0.3/365 = 0.08%. Say
> > you
> > > > >>>> have 1000
> > > > >>>> > > >> > servers
> > > > >>>> > > >> > > and they cost $3000/year fully loaded including
> power,
> > > the
> > > > >>>> cost of
> > > > >>>> > > >> the hw
> > > > >>>> > > >> > > amortized over it's life, etc. Then this feature
> saves
> > > you
> > > > >>>> $3000 on
> > > > >>>> > > >> your
> > > > >>>> > > >> > > total server cost of $3m which seems not very
> > worthwhile
> > > > >>>> compared to
> > > > >>>> > > >> > other
> > > > >>>> > > >> > > optimizations...?
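> > > > >>>> > > >> > >
> > > > >>>> > > >> > > (Checking that: 0.3 / 365 is about 0.08%, and 0.08% of $3m
> > > > >>>> > > >> > > is roughly $2,400, so the $3k figure is the right ballpark.)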
> > > > >>>> > > >> > >
> > > > >>>> > > >> > > Anyhow, not sure the arithmetic is right there, but I
> > > think
> > > > >>>> that is
> > > > >>>> > > >> the
> > > > >>>> > > >> > > type of argument that would be helpful to think about
> > the
> > > > >>>> tradeoff
> > > > >>>> > > in
> > > > >>>> > > >> > > complexity.
> > > > >>>> > > >> > >
> > > > >>>> > > >> > > -Jay
> > > > >>>> > > >> > >
> > > > >>>> > > >> > >
> > > > >>>> > > >> > >
> > > > >>>> > > >> > > On Tue, Jun 6, 2017 at 7:09 PM, Dong Lin <
> > > > >>>> lindon...@gmail.com>
> > > > >>>> > > wrote:
> > > > >>>> > > >> > >
> > > > >>>> > > >> > > > Hey Sriram,
> > > > >>>> > > >> > > >
> > > > >>>> > > >> > > > Thanks for taking time to review the KIP. Please
> see
> > > > below
> > > > >>>> my
> > > > >>>> > > >> answers
> > > > >>>> > > >> > to
> > > > >>>> > > >> > > > your questions:
> > > > >>>> > > >> > > >
> > > > >>>> > > >> > > > >1. Could you pick a hardware/Kafka configuration
> and
> > > go
> > > > >>>> over what
> > > > >>>> > > >> is
> > > > >>>> > > >> > the
> > > > >>>> > > >> > > > >average disk/partition repair/restore time that we
> > are
> > > > >>>> targeting
> > > > >>>> > > >> for a
> > > > >>>> > > >> > > > >typical JBOD setup?
> > > > >>>> > > >> > > >
> > > > >>>> > > >> > > > We currently don't have this data. I think the
> > > > >>>> disk/partition
> > > > >>>> > > repair/restore
> > > > >>>> > > >> > > > time depends on availability of hardware, the
> > response
> > > > >>>> time of
> > > > >>>> > > >> > > > site-reliability engineer, the amount of data on
> the
> > > bad
> > > > >>>> disk etc.
> > > > >>>> > > >> > These
> > > > >>>> > > >> > > > vary between companies and even clusters within the
> > > same
> > > > >>>> company
> > > > >>>> > > >> and it
> > > > >>>> > > >> > > is
> > > > >>>> > > >> > > > probably hard to determine what is the average
> > > situation.
> > > > >>>> > > >> > > >
> > > > >>>> > > >> > > > I am not very sure why we need this. Can you
> explain
> > a
> > > > bit
> > > > >>>> why
> > > > >>>> > > this
> > > > >>>> > > >> > data
> > > > >>>> > > >> > > is
> > > > >>>> > > >> > > > useful to evaluate the motivation and design of
> this
> > > KIP?
> > > > >>>> > > >> > > >
> > > > >>>> > > >> > > > >2. How often do we believe disks are going to fail
> > (in
> > > > >>>> your
> > > > >>>> > > example
> > > > >>>> > > >> > > > >configuration) and what do we gain by avoiding the
> > > > network
> > > > >>>> > > overhead
> > > > >>>> > > >> > and
> > > > >>>> > > >> > > > >doing all the work of moving the replica within
> the
> > > > >>>> broker to
> > > > >>>> > > >> another
> > > > >>>> > > >> > > disk
> > > > >>>> > > >> > > > >instead of balancing it globally?
> > > > >>>> > > >> > > >
> > > > >>>> > > >> > > > I think the chance of disk failure depends mainly
> on
> > > the
> > > > >>>> disk
> > > > >>>> > > itself
> > > > >>>> > > >> > > rather
> > > > >>>> > > >> > > > than the broker configuration. I don't have this
> data
> > > > now.
> > > > >>>> I will
> > > > >>>> > > >> ask
> > > > >>>> > > >> > our
> > > > >>>> > > >> > > > SRE whether they know the mean-time-to-fail for our
> > > disk.
> > > > >>>> What I
> > > > >>>> > > was
> > > > >>>> > > >> > told
> > > > >>>> > > >> > > > by SRE is that disk failure is the most common type
> > of
> > > > >>>> hardware
> > > > >>>> > > >> > failure.
> > > > >>>> > > >> > > >
> > > > >>>> > > >> > > > When there is disk failure, I think it is
> reasonable
> > to
> > > > >>>> move
> > > > >>>> > > >> replica to
> > > > >>>> > > >> > > > another broker instead of another disk on the same
> > > > broker.
> > > > >>>> The
> > > > >>>> > > >> reason
> > > > >>>> > > >> > we
> > > > >>>> > > >> > > > want to move replica within broker is mainly to
> > > optimize
> > > > >>>> the Kafka
> > > > >>>> > > >> > > cluster
> > > > >>>> > > >> > > > performance when we balance load across disks.
> > > > >>>> > > >> > > >
> > > > >>>> > > >> > > > In comparison to balancing replicas globally, the
> > > benefit
> > > > >>>> of
> > > > >>>> > > moving
> > > > >>>> > > >> > > replica
> > > > >>>> > > >> > > > within broker is that:
> > > > >>>> > > >> > > >
> > > > >>>> > > >> > > > 1) the movement is faster since it doesn't go
> through
> > > > >>>> socket or
> > > > >>>> > > >> rely on
> > > > >>>> > > >> > > the
> > > > >>>> > > >> > > > available network bandwidth;
> > > > >>>> > > >> > > > 2) much less impact on the replication traffic
> > between
> > > > >>>> broker by
> > > > >>>> > > not
> > > > >>>> > > >> > > taking
> > > > >>>> > > >> > > > up bandwidth between brokers. Depending on the
> > pattern
> > > of
> > > > >>>> traffic,
> > > > >>>> > > >> we
> > > > >>>> > > >> > may
> > > > >>>> > > >> > > > need to balance load across disk frequently and it
> is
> > > > >>>> necessary to
> > > > >>>> > > >> > > prevent
> > > > >>>> > > >> > > > this operation from slowing down the existing
> > operation
> > > > >>>> (e.g.
> > > > >>>> > > >> produce,
> > > > >>>> > > >> > > > consume, replication) in the Kafka cluster.
> > > > >>>> > > >> > > > 3) It gives us opportunity to do automatic broker
> > > > rebalance
> > > > >>>> > > between
> > > > >>>> > > >> > disks
> > > > >>>> > > >> > > > on the same broker.
> > > > >>>> > > >> > > >
> > > > >>>> > > >> > > >
> > > > >>>> > > >> > > > >3. Even if we had to move the replica within the
> > > broker,
> > > > >>>> why
> > > > >>>> > > >> cannot we
> > > > >>>> > > >> > > > just
> > > > >>>> > > >> > > > >treat it as another replica and have it go through
> > the
> > > > >>>> same
> > > > >>>> > > >> > replication
> > > > >>>> > > >> > > > >code path that we have today? The downside here is
> > > > >>>> obviously that
> > > > >>>> > > >> you
> > > > >>>> > > >> > > need
> > > > >>>> > > >> > > > >to catchup from the leader but it is completely
> > free!
> > > > >>>> What do we
> > > > >>>> > > >> think
> > > > >>>> > > >> > > is
> > > > >>>> > > >> > > > >the impact of the network overhead in this case?
> > > > >>>> > > >> > > >
> > > > >>>> > > >> > > > Good point. My initial proposal actually used the
> > > > existing
> > > > >>>> > > >> > > > ReplicaFetcherThread (i.e. the existing code path)
> to
> > > > move
> > > > >>>> replica
> > > > >>>> > > >> > > between
> > > > >>>> > > >> > > > disks. However, I switched to use separate thread
> > pool
> > > > >>>> after
> > > > >>>> > > >> discussion
> > > > >>>> > > >> > > > with Jun and Becket.
> > > > >>>> > > >> > > >
> > > > >>>> > > >> > > > The main argument for using separate thread pool is
> > to
> > > > >>>> actually
> > > > >>>> > > keep
> > > > >>>> > > >> > the
> > > > >>>> > > >> > > > design simply and easy to reason about. There are a
> > > > number
> > > > >>>> of
> > > > >>>> > > >> > difference
> > > > >>>> > > >> > > > between inter-broker replication and intra-broker
> > > > >>>> replication
> > > > >>>> > > which
> > > > >>>> > > >> > makes
> > > > >>>> > > >> > > > it cleaner to do them in separate code path. I will
> > > list
> > > > >>>> them
> > > > >>>> > > below:
> > > > >>>> > > >> > > >
> > > > >>>> > > >> > > > - The throttling mechanism for inter-broker
> > replication
> > > > >>>> traffic
> > > > >>>> > > and
> > > > >>>> > > >> > > > intra-broker replication traffic is different. For
> > > > >>>> example, we may
> > > > >>>> > > >> want
> > > > >>>> > > >> > > to
> > > > >>>> > > >> > > > specify per-topic quota for inter-broker
> replication
> > > > >>>> traffic
> > > > >>>> > > >> because we
> > > > >>>> > > >> > > may
> > > > >>>> > > >> > > > want some topic to be moved faster than other
> topic.
> > > But
> > > > >>>> we don't
> > > > >>>> > > >> care
> > > > >>>> > > >> > > > about priority of topics for intra-broker movement.
> > So
> > > > the
> > > > >>>> current
> > > > >>>> > > >> > > proposal
> > > > >>>> > > >> > > > only allows user to specify per-broker quota for
> > > > >>>> inter-broker
> > > > >>>> > > >> > replication
> > > > >>>> > > >> > > > traffic.
> > > > >>>> > > >> > > >
> > > > >>>> > > >> > > > - The quota value for inter-broker replication
> > traffic
> > > > and
> > > > >>>> > > >> intra-broker
> > > > >>>> > > >> > > > replication traffic is different. The available
> > > bandwidth
> > > > >>>> for
> > > > >>>> > > >> > > inter-broker
> > > > >>>> > > >> > > > replication can probably be much higher than the
> > > > bandwidth
> > > > >>>> for
> > > > >>>> > > >> > > inter-broker
> > > > >>>> > > >> > > > replication.
> > > > >>>> > > >> > > >
> > > > >>>> > > >> > > > - The ReplicaFetchThread is per broker.
> Intuitively,
> > > the
> > > > >>>> number of
> > > > >>>> > > >> > > threads
> > > > >>>> > > >> > > > doing intra broker data movement should be related
> to
> > > the
> > > > >>>> number
> > > > >>>> > > of
> > > > >>>> > > >> > disks
> > > > >>>> > > >> > > > in the broker, not the number of brokers in the
> > > cluster.
> > > > >>>> > > >> > > >
> > > > >>>> > > >> > > > - The leader replica has no ReplicaFetchThread to
> > start
> > > > >>>> with. It
> > > > >>>> > > >> seems
> > > > >>>> > > >> > > > weird to
> > > > >>>> > > >> > > > start one just for intra-broker replication.
> > > > >>>> > > >> > > >
> > > > >>>> > > >> > > > Because of these difference, we think it is simpler
> > to
> > > > use
> > > > >>>> > > separate
> > > > >>>> > > >> > > thread
> > > > >>>> > > >> > > > pool and code path so that we can configure and
> > > throttle
> > > > >>>> them
> > > > >>>> > > >> > separately.
> > > > >>>> > > >> > > >
> > > > >>>> > > >> > > >
> > > > >>>> > > >> > > > >4. What are the chances that we will be able to
> > > identify
> > > > >>>> another
> > > > >>>> > > >> disk
> > > > >>>> > > >> > to
> > > > >>>> > > >> > > > >balance within the broker instead of another disk
> on
> > > > >>>> another
> > > > >>>> > > >> broker?
> > > > >>>> > > >> > If
> > > > >>>> > > >> > > we
> > > > >>>> > > >> > > > >have 100's of machines, the probability of
> finding a
> > > > >>>> better
> > > > >>>> > > >> balance by
> > > > >>>> > > >> > > > >choosing another broker is much higher than
> > balancing
> > > > >>>> within the
> > > > >>>> > > >> > broker.
> > > > >>>> > > >> > > > >Could you add some info on how we are determining
> > > this?
> > > > >>>> > > >> > > >
> > > > >>>> > > >> > > > It is possible that we can find available space on
> a
> > > > remote
> > > > >>>> > > broker.
> > > > >>>> > > >> The
> > > > >>>> > > >> > > > benefit of allowing intra-broker replication is
> that,
> > > > when
> > > > >>>> there
> > > > >>>> > > are
> > > > >>>> > > >> > > > available space in both the current broker and a
> > remote
> > > > >>>> broker,
> > > > >>>> > > the
> > > > >>>> > > >> > > > rebalance can be completed faster with much less
> > impact
> > > > on
> > > > >>>> the
> > > > >>>> > > >> > > inter-broker
> > > > >>>> > > >> > > > replication or the users traffic. It is about
> taking
> > > > >>>> advantage of
> > > > >>>> > > >> > > locality
> > > > >>>> > > >> > > > when balance the load.
> > > > >>>> > > >> > > >
> > > > >>>> > > >> > > > >5. Finally, in a cloud setup where more users are
> > > going
> > > > to
> > > > >>>> > > >> leverage a
> > > > >>>> > > >> > > > >shared filesystem (example, EBS in AWS), all this
> > > change
> > > > >>>> is not
> > > > >>>> > > of
> > > > >>>> > > >> > much
> > > > >>>> > > >> > > > >gain since you don't need to balance between the
> > > volumes
> > > > >>>> within
> > > > >>>> > > the
> > > > >>>> > > >> > same
> > > > >>>> > > >> > > > >broker.
> > > > >>>> > > >> > > >
> > > > >>>> > > >> > > > You are right. This KIP-113 is useful only if user
> > uses
> > > > >>>> JBOD. If
> > > > >>>> > > >> user
> > > > >>>> > > >> > > uses
> > > > >>>> > > >> > > > an extra storage layer of replication, such as
> > RAID-10
> > > or
> > > > >>>> EBS,
> > > > >>>> > > they
> > > > >>>> > > >> > don't
> > > > >>>> > > >> > > > need KIP-112 or KIP-113. Note that user will
> > replicate
> > > > >>>> data more
> > > > >>>> > > >> times
> > > > >>>> > > >> > > than
> > > > >>>> > > >> > > > the replication factor of the Kafka topic if an
> extra
> > > > >>>> storage
> > > > >>>> > > layer
> > > > >>>> > > >> of
> > > > >>>> > > >> > > > replication is used.
> > > > >>>> > > >> > > >
> > > > >>>> > > >> > >
> > > > >>>> > > >> >
> > > > >>>> > > >>
> > > > >>>> > > >
> > > > >>>> > > >
> > > > >>>> > >
> > > > >>>>
> > > > >>>
> > > > >>>
> > > > >>
> > > > >
> > > >
> > >
> >
>
