This all makes a lot of sense, and it mirrors my own thinking now that I've
finally taken some time to walk through the scenarios for why we move
partitions around.

What I'm wondering is whether it makes sense to have a conversation around
breaking out the controller entirely, separating it from the brokers, and
starting to add this intelligence there. I don't think anyone will disagree
that the controller needs a sizable amount of work. This definitely wouldn't
be the first project to separate the brains from the dumb worker processes.

-Todd


On Thu, Aug 18, 2016 at 10:53 AM, Gwen Shapira <g...@confluent.io> wrote:

> Just my take, since Jun and Ben originally wanted to solve the more
> general problem and I talked them out of it :)
>
> When we first add the feature, safety is probably most important in
> getting people to adopt it - I wanted to make the feature very safe by
> never throttling something admins don't want to throttle. So we figured
> the manual approach, while more challenging to configure, is the safest.
> Admins usually know which replicas are "at risk" of taking over and can
> choose to throttle them accordingly, and they can build their own
> integration with monitoring tools, etc.
>
> It feels like any "smarts" we try and build into Kafka can be done
> better with external tools that can watch both Kafka traffic (with the
> new metrics) and things like network and CPU monitors.
>
> We are open to a smarter approach in Kafka, but perhaps we can plan it for
> a follow-up KIP? Maybe even after we have some experience with the manual
> approach and how best to make throttling decisions.
> This is similar to what we did with choosing partitions to move around - we
> started manually, admins are gaining experience with how they like to
> choose replicas, and then we can bake their expertise into the product.
>
> Gwen
>
> On Thu, Aug 18, 2016 at 10:29 AM, Jun Rao <j...@confluent.io> wrote:
> > Joel,
> >
> > Yes, for your second comment. The tricky thing is still to figure out which replicas to throttle and by how much, since in general admins probably don't want already in-sync or close-to-in-sync replicas to be throttled. It would be great to get Todd's opinion on this. Could you ping him?
> >
> > Yes, we'd be happy to discuss auto-detection of effect traffic more offline.
> >
> > Thanks,
> >
> > Jun
> >
> > On Thu, Aug 18, 2016 at 10:21 AM, Joel Koshy <jjkosh...@gmail.com> wrote:
> >
> >> > For your first comment. We thought about determining "effect" replicas automatically as well. First, there are some tricky stuff that one has to
> >> >
> >>
> >> Auto-detection of effect traffic: I'm fairly certain it's doable but definitely tricky. I'm also not sure it is something worth tackling at the outset. If we want to spend more time thinking it over, even if it's just an academic exercise, I would be happy to brainstorm offline.
> >>
> >>
> >> > For your second comment, we discussed that in the client quotas design. A downside of that for client quotas is that a client may be surprised that its traffic is not throttled at one time, but throttled at another with the same quota (basically, less predictability). You can imagine setting a quota for all replication traffic and only slowing down the "effect" replicas if needed. The thought is more or less the same as the above. It requires more
> >> >
> >>
> >> For clients, this is true. I think this is much less of an issue for
> >> server-side replication since the "users" here are the Kafka SREs who
> >> generally know these internal details.
> >>
> >> I think it would be valuable to get some feedback from SREs on the proposal before proceeding to a vote. (ping Todd)
> >>
> >> Joel
> >>
> >>
> >> >
> >> > On Thu, Aug 18, 2016 at 9:37 AM, Ben Stopford <b...@confluent.io> wrote:
> >> >
> >> > > Hi Joel
> >> > >
> >> > > Ha! Yes, we had some similar thoughts, on both counts. Both are actually good approaches, but they come with some extra complexity.
> >> > >
> >> > > Segregating the replication type is tempting, as it creates a more general solution. One issue is that you need to draw a line between lagging and not lagging. The ISR 'limit' is a tempting divider, but it has the side effect that, once you drop out, you get immediately throttled. Adding a configurable divider is another option, but it is difficult for admins to set, and always a little arbitrary. A better idea is to prioritise in reverse order of lag, but that also comes with additional complexity of its own.
> >> > >
> >> > > Under-throttling is also a tempting addition. That's to say, if there's idle bandwidth lying around, not being used, why not use it to let lagging brokers catch up? This involves some comparison to the maximum bandwidth, which could be configurable or could be derived, with pros and cons for each.
> >> > >
> >> > > But the more general problem is actually quite hard to reason about, so after some discussion we decided to settle on something simple that we felt we could get working, and extend to add these additional features as subsequent KIPs.
> >> > >
> >> > > I hope that seems reasonable. Jun may wish to add to this.
> >> > >
> >> > > B
> >> > >
> >> > >
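For what it's worth, a toy sketch of the "prioritise in reverse order to lag" idea Ben mentions above (an approach the KIP did not adopt); the ReplicaLag type and the lag numbers are purely illustrative and not part of the KIP or the broker code:

// Illustrative only: order throttled replicas so the most-lagging ones
// get the first chance at whatever bandwidth allowance remains.
case class ReplicaLag(topicPartition: String, lagBytes: Long)

object LagPriorityExample extends App {
  val throttled = Seq(
    ReplicaLag("topicA-0", lagBytes = 50L * 1024 * 1024),
    ReplicaLag("topicB-3", lagBytes = 900L * 1024 * 1024),
    ReplicaLag("topicC-1", lagBytes = 10L * 1024 * 1024))

  // Sort descending by lag: topicB-3, then topicA-0, then topicC-1.
  val fillOrder = throttled.sortBy(r => -r.lagBytes)
  fillOrder.foreach(r => println(s"${r.topicPartition} is ${r.lagBytes} bytes behind"))
}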
> >> > > > On 18 Aug 2016, at 06:56, Joel Koshy <jjkosh...@gmail.com> wrote:
> >> > > >
> >> > > > On Wed, Aug 17, 2016 at 9:13 PM, Ben Stopford <b...@confluent.io> wrote:
> >> > > >
> >> > > >>
> >> > > >> Let us know if you have any further thoughts on KIP-73, else we'll kick off a vote.
> >> > > >>
> >> > > >
> >> > > > I think the mechanism for throttling replicas looks good. Just had a few more thoughts on the configuration section. What you have looks reasonable, but I was wondering if it could be made simpler. You probably thought through these, so I'm curious to know your take.
> >> > > >
> >> > > > My guess is that most of the time, users would want to throttle all effect replication - due to partition reassignments, adding brokers, or a broker coming back online after an extended period of time. In all these scenarios it may be possible to distinguish bootstrap (effect) vs normal replication - based on how far the replica has to catch up. I'm wondering if it is enough to just set an umbrella "effect" replication quota with perhaps per-topic overrides (say if some topics are more important than others) as opposed to designating throttled replicas.
> >> > > >
> >> > > > Also, IIRC during client-side quota discussions we had considered the possibility of allowing clients to go above their quotas when resources are available. We ended up not doing that, but for replication throttling it may make sense - i.e., to treat the quota as a soft limit. Another way to look at it is that instead of ensuring "effect replication traffic does not flow faster than X bytes/sec" it may be useful to instead ensure that "effect replication traffic only flows as slowly as necessary (so as not to adversely affect normal replication traffic)."
> >> > > >
> >> > > > Thanks,
> >> > > >
> >> > > > Joel
> >> > > >
> >> > > >>>
> >> > > >>>> On Thu, Aug 11, 2016 at 2:43 PM, Jun Rao <j...@confluent.io> wrote:
> >> > > >>>>
> >> > > >>>>> Hi, Joel,
> >> > > >>>>>
> >> > > >>>>> Yes, the response size includes both throttled and unthrottled replicas. However, the response is only delayed up to max.wait if the response size is less than min.bytes, which matches the current behavior. So, there is no extra delay due to throttling, right? For replica fetchers, the default min.bytes is 1. So, the response is only delayed if there is no byte in the response, which is what we want.
> >> > > >>>>>
> >> > > >>>>> Thanks,
> >> > > >>>>>
> >> > > >>>>> Jun
> >> > > >>>>>
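To make the rule Jun describes above a little more concrete, here is a minimal sketch of the delay decision; the names are illustrative assumptions, not the broker's actual fetch-purgatory code:

// A response is held (up to maxWaitMs) only while the bytes allowed into
// it -- throttled partitions already excluded -- stay below minBytes.
// With the replica-fetcher default of minBytes = 1, the response is
// delayed only when it would otherwise be completely empty.
object FetchDelayRule {
  def shouldDelay(allowedResponseBytes: Long,
                  minBytes: Int,
                  elapsedMs: Long,
                  maxWaitMs: Long): Boolean =
    allowedResponseBytes < minBytes && elapsedMs < maxWaitMs

  def main(args: Array[String]): Unit = {
    // Throttling removed every byte: hold the response until max.wait expires.
    println(shouldDelay(allowedResponseBytes = 0, minBytes = 1, elapsedMs = 100, maxWaitMs = 500))    // true
    // At least one unthrottled byte is available: respond immediately.
    println(shouldDelay(allowedResponseBytes = 4096, minBytes = 1, elapsedMs = 100, maxWaitMs = 500)) // false
  }
}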
> >> > > >>>>> On Thu, Aug 11, 2016 at 11:53 AM, Joel Koshy <jjkosh...@gmail.com> wrote:
> >> > > >>>>>
> >> > > >>>>>> Hi Jun,
> >> > > >>>>>>
> >> > > >>>>>> I'm not sure that would work unless we have separate replica fetchers, since this would cause all replicas (including ones that are not throttled) to get delayed. Instead, we could just have the leader populate the throttle-time field of the response as a hint to the follower as to how long it should wait before it adds those replicas back to its subsequent replica fetch requests.
> >> > > >>>>>>
> >> > > >>>>>> Thanks,
> >> > > >>>>>>
> >> > > >>>>>> Joel
> >> > > >>>>>>
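A rough sketch of the follower-side back-off Joel suggests here, assuming a hypothetical per-partition throttle-time hint carried in the fetch response; this is one possible shape for the idea, not what KIP-73 actually specifies:

// Illustrative follower-side back-off: a partition that came back empty
// with a throttle-time hint is left out of fetch requests until the
// hinted time has elapsed.
import scala.collection.mutable

class ThrottledPartitionBackoff {
  private val retryAtMs = mutable.Map.empty[String, Long]

  // Called when a fetch response carries a throttle-time hint for a partition.
  def recordHint(partition: String, throttleTimeMs: Long, nowMs: Long): Unit =
    retryAtMs(partition) = nowMs + throttleTimeMs

  // A partition rejoins the fetch request only once its back-off has expired.
  def mayInclude(partition: String, nowMs: Long): Boolean =
    retryAtMs.get(partition).forall(nowMs >= _)
}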
> >> > > >>>>>>> On Thu, Aug 11, 2016 at 9:50 AM, Jun Rao <j...@confluent.io> wrote:
> >> > > >>>>>>
> >> > > >>>>>>> Mayuresh,
> >> > > >>>>>>>
> >> > > >>>>>>> That's a good question. I think if the response size (after leader throttling) is smaller than min.bytes, we will just delay the sending of the response up to max.wait as we do now. This should prevent frequent empty responses to the follower.
> >> > > >>>>>>>
> >> > > >>>>>>> Thanks,
> >> > > >>>>>>>
> >> > > >>>>>>> Jun
> >> > > >>>>>>>
> >> > >>>>>>> On Wed, Aug 10, 2016 at 9:17 PM, Mayuresh Gharat <gharatmayures...@gmail.com> wrote:
> >> > > >>>>>>>
> >> > >>>>>>>> This might have been answered before.
> >> > >>>>>>>> I was wondering: when the leader quota is reached and the leader sends an empty response (if the inclusion of a partition, listed in the leader's throttled-replicas list, causes the LeaderQuotaRate to be exceeded, that partition is omitted from the response, aka returns 0 bytes), at that point the follower quota is NOT reached and the follower is still going to ask for that partition in the next fetch request. Would it be fair to add some logic there so that the follower backs off (for some configurable time) from including those partitions in the next fetch request?
> >> > >>>>>>>>
> >> > >>>>>>>> Thanks,
> >> > >>>>>>>>
> >> > >>>>>>>> Mayuresh
> >> > > >>>>>>>>
> >> > >>>>>>>> On Wed, Aug 10, 2016 at 8:06 AM, Ben Stopford <b...@confluent.io> wrote:
> >> > > >>>>>>>>
> >> > >>>>>>>>> Thanks again for the responses everyone. I've removed the extra fetcher threads from the proposal, switching to the inclusion-based approach. The relevant section is:
> >> > >>>>>>>>>
> >> > >>>>>>>>> The follower makes a request, using the fixed size of replica.fetch.response.max.bytes as per KIP-74 <https://cwiki.apache.org/confluence/display/KAFKA/KIP-74%3A+Add+Fetch+Response+Size+Limit+in+Bytes>. The order of the partitions in the fetch request is randomised to ensure fairness.
> >> > >>>>>>>>> When the leader receives the fetch request it processes the partitions in the defined order, up to the response's size limit. If the inclusion of a partition, listed in the leader's throttled-replicas list, causes the LeaderQuotaRate to be exceeded, that partition is omitted from the response (aka returns 0 bytes). Logically, this is of the form:
> >> > >>>>>>>>> var bytesAllowedForThrottledPartition = quota.recordAndMaybeAdjust(bytesRequestedForPartition)
> >> > >>>>>>>>> When the follower receives the fetch response, if it includes partitions in its throttled-partitions list, it increments the FollowerQuotaRate:
> >> > >>>>>>>>> var includeThrottledPartitionsInNextRequest: Boolean = quota.recordAndEvaluate(previousResponseThrottledBytes)
> >> > >>>>>>>>> If the quota is exceeded, no throttled partitions will be included in the next fetch request emitted by this replica fetcher thread.
> >> > >>>>>>>>>
> >> > >>>>>>>>> B
> >> > > >>>>>>>>>
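To make the two pseudocode lines quoted above a little more concrete, here is a self-contained sketch of the inclusion decision on both sides; the ToyRate class and the method names are illustrative stand-ins for the real quota manager, not the KIP-73 implementation:

// A crude per-window rate tracker standing in for the broker's quota metrics.
class ToyRate(limitBytesPerSec: Double, windowMs: Long = 1000L) {
  private var windowStartMs = 0L
  private var bytesInWindow = 0L

  def record(bytes: Long, nowMs: Long): Unit = {
    if (nowMs - windowStartMs >= windowMs) { windowStartMs = nowMs; bytesInWindow = 0L }
    bytesInWindow += bytes
  }

  // True once the bytes recorded in the current window exceed the window's allowance.
  def exceeded(nowMs: Long): Boolean =
    (nowMs - windowStartMs < windowMs) && bytesInWindow > limitBytesPerSec * (windowMs / 1000.0)
}

class LeaderThrottle(quota: ToyRate, throttledReplicas: Set[String]) {
  // Leader side: a throttled partition is omitted (0 bytes) from the response
  // once the throttled-replication rate is over quota; unthrottled partitions
  // are never affected.
  def bytesAllowed(partition: String, requestedBytes: Long, nowMs: Long): Long =
    if (!throttledReplicas.contains(partition)) requestedBytes
    else if (quota.exceeded(nowMs)) 0L
    else { quota.record(requestedBytes, nowMs); requestedBytes }
}

class FollowerThrottle(quota: ToyRate) {
  // Follower side: record the throttled bytes from the previous response; if
  // the rate is now over quota, leave throttled partitions out of the next
  // fetch request issued by this replica fetcher thread.
  def includeThrottledInNextRequest(previousResponseThrottledBytes: Long, nowMs: Long): Boolean = {
    quota.record(previousResponseThrottledBytes, nowMs)
    !quota.exceeded(nowMs)
  }
}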
> >> > >>>>>>>>>> On 9 Aug 2016, at 23:34, Jun Rao <j...@confluent.io> wrote:
> >> > > >>>>>>>>>>
> >> > >>>>>>>>>> When there are several unthrottled replicas, we could also just do what's suggested in KIP-74. The client is responsible for reordering the partitions and the leader fills in the bytes to those partitions in order, up to the quota limit.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> We could also do what you suggested. If the quota is exceeded, include empty data in the response for throttled replicas. Keep doing that until enough time has passed so that the quota is no longer exceeded. This potentially allows better batching per partition. Not sure if the two make a big difference in practice though.
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> Thanks,
> >> > >>>>>>>>>>
> >> > >>>>>>>>>> Jun
> >> > > >>>>>>>>>>
> >> > > >>>>>>>>>>
> >> > >>>>>>>>>> On Tue, Aug 9, 2016 at 2:31 PM, Joel Koshy <jjkosh...@gmail.com> wrote:
> >> > > >>>>>>>>>>
> >> > > >>>>>>>>>>>>
> >> > > >>>>>>>>>>>>
> >> > > >>>>>>>>>>>>
> >> > >>>>>>>>>>>> On the leader side, one challenge is related to the fairness issue that Ben brought up. The question is what if the fetch response limit is filled up by the throttled replicas? If this happens constantly, we will delay the progress of those un-throttled replicas. However, I think we can address this issue by trying to fill up the unthrottled replicas in the response first. So, the algorithm would be: fill up unthrottled replicas up to the fetch response limit. If there is space left, fill up throttled replicas. If the quota is exceeded for the throttled replicas, reduce the bytes in the throttled replicas in the response accordingly.
> >> > > >>>>>>>>>>>>
> >> > > >>>>>>>>>>>
> >> > >>>>>>>>>>> Right - that's what I was trying to convey by truncation (vs empty). So we would attempt to fill the response for throttled partitions as much as we can before hitting the quota limit. There is one more detail to handle in this: if there are several throttled partitions and not enough remaining allowance in the fetch response to include all the throttled replicas, then we would need to decide which of those partitions get a share; which is why I'm wondering if it is easier to return empty for those partitions entirely in the fetch response - they will make progress in the subsequent fetch. If they don't make fast enough progress then that would be a case for raising the threshold or letting it complete at an off-peak time.
> >> > > >>>>>>>>>>>
> >> > > >>>>>>>>>>>
> >> > > >>>>>>>>>>>>
> >> > >>>>>>>>>>>> With this approach, we need some new logic to handle throttling on the leader, but we can leave the replica threading model unchanged. So, overall, this still seems to be a simpler approach.
> >> > >>>>>>>>>>>>
> >> > >>>>>>>>>>>> Thanks,
> >> > >>>>>>>>>>>>
> >> > >>>>>>>>>>>> Jun
> >> > > >>>>>>>>>>>>
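A sketch of the fill order Jun describes above (unthrottled partitions first, then throttled ones into whatever response budget and quota allowance remain); the types and names are made up for illustration and are not broker code:

// Illustrative fill order for building a fetch response.
case class PartitionData(partition: String, availableBytes: Long, throttled: Boolean)

object FillOrderExample {
  def fill(partitions: Seq[PartitionData],
           responseLimitBytes: Long,
           throttledAllowanceBytes: Long): Map[String, Long] = {
    var remaining = responseLimitBytes
    var allowance = throttledAllowanceBytes
    val result = Map.newBuilder[String, Long]

    val (throttled, unthrottled) = partitions.partition(_.throttled)

    // Pass 1: unthrottled replicas get first claim on the response limit.
    for (p <- unthrottled) {
      val bytes = math.min(p.availableBytes, remaining)
      remaining -= bytes
      result += p.partition -> bytes
    }
    // Pass 2: throttled replicas share what is left, further capped by the quota allowance.
    for (p <- throttled) {
      val bytes = math.min(p.availableBytes, math.min(remaining, allowance))
      remaining -= bytes
      allowance -= bytes
      result += p.partition -> bytes
    }
    result.result()
  }
}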
> >> > >>>>>>>>>>>> On Tue, Aug 9, 2016 at 11:57 AM, Mayuresh Gharat <gharatmayures...@gmail.com> wrote:
> >> > > >>>>>>>>>>>>
> >> > >>>>>>>>>>>>> Nice write-up Ben.
> >> > >>>>>>>>>>>>>
> >> > >>>>>>>>>>>>> I agree with Joel on keeping this simple by excluding the partitions from the fetch request/response when the quota is violated at the follower or leader, instead of having a separate set of threads for handling the quota and non-quota cases. Even though it's different from the current quota implementation it should be OK, since it's internal to the brokers and can be handled by the admins tuning the quota configs for it appropriately.
> >> > >>>>>>>>>>>>>
> >> > >>>>>>>>>>>>> Also, can you elaborate with an example on how this would be handled: *guaranteeing ordering of updates when replicas shift threads*
> >> > >>>>>>>>>>>>>
> >> > >>>>>>>>>>>>> Thanks,
> >> > >>>>>>>>>>>>>
> >> > >>>>>>>>>>>>> Mayuresh
> >> > > >>>>>>>>>>>>>
> >> > > >>>>>>>>>>>>>
> >> > >>>>>>>>>>>>> On Tue, Aug 9, 2016 at 10:49 AM, Joel Koshy <jjkosh...@gmail.com> wrote:
> >> > > >>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>> On the need for both leader/follower throttling: that makes sense - thanks for clarifying. For completeness, can we add this detail to the doc - say, after the quote that I pasted earlier?
> >> > >>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>> From an implementation perspective though: I'm still interested in the simplicity of not having to add separate replica fetchers, a delay queue on the leader, and "move" partitions from the throttled replica fetchers to the regular replica fetchers once caught up.
> >> > >>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>> Instead, I think it would work and be simpler to include or exclude the partitions in the fetch request from the follower and fetch response from the leader when the quota is violated. The issue of fairness that Ben noted may be a wash between the two options (that Ben wrote in his email). With the default quota delay mechanism, partitions get delayed essentially at random - i.e., whoever fetches at the time of quota violation gets delayed at the leader. So we can adopt a similar policy in choosing to truncate partitions in fetch responses. i.e., if at the time of handling the fetch the "effect" replication rate exceeds the quota then either empty or truncate those partitions from the response. (BTW effect replication is your terminology in the wiki - i.e., replication due to partition reassignment, adding brokers, etc.)
> >> > >>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>> While this may be slightly different from the existing quota mechanism I think the difference is small (since we would reuse the quota manager at worst with some refactoring) and it will be internal to the broker.
> >> > >>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>> So I guess the question is whether this alternative is simple enough and equally functional to not go with dedicated throttled replica fetchers.
> >> > >>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>> On Tue, Aug 9, 2016 at 9:44 AM, Jun Rao <j...@confluent.io> wrote:
> >> > > >>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>>> Just to elaborate on what Ben said about why we need throttling on both the leader and the follower side.
> >> > >>>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>>> If we only have throttling on the follower side, consider a case where we add 5 more new brokers and want to move some replicas from existing brokers over to those 5 brokers. Each of those brokers is going to fetch data from all existing brokers. Then, it's possible that the aggregated fetch load from those 5 brokers on a particular existing broker exceeds its outgoing network bandwidth, even though the inbound traffic on each of those 5 brokers is bounded.
> >> > >>>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>>> If we only have throttling on the leader side, consider the same example above. It's possible for the incoming traffic to each of those 5 brokers to exceed its network bandwidth since it is fetching data from all existing brokers.
> >> > >>>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>>> So, being able to set a quota on both the follower and the leader side protects both cases.
> >> > >>>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>>> Thanks,
> >> > >>>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>>> Jun
> >> > > >>>>>>>>>>>>>>>
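Some made-up numbers to illustrate Jun's first case above (all values are assumptions chosen only to show the arithmetic, not anything from the KIP):

// Follower-only quotas bound each new broker's inbound rate, but not any
// one existing leader's outbound rate.
object WhyBothSidesExample extends App {
  val newBrokers = 5
  val perFollowerInboundQuotaMBps = 50L

  // Worst case: most of the replicas being moved happen to lead on a single
  // existing broker, so all five followers pull their full allowance from it.
  val worstCaseLeaderOutboundMBps = newBrokers * perFollowerInboundQuotaMBps
  println(s"One existing leader could serve up to $worstCaseLeaderOutboundMBps MB/s") // 250 MB/s
}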
> >> > >>>>>>>>>>>>>>> On Tue, Aug 9, 2016 at 4:43 AM, Ben Stopford <b...@confluent.io> wrote:
> >> > > >>>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>>>> Hi Joel
> >> > >>>>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>>>> Thanks for taking the time to look at this. Appreciated.
> >> > >>>>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>>>> Regarding throttling on both leader and follower, this proposal covers a more general solution which can guarantee a quota, even when a rebalance operation produces an asymmetric profile of load. This means administrators don't need to calculate the impact that a follower-only quota will have on the leaders they are fetching from. This matters, for example, where replica sizes are skewed or where a partial rebalance is required.
> >> > >>>>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>>>> Having said that, even with both leader and follower quotas, the use of additional threads is actually optional. There appear to be two general approaches: (1) omit partitions from fetch requests (follower) / fetch responses (leader) when they exceed their quota, or (2) delay them, as the existing quota mechanism does, using separate fetchers. Both appear valid, but with slightly different design tradeoffs.
> >> > >>>>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>>>> The issue with approach (1) is that it departs somewhat from the existing quotas implementation, and must include a notion of fairness within the, now size-bounded, request and response. The issue with (2) is guaranteeing ordering of updates when replicas shift threads, but this is handled, for the most part, in the code today.
> >> > >>>>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>>>> I've updated the rejected alternatives section to make this a little clearer.
> >> > >>>>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>>>> B
> >> > > >>>>>>>>>>>>>>>>
> >> > > >>>>>>>>>>>>>>>>
> >> > > >>>>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>>>>> On 8 Aug 2016, at 20:38, Joel Koshy <jjkosh...@gmail.com> wrote:
> >> > > >>>>>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>>>>> Hi Ben,
> >> > >>>>>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>>>>> Thanks for the detailed write-up. So the proposal involves self-throttling on the fetcher side and throttling at the leader. Can you elaborate on the reasoning that is given on the wiki: *"The throttle is applied to both leaders and followers. This allows the admin to exert strong guarantees on the throttle limit."* Is there any reason why one or the other wouldn't be sufficient?
> >> > >>>>>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>>>>> Specifically, if we were to only do self-throttling on the fetchers, we could potentially avoid the additional replica fetchers, right? i.e., the replica fetchers would maintain their quota metrics as you proposed and each (normal) replica fetch presents an opportunity to make progress for the throttled partitions as long as their effective consumption rate is below the quota limit. If it exceeds the consumption rate then don't include the throttled partitions in the subsequent fetch requests until the effective consumption rate for those partitions returns to within the quota threshold.
> >> > >>>>>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>>>>> I have more questions on the proposal, but was more interested in the above to see if it could simplify things a bit.
> >> > >>>>>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>>>>> Also, can you open up access to the google-doc that you link to?
> >> > >>>>>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>>>>> Thanks,
> >> > >>>>>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>>>>> Joel
> >> > > >>>>>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>>>>> On Mon, Aug 8, 2016 at 5:54 AM, Ben Stopford <b...@confluent.io> wrote:
> >> > > >>>>>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>>>>>> We've created KIP-73: Replication Quotas
> >> > >>>>>>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>>>>>> The idea is to allow an admin to throttle moving replicas. Full details are here:
> >> > >>>>>>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-73+Replication+Quotas
> >> > >>>>>>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>>>>>> Please take a look and let us know your thoughts.
> >> > >>>>>>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>>>>>> Thanks
> >> > >>>>>>>>>>>>>>>>>>
> >> > >>>>>>>>>>>>>>>>>> B
> >> > > >>>>>>>>>>>>>>>>>>
> >> > > >>>>>>>>>>>>>>>>>>
> >> > > >>>>>>>>>>>>>>>>
> >> > > >>>>>>>>>>>>>>>>
> >> > > >>>>>>>>>>>>>>>
> >> > > >>>>>>>>>>>>>>
> >> > > >>>>>>>>>>>>>
> >> > > >>>>>>>>>>>>>
> >> > > >>>>>>>>>>>>>
> >> > > >>>>>>>>>>>>> --
> >> > > >>>>>>>>>>>>> -Regards,
> >> > > >>>>>>>>>>>>> Mayuresh R. Gharat
> >> > > >>>>>>>>>>>>> (862) 250-7125
> >> > > >>>>>>>>>>>>>
> >> > > >>>>>>>>>>>>
> >> > > >>>>>>>>>>>
> >> > > >>>>>>>>>
> >> > > >>>>>>>>>
> >> > > >>>>>>>>
> >> > > >>>>>>>>
> >> > > >>>>>>>> --
> >> > > >>>>>>>> -Regards,
> >> > > >>>>>>>> Mayuresh R. Gharat
> >> > > >>>>>>>> (862) 250-7125
> >> > > >>>>>>>>
> >> > > >>>>>>>
> >> > > >>>>>>
> >> > > >>>>>
> >> > > >>>>
> >> > > >>>>
> >> > > >>>>
> >> > > >>>> --
> >> > > >>>> -Regards,
> >> > > >>>> Mayuresh R. Gharat
> >> > > >>>> (862) 250-7125
> >> > > >>>>
> >> > > >>>
> >> > > >>
> >> > > >>
> >> > > >> --
> >> > > >> Ben Stopford
> >> > > >>
> >> > >
> >> > >
> >> >
> >>
>
>
>
> --
> Gwen Shapira
> Product Manager | Confluent
> 650.450.2760 | @gwenshap
> Follow us: Twitter | blog
>



-- 
*Todd Palino*
Staff Site Reliability Engineer
Data Infrastructure Streaming



linkedin.com/in/toddpalino
