Yeah, I’m good with where we are right now on this KIP. It’s a workable
solution that we can add tooling to support. I would prefer to have soft
quotas, but given that they are more complex to implement, we can go with
hard quotas for the time being and consider them as an improvement later.

-Todd


On Thu, Aug 18, 2016 at 11:13 AM, Jun Rao <j...@confluent.io> wrote:

> Todd,
>
> Thanks for the detailed reply. So, it sounds like you are ok with the
> current proposal in the KIP for now and we can brainstorm on more automated
> stuff separately? Are you comfortable with starting the vote on the current
> proposal?
>
> Jun
>
> On Thu, Aug 18, 2016 at 11:00 AM, Todd Palino <tpal...@gmail.com> wrote:
>
> > Joel just reminded me to take another look at this one :) So first off,
> > this is great. It’s something that we definitely need to have, especially
> > as we get into the realm of moving partitions around more often.
> >
> > I do prefer to have the cluster handle this automatically. What I envision
> > is a single configuration for “bootstrap replication quota” that is applied
> > when we have a replica that is in this situation. There are two legitimate
> > cases that I’m aware of right now:
> > 1 - We are moving a partition to a new replica. We know about this (at
> > least the controller does), so we should be able to apply the quota without
> > too much trouble here.
> > 2 - We have a broker that lost its disk and has to recover the partition
> > from the cluster. Harder to detect, but in this case, I’m not sure I even
> > want to throttle it because this is recovery activity.
> >
> > The problem with this becomes the question of “why”. Why are you moving a
> > partition? Are you doing it because you want to balance traffic? Or are you
> > doing it because you lost a piece of hardware and you need to get the RF
> > for the partition back up to the desired level? As an admin, these have
> > different priorities. I may be perfectly fine with having the replication
> > traffic saturate the cluster in the latter case, because reliability and
> > availability are more important than performance.
> >
> > Given the complexity of trying to determine intent, I’m going to agree with
> > implementing a manual procedure for now. We definitely need to have a
> > discussion about automating it as much as possible, but I think it’s part
> > of a larger conversation about how much automation should be built into the
> > broker itself, and how much should be part of a bolt-on “cluster manager”.
> > I’m not sure putting all that complexity into the broker is the right
> > choice.
> >
> > I do agree with Joel here that while a hard quota is typically better from
> > a client point of view, in the case of replication traffic a soft quota is
> > appropriate, and desirable. Probably a combination of both, as I think we
> > still want a hard limit that stops short of saturating the entire cluster
> > with replication traffic.
> >
> > -Todd
> >
> >
> > On Thu, Aug 18, 2016 at 10:21 AM, Joel Koshy <jjkosh...@gmail.com>
> wrote:
> >
> > > > For your first comment. We thought about determining "effect" replicas
> > > > automatically as well. First, there is some tricky stuff that one has to
> > > >
> > >
> > > Auto-detection of effect traffic: I'm fairly certain it's doable but
> > > definitely tricky. I'm also not sure it is something worth tackling at the
> > > outset. If we want to spend more time thinking over it, even if it's just
> > > an academic exercise, I would be happy to brainstorm offline.
> > >
> > >
> > > > For your second comment, we discussed that in the client quotas design.
> > > > A down side of that for client quotas is that a client may be surprised
> > > > that its traffic is not throttled at one time, but throttled at another
> > > > with the same quota (basically, less predictability). You can imagine
> > > > setting a quota for all replication traffic and only slow down the
> > > > "effect" replicas if needed. The thought is more or less the same as the
> > > > above. It requires more
> > > >
> > >
> > > For clients, this is true. I think this is much less of an issue for
> > > server-side replication since the "users" here are the Kafka SREs who
> > > generally know these internal details.
> > >
> > > I think it would be valuable to get some feedback from SREs on the
> > > proposal before proceeding to a vote. (ping Todd)
> > >
> > > Joel
> > >
> > >
> > > >
> > > > On Thu, Aug 18, 2016 at 9:37 AM, Ben Stopford <b...@confluent.io>
> > wrote:
> > > >
> > > > > Hi Joel
> > > > >
> > > > > Ha! Yes, we had some similar thoughts, on both counts. Both are
> > > > > actually good approaches, but come with some extra complexity.
> > > > >
> > > > > Segregating the replication type is tempting as it creates a more
> > > > > general solution. One issue is you need to draw a line between lagging
> > > > > and not lagging. The ISR ‘limit' is a tempting divider, but has the
> > > > > side effect that, once you drop out, you get immediately throttled.
> > > > > Adding a configurable divider is another option, but difficult for
> > > > > admins to set, and always a little arbitrary. A better idea is to
> > > > > prioritise, in reverse order to lag. But that also comes with
> > > > > additional complexity of its own.
> > > > >
> > > > > Under-throttling is also a tempting addition. That’s to say, if
> > > > > there’s idle bandwidth lying around, not being used, why not use it to
> > > > > let lagging brokers catch up. This involves some comparison to the
> > > > > maximum bandwidth, which could be configurable, or could be derived,
> > > > > with pros and cons for each.
> > > > >
> > > > > But the more general problem is actually quite hard to reason about,
> > > > > so after some discussion we decided to settle on something simple,
> > > > > that we felt we could get working, and extend to add these additional
> > > > > features as subsequent KIPs.
> > > > >
> > > > > I hope that seems reasonable. Jun may wish to add to this.
> > > > >
> > > > > B
> > > > >
> > > > >
> > > > > > On 18 Aug 2016, at 06:56, Joel Koshy <jjkosh...@gmail.com>
> wrote:
> > > > > >
> > > > > > On Wed, Aug 17, 2016 at 9:13 PM, Ben Stopford <b...@confluent.io>
> > > > wrote:
> > > > > >
> > > > > >>
> > > > > >> Let's us know if you have any further thoughts on KIP-73, else
> > we'll
> > > > > kick
> > > > > >> off a vote.
> > > > > >>
> > > > > >
> > > > > > I think the mechanism for throttling replicas looks good. Just
> had
> > a
> > > > few
> > > > > > more thoughts on the configuration section. What you have looks
> > > > > reasonable,
> > > > > > but I was wondering if it could be made simpler. You probably
> > thought
> > > > > > through these, so I'm curious to know your take.
> > > > > >
> > > > > > My guess is that most of the time, users would want to throttle
> all
> > > > > effect
> > > > > > replication - due to partition reassignments, adding brokers or a
> > > > broker
> > > > > > coming back online after an extended period of time. In all these
> > > > > scenarios
> > > > > > it may be possible to distinguish bootstrap (effect) vs normal
> > > > > replication
> > > > > > - based on how far the replica has to catch up. I'm wondering if
> it
> > > is
> > > > > > enough to just set an umbrella "effect" replication quota with
> > > perhaps
> > > > > > per-topic overrides (say if some topics are more important than
> > > others)
> > > > > as
> > > > > > opposed to designating throttled replicas.
> > > > > >
> > > > > > Also, IIRC during client-side quota discussions we had considered
> > the
> > > > > > possibility of allowing clients to go above their quotas when
> > > resources
> > > > > are
> > > > > > available. We ended up not doing that, but for replication
> > throttling
> > > > it
> > > > > > may make sense - i.e., to treat the quota as a soft limit.
> Another
> > > way
> > > > to
> > > > > > look at it is instead of ensuring "effect replication traffic
> does
> > > not
> > > > > flow
> > > > > > faster than X bytes/sec" it may be useful to instead ensure that
> > > > "effect
> > > > > > replication traffic only flows as slowly as necessary (so as not
> to
> > > > > > adversely affect normal replication traffic)."
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Joel
> > > > > >
> > > > > >>>
> > > > > >>>> On Thu, Aug 11, 2016 at 2:43 PM, Jun Rao <j...@confluent.io
> > > > > >>>> wrote:
> > > > > >>>>
> > > > > >>>>> Hi, Joel,
> > > > > >>>>>
> > > > > >>>>> Yes, the response size includes both throttled and unthrottled
> > > > > >>>>> replicas. However, the response is only delayed up to max.wait
> > > > > >>>>> if the response size is less than min.bytes, which matches the
> > > > > >>>>> current behavior. So, there is no extra delay due to throttling,
> > > > > >>>>> right? For replica fetchers, the default min.bytes is 1. So, the
> > > > > >>>>> response is only delayed if there is no byte in the response,
> > > > > >>>>> which is what we want.
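> > > > > >>>>>
> > > > > >>>>> (For illustration only, a minimal Scala sketch of the delay
> > > > > >>>>> decision described above; this is not the broker's actual code
> > > > > >>>>> and the names are invented.)
> > > > > >>>>>
> > > > > >>>>> object FetchDelaySketch {
> > > > > >>>>>   // Sum the bytes the leader intends to return; throttled partitions
> > > > > >>>>>   // that were omitted contribute 0 bytes. The response is parked in
> > > > > >>>>>   // purgatory for up to max.wait ms only if the total stays below
> > > > > >>>>>   // min.bytes (default 1 for replica fetchers), so throttling alone
> > > > > >>>>>   // adds no extra delay unless the response would carry no bytes.
> > > > > >>>>>   def shouldDelayResponse(bytesPerPartition: Seq[Long],
> > > > > >>>>>                           minBytes: Long = 1L): Boolean =
> > > > > >>>>>     bytesPerPartition.sum < minBytes
> > > > > >>>>> }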
> > > > > >>>>>
> > > > > >>>>> Thanks,
> > > > > >>>>>
> > > > > >>>>> Jun
> > > > > >>>>>
> > > > > >>>>> On Thu, Aug 11, 2016 at 11:53 AM, Joel Koshy <
> > > jjkosh...@gmail.com
> > > > > >>>>
> > > > > >>>> wrote:
> > > > > >>>>>
> > > > > >>>>>> Hi Jun,
> > > > > >>>>>>
> > > > > >>>>>> I'm not sure that would work unless we have separate replica
> > > > > >>>>>> fetchers, since this would cause all replicas (including ones
> > > > > >>>>>> that are not throttled) to get delayed. Instead, we could just
> > > > > >>>>>> have the leader populate the throttle-time field of the response
> > > > > >>>>>> as a hint to the follower as to how long it should wait before
> > > > > >>>>>> it adds those replicas back to its subsequent replica fetch
> > > > > >>>>>> requests.
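> > > > > >>>>>>
> > > > > >>>>>> (Purely illustrative, and not what the KIP adopted: a rough Scala
> > > > > >>>>>> sketch of how such a throttle-time hint could be used on the
> > > > > >>>>>> follower side. All names here are invented.)
> > > > > >>>>>>
> > > > > >>>>>> object ThrottleHintSketch {
> > > > > >>>>>>   // The leader would set throttleTimeMs in the fetch response; the
> > > > > >>>>>>   // follower then leaves throttled partitions out of its fetch
> > > > > >>>>>>   // requests until the hint has expired.
> > > > > >>>>>>   def partitionsForNextFetch(allPartitions: Seq[String],
> > > > > >>>>>>                              throttled: Set[String],
> > > > > >>>>>>                              throttleTimeMs: Long,
> > > > > >>>>>>                              lastResponseAtMs: Long,
> > > > > >>>>>>                              nowMs: Long): Seq[String] =
> > > > > >>>>>>     if (nowMs - lastResponseAtMs >= throttleTimeMs) allPartitions
> > > > > >>>>>>     else allPartitions.filterNot(throttled.contains)
> > > > > >>>>>> }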
> > > > > >>>>>>
> > > > > >>>>>> Thanks,
> > > > > >>>>>>
> > > > > >>>>>> Joel
> > > > > >>>>>>
> > > > > >>>>>> On Thu, Aug 11, 2016 at 9:50 AM, Jun Rao <j...@confluent.io
> > > > > >>>> wrote:
> > > > > >>>>>>
> > > > > >>>>>>> Mayuresh,
> > > > > >>>>>>>
> > > > > >>>>>>> That's a good question. I think if the response size (after
> > > > > >> leader
> > > > > >>>>>>> throttling) is smaller than min.bytes, we will just delay
> the
> > > > > >>> sending
> > > > > >>>>> of
> > > > > >>>>>>> the response up to max.wait as we do now. This should
> prevent
> > > > > >>>> frequent
> > > > > >>>>>>> empty responses to the follower.
> > > > > >>>>>>>
> > > > > >>>>>>> Thanks,
> > > > > >>>>>>>
> > > > > >>>>>>> Jun
> > > > > >>>>>>>
> > > > > >>>>>>> On Wed, Aug 10, 2016 at 9:17 PM, Mayuresh Gharat <
> > > > > >>>>>>> gharatmayures...@gmail.com
> > > > > >>>>>>>> wrote:
> > > > > >>>>>>>
> > > > > >>>>>>>> This might have been answered before.
> > > > > >>>>>>>> I was wondering when the leader quota is reached and it
> > sends
> > > > > >>> empty
> > > > > >>>>>>>> response ( If the inclusion of a partition, listed in the
> > > > > >>> leader's
> > > > > >>>>>>>> throttled-replicas list, causes the LeaderQuotaRate to be
> > > > > >>> exceeded,
> > > > > >>>>>> that
> > > > > >>>>>>>> partition is omitted from the response (aka returns 0
> > > bytes).).
> > > > > >>> At
> > > > > >>>>> this
> > > > > >>>>>>>> point the follower quota is NOT reached and the follower
> is
> > > > > >> still
> > > > > >>>>> going
> > > > > >>>>>>> to
> > > > > >>>>>>>> ask for the that partition in the next fetch request.
> Would
> > it
> > > > > >> be
> > > > > >>>>> fair
> > > > > >>>>>> to
> > > > > >>>>>>>> add some logic there so that the follower backs off ( for
> > some
> > > > > >>>>>>> configurable
> > > > > >>>>>>>> time) from including those partitions in the next fetch
> > > > > >> request?
> > > > > >>>>>>>>
> > > > > >>>>>>>> Thanks,
> > > > > >>>>>>>>
> > > > > >>>>>>>> Mayuresh
> > > > > >>>>>>>>
> > > > > >>>>>>>> On Wed, Aug 10, 2016 at 8:06 AM, Ben Stopford <
> > > > > >> b...@confluent.io
> > > > > >>>>
> > > > > >>>>>> wrote:
> > > > > >>>>>>>>
> > > > > >>>>>>>>> Thanks again for the responses everyone. I’ve removed the
> > > > > >>>>>>>>> extra fetcher threads from the proposal, switching to the
> > > > > >>>>>>>>> inclusion-based approach. The relevant section is:
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> The follower makes a request, using the fixed size of
> > > > > >>>>>>>>> replica.fetch.response.max.bytes as per KIP-74
> > > > > >>>>>>>>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-74%3A+Add+Fetch+Response+Size+Limit+in+Bytes>.
> > > > > >>>>>>>>> The order of the partitions in the fetch request is randomised
> > > > > >>>>>>>>> to ensure fairness.
> > > > > >>>>>>>>> When the leader receives the fetch request it processes the
> > > > > >>>>>>>>> partitions in the defined order, up to the response's size
> > > > > >>>>>>>>> limit. If the inclusion of a partition, listed in the leader's
> > > > > >>>>>>>>> throttled-replicas list, causes the LeaderQuotaRate to be
> > > > > >>>>>>>>> exceeded, that partition is omitted from the response (aka
> > > > > >>>>>>>>> returns 0 bytes). Logically, this is of the form:
> > > > > >>>>>>>>> var bytesAllowedForThrottledPartition =
> > > > > >>>>>>>>>   quota.recordAndMaybeAdjust(bytesRequestedForPartition)
> > > > > >>>>>>>>> When the follower receives the fetch response, if it includes
> > > > > >>>>>>>>> partitions in its throttled-partitions list, it increments the
> > > > > >>>>>>>>> FollowerQuotaRate:
> > > > > >>>>>>>>> var includeThrottledPartitionsInNextRequest: Boolean =
> > > > > >>>>>>>>>   quota.recordAndEvaluate(previousResponseThrottledBytes)
> > > > > >>>>>>>>> If the quota is exceeded, no throttled partitions will be
> > > > > >>>>>>>>> included in the next fetch request emitted by this replica
> > > > > >>>>>>>>> fetcher thread.
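> > > > > >>>>>>>>>
> > > > > >>>>>>>>> (For illustration, a rough, self-contained Scala sketch of the
> > > > > >>>>>>>>> leader/follower logic described above. The quota class and
> > > > > >>>>>>>>> method names are invented and are not the KIP's actual classes.)
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> object ReplicationThrottleSketch {
> > > > > >>>>>>>>>   // A crude one-second-window rate tracker standing in for the real quota.
> > > > > >>>>>>>>>   final class Quota(bytesPerSec: Long) {
> > > > > >>>>>>>>>     private var windowStartMs = 0L
> > > > > >>>>>>>>>     private var bytesInWindow = 0L
> > > > > >>>>>>>>>     private def roll(nowMs: Long): Unit =
> > > > > >>>>>>>>>       if (nowMs - windowStartMs >= 1000L) { windowStartMs = nowMs; bytesInWindow = 0L }
> > > > > >>>>>>>>>     def wouldExceed(bytes: Long, nowMs: Long): Boolean = {
> > > > > >>>>>>>>>       roll(nowMs); bytesInWindow + bytes > bytesPerSec
> > > > > >>>>>>>>>     }
> > > > > >>>>>>>>>     def record(bytes: Long, nowMs: Long): Unit = {
> > > > > >>>>>>>>>       roll(nowMs); bytesInWindow += bytes
> > > > > >>>>>>>>>     }
> > > > > >>>>>>>>>   }
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>   // Leader side: a partition in the throttled-replicas list is
> > > > > >>>>>>>>>   // omitted (0 bytes) once including it would exceed the rate limit.
> > > > > >>>>>>>>>   def bytesAllowed(quota: Quota, throttled: Boolean, requested: Long, nowMs: Long): Long =
> > > > > >>>>>>>>>     if (!throttled) requested
> > > > > >>>>>>>>>     else if (quota.wouldExceed(requested, nowMs)) 0L
> > > > > >>>>>>>>>     else { quota.record(requested, nowMs); requested }
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>   // Follower side: record the throttled bytes just received and decide
> > > > > >>>>>>>>>   // whether the next fetch request may include throttled partitions.
> > > > > >>>>>>>>>   def includeThrottledInNextRequest(quota: Quota, throttledBytes: Long, nowMs: Long): Boolean = {
> > > > > >>>>>>>>>     quota.record(throttledBytes, nowMs)
> > > > > >>>>>>>>>     !quota.wouldExceed(0L, nowMs)
> > > > > >>>>>>>>>   }
> > > > > >>>>>>>>> }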
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> B
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>> On 9 Aug 2016, at 23:34, Jun Rao <j...@confluent.io
> > > > > >>>> wrote:
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> When there are several unthrottled replicas, we could
> also
> > > > > >>> just
> > > > > >>>>> do
> > > > > >>>>>>>> what's
> > > > > >>>>>>>>>> suggested in KIP-74. The client is responsible for
> > > > > >> reordering
> > > > > >>>> the
> > > > > >>>>>>>>>> partitions and the leader fills in the bytes to those
> > > > > >>>> partitions
> > > > > >>>>> in
> > > > > >>>>>>>>> order,
> > > > > >>>>>>>>>> up to the quota limit.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> We could also do what you suggested. If quota is
> exceeded,
> > > > > >>>>> include
> > > > > >>>>>>>> empty
> > > > > >>>>>>>>>> data in the response for throttled replicas. Keep doing
> > > > > >> that
> > > > > >>>>> until
> > > > > >>>>>>>> enough
> > > > > >>>>>>>>>> time has passed so that the quota is no longer exceeded.
> > > > > >> This
> > > > > >>>>>>>> potentially
> > > > > >>>>>>>>>> allows better batching per partition. Not sure if the
> two
> > > > > >>>> makes a
> > > > > >>>>>> big
> > > > > >>>>>>>>>> difference in practice though.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> Thanks,
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> Jun
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> On Tue, Aug 9, 2016 at 2:31 PM, Joel Koshy <
> > > > > >>>> jjkosh...@gmail.com>
> > > > > >>>>>>>> wrote:
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> On the leader side, one challenge is related to the fairness
> > > > > >>>>>>>>>>>> issue that Ben brought up. The question is what if the fetch
> > > > > >>>>>>>>>>>> response limit is filled up by the throttled replicas? If
> > > > > >>>>>>>>>>>> this happens constantly, we will delay the progress of those
> > > > > >>>>>>>>>>>> un-throttled replicas. However, I think we can address this
> > > > > >>>>>>>>>>>> issue by trying to fill up the unthrottled replicas in the
> > > > > >>>>>>>>>>>> response first. So, the algorithm would be: fill up
> > > > > >>>>>>>>>>>> unthrottled replicas up to the fetch response limit. If
> > > > > >>>>>>>>>>>> there is space left, fill up throttled replicas. If the
> > > > > >>>>>>>>>>>> quota is exceeded for the throttled replicas, reduce the
> > > > > >>>>>>>>>>>> bytes in the throttled replicas in the response accordingly.
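> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> (For illustration, a rough Scala sketch of the fill order
> > > > > >>>>>>>>>>> described above; this is not code from the KIP and the names
> > > > > >>>>>>>>>>> are invented.)
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> object FillOrderSketch {
> > > > > >>>>>>>>>>>   final case class PartitionData(partition: String, availableBytes: Long, throttled: Boolean)
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>>   // Unthrottled replicas are served first, up to the response size
> > > > > >>>>>>>>>>>   // limit; throttled replicas then take whatever space remains,
> > > > > >>>>>>>>>>>   // further capped by the remaining quota allowance.
> > > > > >>>>>>>>>>>   def fillResponse(partitions: Seq[PartitionData],
> > > > > >>>>>>>>>>>                    responseLimit: Long,
> > > > > >>>>>>>>>>>                    throttleAllowance: Long): Seq[(String, Long)] = {
> > > > > >>>>>>>>>>>     var remaining = responseLimit
> > > > > >>>>>>>>>>>     var allowance = throttleAllowance
> > > > > >>>>>>>>>>>     val (throttled, unthrottled) = partitions.partition(_.throttled)
> > > > > >>>>>>>>>>>     (unthrottled ++ throttled).map { p =>
> > > > > >>>>>>>>>>>       val cap = if (p.throttled) math.min(remaining, allowance) else remaining
> > > > > >>>>>>>>>>>       val bytes = math.max(0L, math.min(p.availableBytes, cap))
> > > > > >>>>>>>>>>>       remaining -= bytes
> > > > > >>>>>>>>>>>       if (p.throttled) allowance -= bytes
> > > > > >>>>>>>>>>>       p.partition -> bytes
> > > > > >>>>>>>>>>>     }
> > > > > >>>>>>>>>>>   }
> > > > > >>>>>>>>>>> }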
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>> Right - that's what I was trying to convey by
> truncation
> > > > > >> (vs
> > > > > >>>>>> empty).
> > > > > >>>>>>>> So
> > > > > >>>>>>>>> we
> > > > > >>>>>>>>>>> would attempt to fill the response for throttled
> > > > > >> partitions
> > > > > >>> as
> > > > > >>>>>> much
> > > > > >>>>>>> as
> > > > > >>>>>>>>> we
> > > > > >>>>>>>>>>> can before hitting the quota limit. There is one more
> > > > > >> detail
> > > > > >>>> to
> > > > > >>>>>>> handle
> > > > > >>>>>>>>> in
> > > > > >>>>>>>>>>> this: if there are several throttled partitions and not
> > > > > >>> enough
> > > > > >>>>>>>> remaining
> > > > > >>>>>>>>>>> allowance in the fetch response to include all the
> > > > > >> throttled
> > > > > >>>>>>> replicas
> > > > > >>>>>>>>> then
> > > > > >>>>>>>>>>> we would need to decide which of those partitions get a
> > > > > >>> share;
> > > > > >>>>>> which
> > > > > >>>>>>>> is
> > > > > >>>>>>>>> why
> > > > > >>>>>>>>>>> I'm wondering if it is easier to return empty for those
> > > > > >>>>> partitions
> > > > > >>>>>>>>> entirely
> > > > > >>>>>>>>>>> in the fetch response - they will make progress in the
> > > > > >>>>> subsequent
> > > > > >>>>>>>>> fetch. If
> > > > > >>>>>>>>>>> they don't make fast enough progress then that would
> be a
> > > > > >>> case
> > > > > >>>>> for
> > > > > >>>>>>>>> raising
> > > > > >>>>>>>>>>> the threshold or letting it complete at an off-peak
> time.
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> With this approach, we need some new logic to handle
> > > > > >>>> throttling
> > > > > >>>>>> on
> > > > > >>>>>>>> the
> > > > > >>>>>>>>>>>> leader, but we can leave the replica threading model
> > > > > >>>> unchanged.
> > > > > >>>>>> So,
> > > > > >>>>>>>>>>>> overall, this still seems to be a simpler approach.
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> Thanks,
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> Jun
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>> On Tue, Aug 9, 2016 at 11:57 AM, Mayuresh Gharat <
> > > > > >>>>>>>>>>>> gharatmayures...@gmail.com
> > > > > >>>>>>>>>>>>> wrote:
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> Nice write up Ben.
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> I agree with Joel for keeping this simple by
> excluding
> > > > > >> the
> > > > > >>>>>>>> partitions
> > > > > >>>>>>>>>>>> from
> > > > > >>>>>>>>>>>>> the fetch request/response when the quota is violated
> > at
> > > > > >>> the
> > > > > >>>>>>>> follower
> > > > > >>>>>>>>>>> or
> > > > > >>>>>>>>>>>>> leader instead of having a separate set of threads
> for
> > > > > >>>>> handling
> > > > > >>>>>>> the
> > > > > >>>>>>>>>>> quota
> > > > > >>>>>>>>>>>>> and non quota cases. Even though its different from
> the
> > > > > >>>>> current
> > > > > >>>>>>>> quota
> > > > > >>>>>>>>>>>>> implementation it should be OK since its internal to
> > > > > >>> brokers
> > > > > >>>>> and
> > > > > >>>>>>> can
> > > > > >>>>>>>>> be
> > > > > >>>>>>>>>>>>> handled by tuning the quota configs for it
> > appropriately
> > > > > >>> by
> > > > > >>>>> the
> > > > > >>>>>>>>> admins.
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> Also can you elaborate with an example how this would
> > be
> > > > > >>>>>> handled :
> > > > > >>>>>>>>>>>>> *guaranteeing
> > > > > >>>>>>>>>>>>> ordering of updates when replicas shift threads*
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> Thanks,
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> Mayuresh
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> On Tue, Aug 9, 2016 at 10:49 AM, Joel Koshy <
> > > > > >>>>>> jjkosh...@gmail.com>
> > > > > >>>>>>>>>>> wrote:
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> On the need for both leader/follower throttling:
> that
> > > > > >>> makes
> > > > > >>>>>>> sense -
> > > > > >>>>>>>>>>>>> thanks
> > > > > >>>>>>>>>>>>>> for clarifying. For completeness, can we add this
> > > > > >> detail
> > > > > >>> to
> > > > > >>>>> the
> > > > > >>>>>>>> doc -
> > > > > >>>>>>>>>>>>> say,
> > > > > >>>>>>>>>>>>>> after the quote that I pasted earlier?
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> From an implementation perspective though: I’m still
> > > > > >>>>> interested
> > > > > >>>>>>> in
> > > > > >>>>>>>>>>> the
> > > > > >>>>>>>>>>>>>> simplicity of not having to add separate replica
> > > > > >>> fetchers,
> > > > > >>>>>> delay
> > > > > >>>>>>>>>>> queue
> > > > > >>>>>>>>>>>> on
> > > > > >>>>>>>>>>>>>> the leader, and “move” partitions from the throttled
> > > > > >>>> replica
> > > > > >>>>>>>> fetchers
> > > > > >>>>>>>>>>>> to
> > > > > >>>>>>>>>>>>>> the regular replica fetchers once caught up.
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> Instead, I think it would work and be simpler to
> > > > > >> include
> > > > > >>> or
> > > > > >>>>>>> exclude
> > > > > >>>>>>>>>>> the
> > > > > >>>>>>>>>>>>>> partitions in the fetch request from the follower
> and
> > > > > >>> fetch
> > > > > >>>>>>>> response
> > > > > >>>>>>>>>>>> from
> > > > > >>>>>>>>>>>>>> the leader when the quota is violated. The issue of
> > > > > >>>> fairness
> > > > > >>>>>> that
> > > > > >>>>>>>> Ben
> > > > > >>>>>>>>>>>>> noted
> > > > > >>>>>>>>>>>>>> may be a wash between the two options (that Ben
> wrote
> > > > > >> in
> > > > > >>>> his
> > > > > >>>>>>>> email).
> > > > > >>>>>>>>>>>> With
> > > > > >>>>>>>>>>>>>> the default quota delay mechanism, partitions get
> > > > > >> delayed
> > > > > >>>>>>>> essentially
> > > > > >>>>>>>>>>>> at
> > > > > >>>>>>>>>>>>>> random - i.e., whoever fetches at the time of quota
> > > > > >>>> violation
> > > > > >>>>>>> gets
> > > > > >>>>>>>>>>>>> delayed
> > > > > >>>>>>>>>>>>>> at the leader. So we can adopt a similar policy in
> > > > > >>> choosing
> > > > > >>>>> to
> > > > > >>>>>>>>>>> truncate
> > > > > >>>>>>>>>>>>>> partitions in fetch responses. i.e., if at the time
> of
> > > > > >>>>> handling
> > > > > >>>>>>> the
> > > > > >>>>>>>>>>>> fetch
> > > > > >>>>>>>>>>>>>> the “effect” replication rate exceeds the quota then
> > > > > >>> either
> > > > > >>>>>> empty
> > > > > >>>>>>>> or
> > > > > >>>>>>>>>>>>>> truncate those partitions from the response. (BTW
> > > > > >> effect
> > > > > >>>>>>>> replication
> > > > > >>>>>>>>>>> is
> > > > > >>>>>>>>>>>>>> your terminology in the wiki - i.e., replication due
> > to
> > > > > >>>>>> partition
> > > > > >>>>>>>>>>>>>> reassignment, adding brokers, etc.)
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> While this may be slightly different from the
> existing
> > > > > >>>> quota
> > > > > >>>>>>>>>>> mechanism
> > > > > >>>>>>>>>>>> I
> > > > > >>>>>>>>>>>>>> think the difference is small (since we would reuse
> > the
> > > > > >>>> quota
> > > > > >>>>>>>> manager
> > > > > >>>>>>>>>>>> at
> > > > > >>>>>>>>>>>>>> worst with some refactoring) and will be internal to
> > > > > >> the
> > > > > >>>>>> broker.
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> So I guess the question is if this alternative is
> > > > > >> simpler
> > > > > >>>>>> enough
> > > > > >>>>>>>> and
> > > > > >>>>>>>>>>>>>> equally functional to not go with dedicated
> throttled
> > > > > >>>> replica
> > > > > >>>>>>>>>>> fetchers.
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> On Tue, Aug 9, 2016 at 9:44 AM, Jun Rao <
> > > > > >>> j...@confluent.io>
> > > > > >>>>>>> wrote:
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> Just to elaborate on what Ben said why we need
> > > > > >>> throttling
> > > > > >>>> on
> > > > > >>>>>>> both
> > > > > >>>>>>>>>>> the
> > > > > >>>>>>>>>>>>>>> leader and the follower side.
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> If we only have throttling on the follower side,
> > > > > >>> consider
> > > > > >>>> a
> > > > > >>>>>> case
> > > > > >>>>>>>>>>> that
> > > > > >>>>>>>>>>>>> we
> > > > > >>>>>>>>>>>>>>> add 5 more new brokers and want to move some
> replicas
> > > > > >>> from
> > > > > >>>>>>>> existing
> > > > > >>>>>>>>>>>>>> brokers
> > > > > >>>>>>>>>>>>>>> over to those 5 brokers. Each of those broker is
> > going
> > > > > >>> to
> > > > > >>>>>> fetch
> > > > > >>>>>>>>>>> data
> > > > > >>>>>>>>>>>>> from
> > > > > >>>>>>>>>>>>>>> all existing brokers. Then, it's possible that the
> > > > > >>>>> aggregated
> > > > > >>>>>>>> fetch
> > > > > >>>>>>>>>>>>> load
> > > > > >>>>>>>>>>>>>>> from those 5 brokers on a particular existing
> broker
> > > > > >>>> exceeds
> > > > > >>>>>> its
> > > > > >>>>>>>>>>>>> outgoing
> > > > > >>>>>>>>>>>>>>> network bandwidth, even though the inbounding
> traffic
> > > > > >> on
> > > > > >>>>> each
> > > > > >>>>>> of
> > > > > >>>>>>>>>>>> those
> > > > > >>>>>>>>>>>>> 5
> > > > > >>>>>>>>>>>>>>> brokers is bounded.
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> If we only have throttling on the leader side,
> > > > > >> consider
> > > > > >>>> the
> > > > > >>>>>> same
> > > > > >>>>>>>>>>>>> example
> > > > > >>>>>>>>>>>>>>> above. It's possible for the incoming traffic to
> each
> > > > > >> of
> > > > > >>>>>> those 5
> > > > > >>>>>>>>>>>>> brokers
> > > > > >>>>>>>>>>>>>> to
> > > > > >>>>>>>>>>>>>>> exceed its network bandwidth since it is fetching
> > data
> > > > > >>>> from
> > > > > >>>>>> all
> > > > > >>>>>>>>>>>>> existing
> > > > > >>>>>>>>>>>>>>> brokers.
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> So, being able to set a quota on both the follower
> > and
> > > > > >>> the
> > > > > >>>>>>> leader
> > > > > >>>>>>>>>>>> side
> > > > > >>>>>>>>>>>>>>> protects both cases.
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> Thanks,
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> Jun
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>> On Tue, Aug 9, 2016 at 4:43 AM, Ben Stopford <
> > > > > >>>>>> b...@confluent.io>
> > > > > >>>>>>>>>>>> wrote:
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> Hi Joel
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> Thanks for taking the time to look at this.
> > > > > >>> Appreciated.
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> Regarding throttling on both leader and follower,
> > > > > >> this
> > > > > >>>>>> proposal
> > > > > >>>>>>>>>>>>> covers
> > > > > >>>>>>>>>>>>>> a
> > > > > >>>>>>>>>>>>>>>> more general solution which can guarantee a quota,
> > > > > >> even
> > > > > >>>>> when
> > > > > >>>>>> a
> > > > > >>>>>>>>>>>>>> rebalance
> > > > > >>>>>>>>>>>>>>>> operation produces an asymmetric profile of load.
> > > > > >> This
> > > > > >>>>> means
> > > > > >>>>>>>>>>>>>>> administrators
> > > > > >>>>>>>>>>>>>>>> don’t need to calculate the impact that a
> > > > > >> follower-only
> > > > > >>>>> quota
> > > > > >>>>>>>>>>> will
> > > > > >>>>>>>>>>>>> have
> > > > > >>>>>>>>>>>>>>> on
> > > > > >>>>>>>>>>>>>>>> the leaders they are fetching from. So for example
> > > > > >>> where
> > > > > >>>>>>> replica
> > > > > >>>>>>>>>>>>> sizes
> > > > > >>>>>>>>>>>>>>> are
> > > > > >>>>>>>>>>>>>>>> skewed or where a partial rebalance is required.
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> Having said that, even with both leader and
> follower
> > > > > >>>>> quotas,
> > > > > >>>>>>> the
> > > > > >>>>>>>>>>>> use
> > > > > >>>>>>>>>>>>> of
> > > > > >>>>>>>>>>>>>>>> additional threads is actually optional. There
> > appear
> > > > > >>> to
> > > > > >>>> be
> > > > > >>>>>> two
> > > > > >>>>>>>>>>>>> general
> > > > > >>>>>>>>>>>>>>>> approaches (1) omit partitions from fetch requests
> > > > > >>>>>> (follower) /
> > > > > >>>>>>>>>>>> fetch
> > > > > >>>>>>>>>>>>>>>> responses (leader) when they exceed their quota
> (2)
> > > > > >>> delay
> > > > > >>>>>> them,
> > > > > >>>>>>>>>>> as
> > > > > >>>>>>>>>>>>> the
> > > > > >>>>>>>>>>>>>>>> existing quota mechanism does, using separate
> > > > > >> fetchers.
> > > > > >>>>> Both
> > > > > >>>>>>>>>>> appear
> > > > > >>>>>>>>>>>>>>> valid,
> > > > > >>>>>>>>>>>>>>>> but with slightly different design tradeoffs.
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> The issue with approach (1) is that it departs
> > > > > >> somewhat
> > > > > >>>>> from
> > > > > >>>>>>> the
> > > > > >>>>>>>>>>>>>> existing
> > > > > >>>>>>>>>>>>>>>> quotas implementation, and must include a notion
> of
> > > > > >>>>> fairness
> > > > > >>>>>>>>>>>> within,
> > > > > >>>>>>>>>>>>>> the
> > > > > >>>>>>>>>>>>>>>> now size-bounded, request and response. The issue
> > > > > >> with
> > > > > >>>> (2)
> > > > > >>>>> is
> > > > > >>>>>>>>>>>>>>> guaranteeing
> > > > > >>>>>>>>>>>>>>>> ordering of updates when replicas shift threads,
> but
> > > > > >>> this
> > > > > >>>>> is
> > > > > >>>>>>>>>>>> handled,
> > > > > >>>>>>>>>>>>>> for
> > > > > >>>>>>>>>>>>>>>> the most part, in the code today.
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> I’ve updated the rejected alternatives section to
> > > > > >> make
> > > > > >>>>> this a
> > > > > >>>>>>>>>>>> little
> > > > > >>>>>>>>>>>>>>>> clearer.
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>> B
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> On 8 Aug 2016, at 20:38, Joel Koshy <
> > > > > >>>> jjkosh...@gmail.com>
> > > > > >>>>>>>>>>> wrote:
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> Hi Ben,
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> Thanks for the detailed write-up. So the proposal
> > > > > >>>> involves
> > > > > >>>>>>>>>>>>>>>> self-throttling
> > > > > >>>>>>>>>>>>>>>>> on the fetcher side and throttling at the leader.
> > > > > >> Can
> > > > > >>>> you
> > > > > >>>>>>>>>>>> elaborate
> > > > > >>>>>>>>>>>>>> on
> > > > > >>>>>>>>>>>>>>>> the
> > > > > >>>>>>>>>>>>>>>>> reasoning that is given on the wiki: *“The
> throttle
> > > > > >> is
> > > > > >>>>>> applied
> > > > > >>>>>>>>>>> to
> > > > > >>>>>>>>>>>>>> both
> > > > > >>>>>>>>>>>>>>>>> leaders and followers. This allows the admin to
> > > > > >> exert
> > > > > >>>>> strong
> > > > > >>>>>>>>>>>>>> guarantees
> > > > > >>>>>>>>>>>>>>>> on
> > > > > >>>>>>>>>>>>>>>>> the throttle limit".* Is there any reason why one
> > or
> > > > > >>> the
> > > > > >>>>>> other
> > > > > >>>>>>>>>>>>>> wouldn't
> > > > > >>>>>>>>>>>>>>>> be
> > > > > >>>>>>>>>>>>>>>>> sufficient.
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> Specifically, if we were to only do
> self-throttling
> > > > > >> on
> > > > > >>>> the
> > > > > >>>>>>>>>>>>> fetchers,
> > > > > >>>>>>>>>>>>>> we
> > > > > >>>>>>>>>>>>>>>>> could potentially avoid the additional replica
> > > > > >>> fetchers
> > > > > >>>>>> right?
> > > > > >>>>>>>>>>>>> i.e.,
> > > > > >>>>>>>>>>>>>>> the
> > > > > >>>>>>>>>>>>>>>>> replica fetchers would maintain its quota metrics
> > as
> > > > > >>> you
> > > > > >>>>>>>>>>> proposed
> > > > > >>>>>>>>>>>>> and
> > > > > >>>>>>>>>>>>>>>> each
> > > > > >>>>>>>>>>>>>>>>> (normal) replica fetch presents an opportunity to
> > > > > >> make
> > > > > >>>>>>> progress
> > > > > >>>>>>>>>>>> for
> > > > > >>>>>>>>>>>>>> the
> > > > > >>>>>>>>>>>>>>>>> throttled partitions as long as their effective
> > > > > >>>>> consumption
> > > > > >>>>>>>>>>> rate
> > > > > >>>>>>>>>>>> is
> > > > > >>>>>>>>>>>>>>> below
> > > > > >>>>>>>>>>>>>>>>> the quota limit. If it exceeds the consumption
> rate
> > > > > >>> then
> > > > > >>>>>> don’t
> > > > > >>>>>>>>>>>>>> include
> > > > > >>>>>>>>>>>>>>>> the
> > > > > >>>>>>>>>>>>>>>>> throttled partitions in the subsequent fetch
> > > > > >> requests
> > > > > >>>>> until
> > > > > >>>>>>> the
> > > > > >>>>>>>>>>>>>>> effective
> > > > > >>>>>>>>>>>>>>>>> consumption rate for those partitions returns to
> > > > > >>> within
> > > > > >>>>> the
> > > > > >>>>>>>>>>> quota
> > > > > >>>>>>>>>>>>>>>> threshold.
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> I have more questions on the proposal, but was
> more
> > > > > >>>>>> interested
> > > > > >>>>>>>>>>> in
> > > > > >>>>>>>>>>>>> the
> > > > > >>>>>>>>>>>>>>>> above
> > > > > >>>>>>>>>>>>>>>>> to see if it could simplify things a bit.
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> Also, can you open up access to the google-doc
> that
> > > > > >>> you
> > > > > >>>>> link
> > > > > >>>>>>>>>>> to?
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> Thanks,
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> Joel
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>> On Mon, Aug 8, 2016 at 5:54 AM, Ben Stopford <
> > > > > >>>>>>> b...@confluent.io
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>> wrote:
> > > > > >>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>> We’ve created KIP-73: Replication Quotas
> > > > > >>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>> The idea is to allow an admin to throttle moving
> > > > > >>>>> replicas.
> > > > > >>>>>>>>>>> Full
> > > > > >>>>>>>>>>>>>>> details
> > > > > >>>>>>>>>>>>>>>>>> are here:
> > > > > >>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>> https://cwiki.apache.org/
> > > > > >>> confluence/display/KAFKA/KIP-
> > > > > >>>>> 73+
> > > > > >>>>>>>>>>>>>>>>>> Replication+Quotas <
> https://cwiki.apache.org/conf
> > > > > >>>>>>>>>>>>>>>>>> luence/display/KAFKA/KIP-73+Replication+Quotas>
> > > > > >>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>> Please take a look and let us know your
> thoughts.
> > > > > >>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>> Thanks
> > > > > >>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>> B
> > > > > >>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>> --
> > > > > >>>>>>>>>>>>> -Regards,
> > > > > >>>>>>>>>>>>> Mayuresh R. Gharat
> > > > > >>>>>>>>>>>>> (862) 250-7125
> > > > > >>>>>>>>>>>>>
> > > > > >>>>>>>>>>>>
> > > > > >>>>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>
> > > > > >>>>>>>>
> > > > > >>>>>>>>
> > > > > >>>>>>>> --
> > > > > >>>>>>>> -Regards,
> > > > > >>>>>>>> Mayuresh R. Gharat
> > > > > >>>>>>>> (862) 250-7125
> > > > > >>>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>
> > > > > >>>>>
> > > > > >>>>
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> --
> > > > > >>>> -Regards,
> > > > > >>>> Mayuresh R. Gharat
> > > > > >>>> (862) 250-7125
> > > > > >>>>
> > > > > >>>
> > > > > >>
> > > > > >>
> > > > > >> --
> > > > > >> Ben Stopford
> > > > > >>
> > > > >
> > > > >
> > > >
> > >
> >
> >
> >
> > --
> > *Todd Palino*
> > Staff Site Reliability Engineer
> > Data Infrastructure Streaming
> >
> >
> >
> > linkedin.com/in/toddpalino
> >
>



-- 
*Todd Palino*
Staff Site Reliability Engineer
Data Infrastructure Streaming



linkedin.com/in/toddpalino
