Re: [VOTE] KIP-219 - Improve Quota Communication

Becket Qin Fri, 27 Apr 2018 11:37:45 -0700

The reason we wanted to bump up all the request versions was let the
clients know whether the broker has already throttled the request or not.
This avoids double throttling in both brokers and clients, if the clients
implementation supports KIP-219.


The new change uses only ApiVersionRequest, instead of all the requests, to
indicate whether the brokers have applied throttling or not. The difference
is that all the clients must issue ApiVersionRequest first before issuing
any other requests. This has no impact on the existing clients. But for
clients implementation that wants to support KIP-219, they need to follow
this practice, which seems to be reasonable.
Initially I thought the change is fine. However, thinking about this again,
that means the clients implementation needs to remember the ApiVersions of
each broker, which may have some complexity. Also this seems introduces the
behavior dependency between different requests, which seems unnecessary.

Due to the above reasons, I think it might be better to bump up all the
request versions.

Thanks,

Jiangjie (Becket) Qin


On Wed, Apr 25, 2018 at 3:19 PM, Jonghyun Lee <jonghy...@gmail.com> wrote:

> Hello,
>
> Per Ismael's suggestion, I'd like to get comments from the original voters
> for KIP-219 (Becket, Jun, Rajini) and others about the new interface change
> proposed in the discussion thread (
> https://markmail.org/search/?q=kafka+KIP-219#query:kafka%
> 20KIP-219+page:1+mid:5rwju2gwpicojr3f+state:results).
> Would you let me know?
>
> Thanks,
> Jon
>
>
> On Wed, Apr 25, 2018 at 2:43 PM, Dong Lin <lindon...@gmail.com> wrote:
>
> > +Jon
> >
> > On Mon, 22 Jan 2018 at 10:38 AM Becket Qin <becket....@gmail.com> wrote:
> >
> >> Thanks for the discussion and voting.
> >>
> >> KIP-219 has passed with +3 binding votes (Becket, Jun, Rajini).
> >>
> >> On Thu, Jan 18, 2018 at 1:32 AM, Rajini Sivaram <
> rajinisiva...@gmail.com>
> >> wrote:
> >>
> >> > Hi Becket,
> >> >
> >> > Thanks for the update. Yes, it does address my concern.
> >> >
> >> > +1 (binding)
> >> >
> >> > Regards,
> >> >
> >> > Rajini
> >> >
> >> > On Wed, Jan 17, 2018 at 5:24 PM, Becket Qin <becket....@gmail.com>
> >> wrote:
> >> >
> >> > > Actually returning an empty fetch request may still be useful to
> >> reduce
> >> > the
> >> > > throttle time due to request quota violation because the
> FetchResponse
> >> > send
> >> > > time will be less. I just updated the KIP.
> >> > >
> >> > > Rajini, does that address your concern?
> >> > >
> >> > > On Tue, Jan 16, 2018 at 7:01 PM, Becket Qin <becket....@gmail.com>
> >> > wrote:
> >> > >
> >> > > > Thanks for the reply, Jun.
> >> > > >
> >> > > > Currently the byte rate quota does not apply to HeartbeatRequest,
> >> > > > JoinGroupRequest/SyncGroupRequest. So the only case those
> requests
> >> are
> >> > > > throttled is because the request quota is violated. In that case,
> >> the
> >> > > > throttle time does not really matter whether we return a full
> >> > > FetchResponse
> >> > > > or an empty one. Would it be more consistent if we throttle based
> on
> >> > the
> >> > > > actual throttle time / channel mute time?
> >> > > >
> >> > > > Thanks,
> >> > > >
> >> > > > Jiangjie (Becket) Qin
> >> > > >
> >> > > > On Tue, Jan 16, 2018 at 3:45 PM, Jun Rao <j...@confluent.io>
> wrote:
> >> > > >
> >> > > >> Hi, Jiangjie,
> >> > > >>
> >> > > >> You are right that the heartbeat is done in a channel different
> >> from
> >> > the
> >> > > >> fetch request. I think it's still useful to return an empty fetch
> >> > > response
> >> > > >> when the quota is violated. This way, the throttle time for the
> >> > > heartbeat
> >> > > >> request won't be large. I agree that we can just mute the channel
> >> for
> >> > > the
> >> > > >> fetch request for the throttle time computed based on a full
> fetch
> >> > > >> response. This probably also partially addresses Rajini's #1
> >> concern.
> >> > > >>
> >> > > >> Thanks,
> >> > > >>
> >> > > >> Jun
> >> > > >>
> >> > > >> On Mon, Jan 15, 2018 at 9:27 PM, Becket Qin <
> becket....@gmail.com>
> >> > > wrote:
> >> > > >>
> >> > > >> > Hi Rajini,
> >> > > >> >
> >> > > >> > Thanks for the comments. Pleas see the reply inline.
> >> > > >> >
> >> > > >> > Hi Jun,
> >> > > >> >
> >> > > >> > Thinking about the consumer rebalance case a bit more, I am not
> >> sure
> >> > > if
> >> > > >> we
> >> > > >> > need to worry about the delayed rebalance due to quota
> violation
> >> or
> >> > > not.
> >> > > >> > The rebalance actually uses a separate channel to coordinator.
> >> > > Therefore
> >> > > >> > unless the request quota is hit, the rebalance won't be
> >> throttled.
> >> > > Even
> >> > > >> if
> >> > > >> > request quota is hit, it seems unlikely to be delayed long. So
> it
> >> > > looks
> >> > > >> > that we don't need to unmute the channel earlier than needed.
> >> Does
> >> > > this
> >> > > >> > address your concern?
> >> > > >> >
> >> > > >> > Thanks,
> >> > > >> >
> >> > > >> > Jiangjie (Becket) Qin
> >> > > >> >
> >> > > >> > On Mon, Jan 15, 2018 at 4:22 AM, Rajini Sivaram <
> >> > > >> rajinisiva...@gmail.com>
> >> > > >> > wrote:
> >> > > >> >
> >> > > >> > > Hi Becket,
> >> > > >> > >
> >> > > >> > > A few questions:
> >> > > >> > >
> >> > > >> > > 1. KIP says: *Although older client implementations (prior to
> >> > > >> knowledge
> >> > > >> > of
> >> > > >> > > this KIP) will immediately send the next request after the
> >> broker
> >> > > >> > responds
> >> > > >> > > without paying attention to the throttle time field, the
> >> broker is
> >> > > >> > > protected by virtue of muting the channel for time X. i.e.,
> the
> >> > next
> >> > > >> > > request will not be processed until the channel is unmuted. *
> >> > > >> > > For fetch requests, the response is sent immediately and the
> >> mute
> >> > > >> time on
> >> > > >> > > the broker based on empty fetch response will often be zero
> >> > (unlike
> >> > > >> the
> >> > > >> > > throttle time returned to the client based on non-empty
> >> response).
> >> > > >> Won't
> >> > > >> > > that lead to a tight loop of fetch requests from old clients
> >> > > >> > (particularly
> >> > > >> > > expensive with SSL)? Wouldn't it be better to retain current
> >> > > behaviour
> >> > > >> > for
> >> > > >> > > old clients? Also, if we change the behaviour for old
> clients,
> >> > > client
> >> > > >> > > metrics that track throttle time will report a lot of
> throttle
> >> > > >> unrelated
> >> > > >> > to
> >> > > >> > >  actual throttle time.
> >> > > >> > >
> >> > > >> > For consumers, if quota is violated, the throttle time on the
> >> broker
> >> > > >> will
> >> > > >> > not be 0. It is just that the throttle time will not be
> >> increasing
> >> > > >> because
> >> > > >> > the consumer will return an empty response in this case. So
> there
> >> > > should
> >> > > >> > not be a tight loop.
> >> > > >> >
> >> > > >> >
> >> > > >> > > 2. KIP says: *The usual idle timeout i.e.,
> >> > connections.max.idle.ms
> >> > > >> > > <http://connections.max.idle.ms> will still be honored
> during
> >> the
> >> > > >> > throttle
> >> > > >> > > time X. This makes sure that the brokers will detect client
> >> > > connection
> >> > > >> > > closure in a bounded time.*
> >> > > >> > > Wouldn't it be better to bound maximum throttle time to
> >> > > >> > > *connections.max.idle.ms
> >> > > >> > > <http://connections.max.idle.ms>*? If we mute for a time
> >> greater
> >> > > than
> >> > > >> > > *connections.max.idle.ms
> >> > > >> > > <http://connections.max.idle.ms>* and then close a client
> >> > > connection
> >> > > >> > > simply
> >> > > >> > > because we had muted it on the broker for a longer throttle
> >> time,
> >> > we
> >> > > >> > force
> >> > > >> > > a reconnection and read the next request from the new
> >> connection
> >> > > >> straight
> >> > > >> > > away. This feels the same as a bound on the quota of *
> >> > > >> > > connections.max.idle.ms
> >> > > >> > > <http://connections.max.idle.ms>*, but with added load on
> the
> >> > > broker
> >> > > >> for
> >> > > >> > > authenticating another connection (expensive with SSL).
> >> > > >> > >
> >> > > >> > I think we need to think about the consumer prior to and after
> >> this
> >> > > KIP
> >> > > >> > separately.
> >> > > >> >
> >> > > >> > For consumer prior to this KIP, even if we unmute the channel
> >> after
> >> > > >> > connection.max.idle.ms, it is likely that the consumers have
> >> > already
> >> > > >> > reached request.timeout.ms before that and has reconnected to
> >> the
> >> > > >> broker.
> >> > > >> > So there is no real difference whether we close the throttled
> >> > channel
> >> > > or
> >> > > >> > not.
> >> > > >> >
> >> > > >> > For consumers after the KIP, because they will honor the
> throttle
> >> > > time,
> >> > > >> > they will back off until throttle time is reached. If that
> >> throttle
> >> > > >> time is
> >> > > >> > longer than connection.max.idle.ms, it seems not a big
> overhead
> >> > > because
> >> > > >> > there will only be one connection re-establishment in quite a
> few
> >> > > >> minutes.
> >> > > >> > Compared with such overhead, it seems more important to honor
> the
> >> > > quota
> >> > > >> so
> >> > > >> > the broker can survive.
> >> > > >> >
> >> > > >> >
> >> > > >> > > 3. Are we changing the behaviour of network bandwidth quota
> for
> >> > > >> > > Produce/Fetch and retaining current behaviour for request
> >> quotas?
> >> > > >> > >
> >> > > >> > This is going to be applied to all the quota. Including request
> >> > > quotas.
> >> > > >> >
> >> > > >> >
> >> > > >> > >
> >> > > >> > > Thanks,
> >> > > >> > >
> >> > > >> > > Rajini
> >> > > >> > >
> >> > > >> > >
> >> > > >> > >
> >> > > >> > > On Wed, Jan 10, 2018 at 10:29 PM, Jun Rao <j...@confluent.io>
> >> > wrote:
> >> > > >> > >
> >> > > >> > > > Hi, Jiangjie,
> >> > > >> > > >
> >> > > >> > > > Thanks for the updated KIP. +1
> >> > > >> > > >
> >> > > >> > > > Jun
> >> > > >> > > >
> >> > > >> > > > On Mon, Jan 8, 2018 at 7:45 PM, Becket Qin <
> >> > becket....@gmail.com>
> >> > > >> > wrote:
> >> > > >> > > >
> >> > > >> > > > > Thanks for the comments, Jun.
> >> > > >> > > > >
> >> > > >> > > > > 1. Good point.
> >> > > >> > > > > 2. Also makes sense. Usually the connection.max.idle.ms
> is
> >> > high
> >> > > >> > enough
> >> > > >> > > > so
> >> > > >> > > > > the throttling is impacted.
> >> > > >> > > > >
> >> > > >> > > > > I have updated the KIP to reflect the changes.
> >> > > >> > > > >
> >> > > >> > > > > Thanks,
> >> > > >> > > > >
> >> > > >> > > > > Jiangjie (Becket) Qin
> >> > > >> > > > >
> >> > > >> > > > >
> >> > > >> > > > > On Mon, Jan 8, 2018 at 6:30 PM, Jun Rao <
> j...@confluent.io>
> >> > > wrote:
> >> > > >> > > > >
> >> > > >> > > > > > Hi, Jiangjie,
> >> > > >> > > > > >
> >> > > >> > > > > > Sorry for the late response. The proposal sounds good
> >> > > overall. A
> >> > > >> > > couple
> >> > > >> > > > > of
> >> > > >> > > > > > minor comments below.
> >> > > >> > > > > >
> >> > > >> > > > > > 1. For throttling a fetch request, we could potentially
> >> just
> >> > > >> send
> >> > > >> > an
> >> > > >> > > > > empty
> >> > > >> > > > > > response. We can return a throttle time calculated
> from a
> >> > full
> >> > > >> > > > response,
> >> > > >> > > > > > but only mute the channel on the server based on a
> >> throttle
> >> > > time
> >> > > >> > > > > calculated
> >> > > >> > > > > > based on the empty response. This has the benefit that
> >> the
> >> > > >> server
> >> > > >> > > will
> >> > > >> > > > > mute
> >> > > >> > > > > > the channel much shorter, which will prevent the
> consumer
> >> > from
> >> > > >> > > > > rebalancing
> >> > > >> > > > > > when throttled.
> >> > > >> > > > > >
> >> > > >> > > > > > 2. The wiki says "connections.max.idle.ms should be
> >> ignored
> >> > > >> during
> >> > > >> > > the
> >> > > >> > > > > > throttle time X." This has the potential issue that a
> >> server
> >> > > may
> >> > > >> > not
> >> > > >> > > > > detect
> >> > > >> > > > > > that a client connection is already gone until after an
> >> > > >> arbitrary
> >> > > >> > > > amount
> >> > > >> > > > > of
> >> > > >> > > > > > time. Perhaps we could still just close a connection if
> >> the
> >> > > >> server
> >> > > >> > > has
> >> > > >> > > > > > muted it for longer than connections.max.idle.ms. This
> >> will
> >> > > at
> >> > > >> > least
> >> > > >> > > > > bound
> >> > > >> > > > > > the time for a server to detect closed client
> >> connections.
> >> > > >> > > > > >
> >> > > >> > > > > > Thanks,
> >> > > >> > > > > >
> >> > > >> > > > > > Jun
> >> > > >> > > > > >
> >> > > >> > > > > >
> >> > > >> > > > > > On Mon, Nov 20, 2017 at 5:30 PM, Becket Qin <
> >> > > >> becket....@gmail.com>
> >> > > >> > > > > wrote:
> >> > > >> > > > > >
> >> > > >> > > > > > > Hi,
> >> > > >> > > > > > >
> >> > > >> > > > > > > We would like to start the voting thread for KIP-219.
> >> The
> >> > > KIP
> >> > > >> > > > proposes
> >> > > >> > > > > to
> >> > > >> > > > > > > improve the quota communication between the brokers
> and
> >> > > >> clients,
> >> > > >> > > > > > especially
> >> > > >> > > > > > > for cases of long throttling time.
> >> > > >> > > > > > >
> >> > > >> > > > > > > The KIP wiki is following:
> >> > > >> > > > > > > https://cwiki.apache.org/
> confluence/display/KAFKA/KIP-
> >> > > >> > > > > > 219+-+Improve+quota+
> >> > > >> > > > > > > communication
> >> > > >> > > > > > >
> >> > > >> > > > > > > The discussion thread is here:
> >> > > >> > > > > > > http://markmail.org/search/?q=
> >> kafka+KIP-219#query:kafka%
> >> > > >> > > > > > > 20KIP-219+page:1+mid:ooxabguy7nz7l7zy+state:results
> >> > > >> > > > > > >
> >> > > >> > > > > > > Thanks,
> >> > > >> > > > > > >
> >> > > >> > > > > > > Jiangjie (Becket) Qin
> >> > > >> > > > > > >
> >> > > >> > > > > >
> >> > > >> > > > >
> >> > > >> > > >
> >> > > >> > >
> >> > > >> >
> >> > > >>
> >> > > >
> >> > > >
> >> > >
> >> >
> >>
> >
>

Re: [VOTE] KIP-219 - Improve Quota Communication

Reply via email to