Hey Jason and Guozhang,

Jason is right, I took the inspiration for this from KIP-144 (
https://cwiki.apache.org/confluence/display/KAFKA/KIP-144%3A+Exponential+backoff+for+broker+reconnect+attempts)
which had the same logic in order to preserve the existing behavior. In
this case, however, if we want to eliminate the static backoff behavior
completely, we can do that and, as Jason mentioned, note it in the
release notes without adding any special logic. In addition, I agree that
we should take the larger of `retry.backoff.ms` and `retry.backoff.max.ms`
as the effective maximum. I'll update the KIP to reflect this and make it
clear that the old static retry backoff is being replaced by the new
dynamic retry backoff.
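
To make the "larger of the two" rule concrete, here is a minimal sketch in Python. The function names are hypothetical and this is not the client code; the doubling per failure is an assumption taken from the worked example earlier in this thread (100, 200, 400, 800, then capped).

```python
def effective_max_backoff_ms(retry_backoff_ms: int, retry_backoff_max_ms: int) -> int:
    """Cap to apply: the larger of the two configured values."""
    return max(retry_backoff_ms, retry_backoff_max_ms)


def retry_backoff(failures: int,
                  retry_backoff_ms: int = 100,
                  retry_backoff_max_ms: int = 1000) -> int:
    """Backoff after `failures` consecutive failures.

    Assumes the backoff doubles per failure, starting from
    retry.backoff.ms, and never exceeds the effective cap.
    """
    cap = effective_max_backoff_ms(retry_backoff_ms, retry_backoff_max_ms)
    return min(retry_backoff_ms * 2 ** failures, cap)
```

With the defaults, `retry_backoff(4)` is capped at 1000 ms. If a user has set `retry.backoff.ms` to 5000 and left the max at 1000, the cap becomes 5000, so the configured value is still respected and the backoff is effectively static at 5000 ms.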

Thanks,
Sanjana

On Thu, Mar 19, 2020 at 4:23 PM Jason Gustafson <ja...@confluent.io> wrote:

> Hey Guozhang,
>
> I was referring to this:
>
> > For users who have not set retry.backoff.ms explicitly, the default
> > behavior will change so that the backoff will grow up to 1000 ms. For
> > users who have set retry.backoff.ms explicitly, the behavior will remain
> > the same as they could have specific requirements.
>
> I took this to mean that for users who have overridden `retry.backoff.ms`
> to 50ms (say), we will change the default `retry.backoff.max.ms` to 50ms
> as well in order to preserve existing backoff behavior. Is that not right? In
> any case, I agree that we can use the maximum of the two values as the
> effective `retry.backoff.max.ms` to handle the case when the configured
> value of `retry.backoff.ms` is larger than the default of 1s.
>
> -Jason
>
>
>
>
> On Thu, Mar 19, 2020 at 3:29 PM Guozhang Wang <wangg...@gmail.com> wrote:
>
> > Hey Jason,
> >
> > My understanding is a bit different here: even if the user has explicitly
> > overridden "retry.backoff.ms", the exponential mechanism still triggers
> > and the backoff would be increased up to "retry.backoff.max.ms"; and if
> > the specified "retry.backoff.ms" is already larger than
> > "retry.backoff.max.ms", we would still take "retry.backoff.max.ms".
> >
> > So if the user does override "retry.backoff.ms" to a value larger than
> > 1s and is not aware of the new config, she would be surprised to see the
> > specified value seemingly not being respected, but she could still learn
> > about it afterwards by reading the release notes introducing this KIP.
> >
> >
> > Guozhang
> >
> > On Thu, Mar 19, 2020 at 3:10 PM Jason Gustafson <ja...@confluent.io>
> > wrote:
> >
> > > Hi Sanjana,
> > >
> > > > The KIP looks good to me. I had just one question about the default
> > > > behavior. As I understand, if the user has specified `retry.backoff.ms`
> > > > explicitly, then we will not apply the default max backoff. As such,
> > > > there's no way to get the benefit of this feature if you are providing
> > > > a `retry.backoff.ms` unless you also provide `retry.backoff.max.ms`.
> > > > That makes sense if you assume the user is unaware of the new
> > > > configuration, but it is surprising otherwise. Since it's not a
> > > > semantic change and since the default you're proposing of 1s is fairly
> > > > low already, I wonder if it's good enough to mention the new
> > > > configuration in the release notes and not add any special logic. What
> > > > do you think?
> > >
> > > -Jason
> > >
> > > > On Thu, Mar 19, 2020 at 1:56 PM Sanjana Kaundinya <skaundi...@gmail.com>
> > > > wrote:
> > > >
> > > > > Thank you for the comments Guozhang.
> > > > >
> > > > > I’ll leave this KIP out for discussion till the end of the week and
> > > > > then start a vote for this early next week.
> > > >
> > > > Sanjana
> > > >
> > > > > On Mar 18, 2020, 3:38 PM -0700, Guozhang Wang <wangg...@gmail.com>,
> > > > > wrote:
> > > > > > Hello Sanjana,
> > > > > >
> > > > > > Thanks for the proposed KIP, I think that makes a lot of sense. As
> > > > > > you mentioned in the motivation, we've indeed seen many issues with
> > > > > > regard to frequent retries. With bounded exponential backoff, in a
> > > > > > scenario where there's a long connectivity issue, we would
> > > > > > effectively reduce the request load by a factor of 10 given the
> > > > > > default configs.
> > > > >
> > > > > > For the higher-level Streams client and Connect frameworks, today
> > > > > > we also have retry logic, but it is used in a slightly different
> > > > > > way. For example, in Streams we tend to handle the retry logic at
> > > > > > the thread level, and hence very likely we'd like to change that
> > > > > > mechanism in KIP-572 anyway. For producer / consumer / admin
> > > > > > clients, I think just applying this behavioral change across these
> > > > > > clients makes a lot of sense. So I think we can just leave Streams /
> > > > > > Connect out of the scope of this KIP, to be addressed in separate
> > > > > > discussions.
> > > > >
> > > > > I do not have further comments about this KIP :) LGTM.
> > > > >
> > > > > Guozhang
> > > > >
> > > > >
> > > > > > On Wed, Mar 18, 2020 at 12:09 AM Sanjana Kaundinya <skaundi...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Thanks for the feedback Boyang.
> > > > > > >
> > > > > > > If there’s anyone else who has feedback regarding this KIP, I
> > > > > > > would really appreciate hearing it!
> > > > > >
> > > > > > Thanks,
> > > > > > Sanjana
> > > > > >
> > > > > > > On Tue, Mar 17, 2020 at 11:38 PM Boyang Chen <bche...@outlook.com> wrote:
> > > > > >
> > > > > > > Sounds great!
> > > > > > >
> > > > > > > ________________________________
> > > > > > > From: Sanjana Kaundinya <skaundi...@gmail.com>
> > > > > > > Sent: Tuesday, March 17, 2020 5:54:35 PM
> > > > > > > To: dev@kafka.apache.org <dev@kafka.apache.org>
> > > > > > > > Subject: Re: [DISCUSS] KIP-580: Exponential Backoff for Kafka Clients
> > > > > > >
> > > > > > > > Thanks for the explanation Boyang. One of the most common
> > > > > > > > problems that we have in Kafka is with respect to metadata
> > > > > > > > fetches. For example, if there is a broker failure, all clients
> > > > > > > > start to fetch metadata at the same time, and it often takes a
> > > > > > > > while for the metadata to converge. In a high-load cluster,
> > > > > > > > there are also issues where the volume of metadata has made
> > > > > > > > convergence of metadata slower.
> > > > > > >
> > > > > > > > For this case, exponential backoff helps as it reduces the
> > > > > > > > retry rate and spaces out how often clients will retry, thereby
> > > > > > > > bringing down the time for convergence. Something Jason
> > > > > > > > mentioned that would be a great addition here is for the
> > > > > > > > backoff to be “jittered”, as it was in KIP-144 with respect to
> > > > > > > > exponential reconnect backoff. This would help prevent the
> > > > > > > > clients from being synchronized on when they retry, thereby
> > > > > > > > spreading out the requests being sent to the broker at the same
> > > > > > > > time.
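
A small sketch of what jittering could look like, in Python. The function name is hypothetical and the 0.2 jitter factor is an assumption chosen for illustration, not a value stated in this thread.

```python
import random


def jittered_backoff_ms(base_ms: float, jitter: float = 0.2, rng=random) -> float:
    """Scale base_ms by a random factor drawn from [1 - jitter, 1 + jitter].

    `rng` can be the `random` module or a seeded `random.Random` instance,
    which makes the behavior reproducible in tests.
    """
    return base_ms * rng.uniform(1.0 - jitter, 1.0 + jitter)


# Two clients that fail at the same moment are unlikely to retry together,
# because each draws its own random factor.
client_a = jittered_backoff_ms(400, rng=random.Random(1))
client_b = jittered_backoff_ms(400, rng=random.Random(2))
```

With a base backoff of 400 ms and the assumed factor of 0.2, each client waits somewhere between 320 ms and 480 ms instead of exactly 400 ms.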
> > > > > > >
> > > > > > > > I’ll add this example to the KIP and flesh out more of the
> > > > > > > > details so it’s clearer.
> > > > > > >
> > > > > > > > On Mar 17, 2020, 1:24 PM -0700, Boyang Chen <reluctanthero...@gmail.com>,
> > > > > > > > wrote:
> > > > > > > > > Thanks for the reply Sanjana. I guess I would like to
> > > > > > > > > rephrase my questions 2 and 3, as my previous response was a
> > > > > > > > > little bit unactionable.
> > > > > > > > >
> > > > > > > > > My specific point is that exponential backoff is not a silver
> > > > > > > > > bullet and we should consider using it to solve known
> > > > > > > > > problems, instead of making holistic changes to all clients
> > > > > > > > > in the Kafka ecosystem. I do like the exponential backoff
> > > > > > > > > idea and believe it would be of great value, but maybe we
> > > > > > > > > should focus on identifying existing modules that are
> > > > > > > > > suffering from static retry, and only change them in this
> > > > > > > > > first KIP. If in the future users of some other component
> > > > > > > > > believe they are also suffering, we could have more minor
> > > > > > > > > KIPs to change the behavior as well.
> > > > > > > >
> > > > > > > > Boyang
> > > > > > > >
> > > > > > > > > On Sun, Mar 15, 2020 at 12:07 AM Sanjana Kaundinya <skaundi...@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Thanks for the feedback Boyang, I will revise the KIP with
> > > > > > > > > > the mathematical relations as per your suggestion. To
> > > > > > > > > > address your feedback:
> > > > > > > > > >
> > > > > > > > > > 1. Currently, with the default of 100 ms per retry backoff,
> > > > > > > > > > in 1 second we would have 10 retries. In the case of using
> > > > > > > > > > an exponential backoff, we would have a total of 4 retries
> > > > > > > > > > in 1 second. Thus we have less than half the number of
> > > > > > > > > > retries in the same timeframe and can lessen broker
> > > > > > > > > > pressure. This calculation is done as follows (using the
> > > > > > > > > > formula laid out in the KIP):
> > > > > > > > >
> > > > > > > > > > Try 1 at time 0 ms, failures = 0, next retry in 100 ms (default retry ms is initially 100 ms)
> > > > > > > > > > Try 2 at time 100 ms, failures = 1, next retry in 200 ms
> > > > > > > > > > Try 3 at time 300 ms, failures = 2, next retry in 400 ms
> > > > > > > > > > Try 4 at time 700 ms, failures = 3, next retry in 800 ms
> > > > > > > > > > Try 5 at time 1500 ms, failures = 4, next retry in 1000 ms (default max retry ms is 1000 ms)
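
The schedule above can be reproduced with a few lines of Python. This is a sketch under the assumption that the backoff doubles on every failure and is capped at the max; the function name is hypothetical, and jitter is left out so the numbers match the list exactly.

```python
def retry_schedule(attempts: int, initial_ms: int = 100, max_ms: int = 1000):
    """Return (time_of_try_ms, next_backoff_ms) for each try, without jitter."""
    schedule = []
    now_ms = 0
    for failures in range(attempts):
        # Double the initial backoff per failure, capped at max_ms.
        backoff = min(initial_ms * 2 ** failures, max_ms)
        schedule.append((now_ms, backoff))
        now_ms += backoff
    return schedule


for n, (t, b) in enumerate(retry_schedule(5), start=1):
    print(f"Try {n} at time {t} ms, failures = {n - 1}, next retry in {b} ms")
```

Setting `max_ms` equal to `initial_ms` gives the old static behavior, which makes the comparison direct: 10 tries start within the first second with a static 100 ms backoff, versus 4 with the exponential schedule.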
> > > > > > > > >
> > > > > > > > > > For 2 and 3, could you elaborate more on what you mean with
> > > > > > > > > > respect to client timeouts? I’m not very familiar with the
> > > > > > > > > > Streams framework, so I would love to get more insight into
> > > > > > > > > > how that currently works with respect to producer
> > > > > > > > > > transactions, so I can appropriately update the KIP to
> > > > > > > > > > address these scenarios.
> > > > > > > > > >
> > > > > > > > > > On Mar 13, 2020, 7:15 PM -0700, Boyang Chen <reluctanthero...@gmail.com>,
> > > > > > > > > > wrote:
> > > > > > > > > > > Thanks for the KIP Sanjana. I think the motivation is
> > > > > > > > > > > good, but it lacks more quantitative analysis. For
> > > > > > > > > > > instance:
> > > > > > > > > > >
> > > > > > > > > > > 1. How many retries are we saving by applying exponential
> > > > > > > > > > > retry vs static retry? There should be some mathematical
> > > > > > > > > > > relation between the static retry ms, the initial
> > > > > > > > > > > exponential retry ms, and the max exponential retry ms in
> > > > > > > > > > > a given time interval.
> > > > > > > > > > > 2. How does this affect the client timeout? With
> > > > > > > > > > > exponential retry, the client becomes more likely to time
> > > > > > > > > > > out in a parent-level caller; for instance, Streams
> > > > > > > > > > > attempts to retry initializing producer transactions
> > > > > > > > > > > within a given 5 minute interval. With exponential retry
> > > > > > > > > > > this mechanism could experience more frequent timeouts,
> > > > > > > > > > > which we should be careful with.
> > > > > > > > > > > 3. With regard to #2, we should have a more detailed
> > > > > > > > > > > checklist of all the existing static retry scenarios, and
> > > > > > > > > > > adjust the initial exponential retry ms to make sure we
> > > > > > > > > > > won't easily time out at a higher level due to too few
> > > > > > > > > > > attempts.
> > > > > > > > > >
> > > > > > > > > > Boyang
> > > > > > > > > >
> > > > > > > > > > > On Fri, Mar 13, 2020 at 4:38 PM Sanjana Kaundinya <skaundi...@gmail.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Everyone,
> > > > > > > > > > > >
> > > > > > > > > > > > I’ve written a KIP about introducing exponential
> > > > > > > > > > > > backoff for Kafka clients. Would appreciate any
> > > > > > > > > > > > feedback on this.
> > > > > > > > > > > >
> > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-580%3A+Exponential+Backoff+for+Kafka+Clients
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Sanjana
> > > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > -- Guozhang
> > > >
> > >
> >
> >
> > --
> > -- Guozhang
> >
>
