Thanks! +1 (binding) from myself.
I am closing the vote as accepted with 3 binding and 3 non-binding votes.

binding:
- John
- Guozhang
- Matthias

non-binding:
- Sophie
- Boyang
- Bruno

-Matthias

On 6/4/20 5:26 PM, Matthias J. Sax wrote:
> Guozhang,
>
> what you propose makes sense, but this seems to get into implementation
> detail territory already?
>
> Thus, if there are no further change requests to the KIP wiki page
> itself, I would like to proceed with the VOTE.
>
>
> -Matthias
>
> On 5/20/20 12:30 PM, Guozhang Wang wrote:
>> Thanks Matthias,
>>
>> I agree with you on all the bullet points above. Regarding the
>> admin-client outer-loop retries inside the partition assignor, I think
>> we should treat error codes differently for those two blocking calls:
>>
>> Describe-topic:
>> * unknown-topic (3): add this topic to the to-be-created topic list.
>> * leader-not-available (5): do not try to create, retry in the outer loop.
>> * request-timeout: break the current loop and retry in the outer loop.
>> * others: fatal error.
>>
>> Create-topic:
>> * topic-already-exists: retry in the outer loop to validate that
>>   num.partitions matches the expectation.
>> * request-timeout: break the current loop and retry in the outer loop.
>> * others: fatal error.
>>
>> And in the outer loop, I think we can have a global timer for the whole
>> "assign()" function, not only for the internal-topic-manager, and the
>> timer can be hard-coded with, e.g., half of the rebalance.timeout to get
>> rid of the `retries`; if we cannot complete the assignment before the
>> timeout runs out, we can return just a partial assignment (e.g., if
>> there are two tasks, but we can only get the topic metadata for one of
>> them, then just do the assignment for that one only) while encoding in
>> the error-code field a request for another rebalance.
>>
>> Guozhang
>>
>>
>> On Mon, May 18, 2020 at 7:26 PM Matthias J. Sax <mj...@apache.org> wrote:
>>
>>> No worries Guozhang, any feedback is always very welcome! My reply is
>>> going to be a little longer... Sorry.
>>>
>>>
>>>> 1) There are some inconsistent statements in the proposal regarding
>>>> what to deprecate:
>>>
>>> The proposal of the KIP is to deprecate `retries` for producer, admin,
>>> and Streams. Maybe the confusion is about the dependency of those
>>> settings within Streams, and that we handle the deprecation somewhat
>>> differently for plain clients vs. Streams:
>>>
>>> For the plain producer/admin client, the default `retries` is set to
>>> MAX_VALUE. The config will be deprecated but still be respected.
>>>
>>> For Streams, the default `retries` is set to zero; however, this
>>> default does _not_ affect the embedded producer/admin clients -- both
>>> stay on their own default of MAX_VALUE.
>>>
>>> Currently, this introduces the issue that if a user wants to increase
>>> Streams retries, she might by accident reduce the embedded client
>>> retries, too. To avoid this issue, she would need to set
>>>
>>> retries=100
>>> producer.retries=MAX_VALUE
>>> admin.retries=MAX_VALUE
>>>
>>> This KIP will fix this issue only in the long term though, i.e., when
>>> `retries` is finally removed. Short term, using `retries` in
>>> StreamsConfig would still affect the embedded clients, but Streams as
>>> well as both clients would log a WARN message. This preserves backward
>>> compatibility.
>>>
>>> Within Streams, `retries` is ignored and the new `task.timeout.ms` is
>>> used instead. This increases the default resilience of Kafka Streams
>>> itself.
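(A minimal sketch in Java of the configuration workaround described above.
The prefix helpers are real StreamsConfig APIs; the `task.timeout.ms` entry
is illustrative only, since no constant exists for the new config yet.)

    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.streams.StreamsConfig;

    public class RetriesConfigExample {
        public static Properties streamsProps() {
            final Properties props = new Properties();
            // Raising the Streams-level `retries` today also applies to the
            // embedded clients, so they have to be pinned back explicitly:
            props.put(StreamsConfig.RETRIES_CONFIG, 100);
            props.put(StreamsConfig.producerPrefix(ProducerConfig.RETRIES_CONFIG),
                      Integer.MAX_VALUE);
            props.put(StreamsConfig.adminClientPrefix(AdminClientConfig.RETRIES_CONFIG),
                      Integer.MAX_VALUE);
            // With KIP-572, Streams itself would instead be bounded by the new
            // config (literal string, as the constant is not part of the API yet):
            props.put("task.timeout.ms", 300_000L);
            return props;
        }
    }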
>>> We could also achieve this by still respecting `retries` and changing
>>> its default value. However, because we deprecate `retries`, it seems
>>> better to just ignore it and switch to the new config directly.
>>>
>>> I updated the KIP with some more details.
>>>
>>>
>>>> 2) We should also document the related behavior change in
>>>> PartitionAssignor that uses AdminClient.
>>>
>>> This is actually a good point. Originally, I looked into this only
>>> briefly, but it raised an interesting question on how to handle it.
>>>
>>> Note that `TimeoutExceptions` are currently not handled in this retry
>>> loop. Also note that the retries value for other errors would be
>>> MAX_VALUE by default (inherited from `AdminClient#retries`, as
>>> mentioned already by Guozhang).
>>>
>>> Applying the new `task.timeout.ms` config does not seem to be
>>> appropriate, because the AdminClient is used during a rebalance in the
>>> leader. We could introduce a new config just for this case, but it
>>> seems to be a little bit too much. Furthermore, the group coordinator
>>> applies a timeout on the leader anyway: if the assignment is not
>>> computed within the timeout, the leader is removed from the group and
>>> another rebalance is triggered.
>>>
>>> Overall, we make multiple admin client calls, and thus we should keep
>>> some retry logic and not just rely on the admin client's internal
>>> retries, as those would fall short of retrying different calls
>>> interleaved. We could just retry infinitely and rely on the group
>>> coordinator to remove the leader eventually. However, this does not
>>> seem to be ideal, because the removed leader might be stuck forever.
>>>
>>> The question though is: if topic metadata cannot be obtained or
>>> internal topics cannot be created, what should we do? We can't compute
>>> an assignment anyway. We already have a rebalance error code to shut
>>> down all instances for this case. Maybe we could break the retry loop
>>> before the leader is kicked out of the group and send this error code?
>>> This way we don't need a new config, but piggy-back on the existing
>>> timeout to compute the assignment. To be conservative, we could use a
>>> 50% threshold?
>>>
>>>
>>>> BTW as I mentioned in the previous statement, today throwing an
>>>> exception that kills one thread but not the whole instance is still an
>>>> issue for monitoring purposes, but I suppose this is not going to be
>>>> in this KIP but addressed by another KIP, right?
>>>
>>> Correct. This issue is out of scope.
>>>
>>>
>>>> for consumer, if it gets a TimeoutException polling records, would it
>>>> start timing all tasks since that single consumer would affect all
>>>> tasks?
>>>
>>> Consumer#poll() would never throw a `TimeoutException`, and thus
>>> `task.timeout.ms` does not apply.
>>>
>>>
>>>> For other blocking calls like `endOffsets()` etc., they are usually
>>>> also issued on behalf of a batch of tasks, so if that gets a timeout
>>>> exception should we start ticking all the corresponding tasks as well?
>>>> Maybe worth clarifying a bit more in the wiki.
>>>
>>> Good point. I agree that the timer should tick for all affected tasks.
>>> I clarified this in the KIP.
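(A sketch of this "timer ticks for all affected tasks" semantics. All names
below -- the tracker class, the task stub, and its fields -- are
hypothetical and not actual Kafka Streams internals; only the
TimeoutException type is real.)

    import java.util.Collection;
    import org.apache.kafka.common.errors.TimeoutException;

    class TaskTimeoutTracker {
        private static final long NO_DEADLINE = -1L;
        private final long taskTimeoutMs; // value of task.timeout.ms

        TaskTimeoutTracker(final long taskTimeoutMs) {
            this.taskTimeoutMs = taskTimeoutMs;
        }

        // Called when a blocking client call (e.g. endOffsets()) issued on
        // behalf of a batch of tasks times out: the timer starts, or keeps
        // ticking, for *all* tasks the call was made for.
        void recordTimeout(final Collection<StreamTaskStub> affectedTasks,
                           final long nowMs) {
            for (final StreamTaskStub task : affectedTasks) {
                if (taskTimeoutMs == 0L) {
                    // `0` means: do not retry at all
                    throw new TimeoutException("task " + task.id + " failed");
                } else if (task.deadlineMs == NO_DEADLINE) {
                    task.deadlineMs = nowMs + taskTimeoutMs; // first error: start timer
                } else if (nowMs >= task.deadlineMs) {
                    throw new TimeoutException(
                        "task " + task.id + " made no progress within task.timeout.ms");
                }
                // otherwise: swallow the error and retry on the next iteration
            }
        }

        // Any successful progress resets the task's timer.
        void recordProgress(final StreamTaskStub task) {
            task.deadlineMs = NO_DEADLINE;
        }

        static class StreamTaskStub {
            final String id;
            long deadlineMs = NO_DEADLINE;
            StreamTaskStub(final String id) { this.id = id; }
        }
    }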
>>>
>>>
>>> About KAFKA-6520:
>>>
>>> There is already KIP-457, and I am not sure this KIP should subsume it.
>>> It somewhat feels orthogonal. I am also not 100% sure if KIP-572
>>> actually helps much, because a thread could be disconnected from the
>>> brokers without throwing any timeout exception: if all tasks are
>>> RUNNING and just polling for new data, but no progress is made because
>>> of a network issue, `poll()` would just return no data but not throw.
>>>
>>>
>>>
>>> @Bruno
>>>
>>>> Wouldn't it be better to specify task.timeout.ms as -1 if no retry
>>>> should be done?
>>>
>>> Interesting idea. Personally, I find `-1` confusing. And it seems
>>> intuitive to me that `0` implies no retries (this seems to be in
>>> alignment with other APIs).
>>>
>>>
>>>
>>> -Matthias
>>>
>>>
>>> On 5/18/20 9:53 AM, Guozhang Wang wrote:
>>>> Hi Matthias,
>>>>
>>>> Sorry for flooding the thread, but with this KIP I feel the design
>>>> scope of https://issues.apache.org/jira/browse/KAFKA-6520 can be
>>>> simplified a lot, and maybe the design can just be piggy-backed as
>>>> part of this KIP, wdyt?
>>>>
>>>> Guozhang
>>>>
>>>>
>>>> On Mon, May 18, 2020 at 9:47 AM Guozhang Wang <wangg...@gmail.com> wrote:
>>>>
>>>>> Hi Matthias,
>>>>>
>>>>> Just to add one more meta comment: for consumer, if it gets a
>>>>> TimeoutException polling records, would it start timing all tasks
>>>>> since that single consumer would affect all tasks? For other blocking
>>>>> calls like `endOffsets()` etc., they are usually also issued on
>>>>> behalf of a batch of tasks, so if that gets a timeout exception
>>>>> should we start ticking all the corresponding tasks as well? Maybe
>>>>> worth clarifying a bit more in the wiki.
>>>>>
>>>>>
>>>>> Guozhang
>>>>>
>>>>> On Mon, May 18, 2020 at 12:26 AM Bruno Cadonna <br...@confluent.io> wrote:
>>>>>
>>>>>> Hi Matthias,
>>>>>>
>>>>>> I am +1 (non-binding) on the KIP.
>>>>>>
>>>>>> Just one final remark: Wouldn't it be better to specify
>>>>>> task.timeout.ms as -1 if no retry should be done? IMO it would make
>>>>>> the config more intuitive, because 0 would not have two possible
>>>>>> meanings (i.e., try once and never try) anymore.
>>>>>>
>>>>>> Best,
>>>>>> Bruno
>>>>>>
>>>>>> On Sat, May 16, 2020 at 7:51 PM Guozhang Wang <wangg...@gmail.com> wrote:
>>>>>>>
>>>>>>> Hello Matthias,
>>>>>>>
>>>>>>> Thanks for the updated KIP; overall I'm +1 on this proposal. Some
>>>>>>> minor comments (I know gmail mixed that again for me, so I'm
>>>>>>> leaving it as a combo for both DISCUSS and VOTE :)
>>>>>>>
>>>>>>> 1) There are some inconsistent statements in the proposal regarding
>>>>>>> what to deprecate: at the beginning it says "We propose to
>>>>>>> deprecate the retries configuration parameter for the producer and
>>>>>>> admin client" but later in compatibility we say "Producer and admin
>>>>>>> client behavior does not change; both still respect retries
>>>>>>> config." My understanding is that we will only deprecate the
>>>>>>> StreamsConfig#retries while not touching
>>>>>>> ProducerConfig/AdminClientConfig#retries, AND we will always
>>>>>>> override the embedded producer / admin retries config to infinity
>>>>>>> so that we never rely on those configs but are always bounded by
>>>>>>> the timeout config. Is that right? If yes, could you clarify this
>>>>>>> in the doc?
>>>>>>>
>>>>>>> 2) We should also document the related behavior change in
>>>>>>> PartitionAssignor that uses AdminClient.
>>>>>>> More specifically, the admin client's retries config is
>>>>>>> piggy-backed inside InternalTopicManager as outer-loop retry logic,
>>>>>>> in addition to the AdminClient's own inner retry loop. There are
>>>>>>> some existing issues like KAFKA-9999 / 10006 that Sophie and Boyang
>>>>>>> have been working on. I exchanged some ideas with them, and
>>>>>>> generally we should consider whether the outer loop of
>>>>>>> InternalTopicManager should just be removed, and whether, if we get
>>>>>>> a TimeoutException, we should just trigger another rebalance, etc.
>>>>>>>
>>>>>>> BTW, as I mentioned in the previous statement, today throwing an
>>>>>>> exception that kills one thread but not the whole instance is still
>>>>>>> an issue for monitoring purposes, but I suppose this is not going
>>>>>>> to be in this KIP but addressed by another KIP, right?
>>>>>>>
>>>>>>>
>>>>>>> Guozhang
>>>>>>>
>>>>>>>
>>>>>>> On Fri, May 15, 2020 at 1:14 PM Boyang Chen <reluctanthero...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Good, good.
>>>>>>>>
>>>>>>>> Read through the discussions; the KIP looks good to me, +1
>>>>>>>> (non-binding).
>>>>>>>>
>>>>>>>> On Fri, May 15, 2020 at 11:51 AM Sophie Blee-Goldman <sop...@confluent.io> wrote:
>>>>>>>>
>>>>>>>>> Called out!
>>>>>>>>>
>>>>>>>>> Seems like gmail struggles with [...] prefixed subjects. You'd
>>>>>>>>> think they would adapt all their practices to conform to the
>>>>>>>>> Apache Kafka mailing list standards, but no!
>>>>>>>>>
>>>>>>>>> +1 (non-binding) by the way
>>>>>>>>>
>>>>>>>>> On Fri, May 15, 2020 at 11:46 AM John Roesler <vvcep...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Boyang,
>>>>>>>>>>
>>>>>>>>>> It is a separate thread, and you have just revealed yourself as
>>>>>>>>>> a gmail user ;)
>>>>>>>>>>
>>>>>>>>>> (Gmail sometimes conflates vote and discuss threads for no
>>>>>>>>>> apparent reason.)
>>>>>>>>>>
>>>>>>>>>> -John
>>>>>>>>>>
>>>>>>>>>> On Fri, May 15, 2020, at 13:39, Boyang Chen wrote:
>>>>>>>>>>> Hey Matthias, should this be on a separate VOTE thread?
>>>>>>>>>>>
>>>>>>>>>>> On Fri, May 15, 2020 at 11:38 AM John Roesler <vvcep...@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks, Matthias!
>>>>>>>>>>>>
>>>>>>>>>>>> I’m +1 (binding)
>>>>>>>>>>>>
>>>>>>>>>>>> -John
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, May 15, 2020, at 11:55, Matthias J. Sax wrote:
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would like to start the vote on KIP-572:
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-572%3A+Improve+timeouts+and+retries+in+Kafka+Streams
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Matthias
>>>>>>>
>>>>>>> --
>>>>>>> -- Guozhang
>>>>>
>>>>>
>>>>> --
>>>>> -- Guozhang
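(To make the describe/create-topic error handling Guozhang proposes earlier
in this thread concrete, a sketch in Java. The Admin API calls are real, but
the surrounding method, the absolute `deadlineMs` parameter -- standing in
for the suggested "half of rebalance.timeout" global timer -- and the
partial-assignment hand-off are assumptions, not the actual
InternalTopicManager code.)

    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;
    import java.util.concurrent.ExecutionException;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.NewTopic;
    import org.apache.kafka.clients.admin.TopicDescription;
    import org.apache.kafka.common.KafkaFuture;
    import org.apache.kafka.common.errors.LeaderNotAvailableException;
    import org.apache.kafka.common.errors.TimeoutException;
    import org.apache.kafka.common.errors.TopicExistsException;
    import org.apache.kafka.common.errors.UnknownTopicOrPartitionException;

    class AssignmentTopicSetup {

        // Returns the topics confirmed before the deadline; the caller would
        // compute a partial assignment for those and set the error code to
        // request another rebalance if any topic is still missing.
        static Set<String> setupTopics(final Admin admin,
                                       final Map<String, NewTopic> requiredTopics,
                                       final long deadlineMs) throws InterruptedException {
            final Set<String> toValidate = new HashSet<>(requiredTopics.keySet());
            final Set<String> toCreate = new HashSet<>();

            while (!toValidate.isEmpty() && System.currentTimeMillis() < deadlineMs) {
                // Describe-topic step: find out which topics are missing.
                final Map<String, KafkaFuture<TopicDescription>> described =
                    admin.describeTopics(toValidate).values();
                for (final Map.Entry<String, KafkaFuture<TopicDescription>> entry
                         : described.entrySet()) {
                    try {
                        entry.getValue().get(); // exists; num.partitions could be validated here
                        toValidate.remove(entry.getKey());
                    } catch (final ExecutionException error) {
                        final Throwable cause = error.getCause();
                        if (cause instanceof UnknownTopicOrPartitionException) {
                            toCreate.add(entry.getKey());  // add to the to-be-created list
                        } else if (cause instanceof LeaderNotAvailableException) {
                            // exists, metadata not ready: do not create, retry in outer loop
                        } else if (cause instanceof TimeoutException) {
                            break;                         // break; retry in the outer loop
                        } else {
                            throw new IllegalStateException("fatal error", cause);
                        }
                    }
                }

                // Create-topic step.
                final Set<NewTopic> newTopics = new HashSet<>();
                for (final String name : toCreate) {
                    newTopics.add(requiredTopics.get(name));
                }
                toCreate.clear();
                if (!newTopics.isEmpty()) {
                    final Map<String, KafkaFuture<Void>> createResults =
                        admin.createTopics(newTopics).values();
                    for (final KafkaFuture<Void> future : createResults.values()) {
                        try {
                            future.get();
                        } catch (final ExecutionException error) {
                            final Throwable cause = error.getCause();
                            if (cause instanceof TopicExistsException) {
                                // re-validate num.partitions in the next outer-loop pass
                            } else if (cause instanceof TimeoutException) {
                                break;                     // break; retry in the outer loop
                            } else {
                                throw new IllegalStateException("fatal error", cause);
                            }
                        }
                    }
                }
            }

            final Set<String> confirmed = new HashSet<>(requiredTopics.keySet());
            confirmed.removeAll(toValidate);
            return confirmed;
        }
    }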