[jira] [Commented] (KAFKA-6399) Consider reducing "max.poll.interval.ms" default for Kafka Streams

John Roesler (JIRA) Thu, 08 Mar 2018 22:42:36 -0800

    [ 
https://issues.apache.org/jira/browse/KAFKA-6399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392493#comment-16392493
 ]


John Roesler commented on KAFKA-6399:
-------------------------------------

I'm not sure, since I haven't had a lot of time so far to build up 
expectations, but here are a couple of thoughts...

I'm generally a fan of testing your expectations, so if you think the loop 
should be faster than 30s, I'd say go ahead and set it. If it turns out 
to be wrong, we'll learn something new.
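For an application that wants to act on such an expectation today, the timeout can be overridden per application rather than waiting for a new default. A minimal sketch, assuming only `java.util.Properties`: the `consumer.` prefix mirrors `StreamsConfig.CONSUMER_PREFIX`, which scopes a setting to the consumers Kafka Streams creates. The class and method names are hypothetical, and the application id and broker address are placeholders.

```java
import java.util.Properties;

// Hypothetical helper: builds Kafka Streams properties that override the
// consumer's max.poll.interval.ms for this one application.
final class PollIntervalConfig {
    // Same string as StreamsConfig.CONSUMER_PREFIX in the Kafka Streams library.
    static final String CONSUMER_PREFIX = "consumer.";
    static final String MAX_POLL_INTERVAL_MS = "max.poll.interval.ms";

    private PollIntervalConfig() {}

    // applicationId and bootstrapServers are placeholders supplied by the caller.
    static Properties streamsProps(String applicationId, String bootstrapServers, int maxPollIntervalMs) {
        // The consumer config is an int that must be at least 1 ms.
        if (maxPollIntervalMs < 1) {
            throw new IllegalArgumentException("max.poll.interval.ms must be >= 1");
        }
        Properties props = new Properties();
        props.setProperty("application.id", applicationId);
        props.setProperty("bootstrap.servers", bootstrapServers);
        // e.g. 30_000 for the 30s expectation discussed above
        props.setProperty(CONSUMER_PREFIX + MAX_POLL_INTERVAL_MS, Integer.toString(maxPollIntervalMs));
        return props;
    }
}
```

With the Kafka client on the classpath, the same key can be written as `StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG)`, and the resulting properties passed to `new KafkaStreams(topology, props)`.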

The downside of that approach here is that potentially a lot of topologies 
are running with the default, and if 30s is too short, it could cause a lot of 
rebalancing. Each affected user would then have to investigate, discover that 
they need to set this config higher, and tell us so we can adjust the 
default, so the OODA loop isn't very tight.

Plus, the reason to set it lower is to catch runaway applications and attempt 
to recover. So it seems reasonable to ask on what time scale you would be happy 
to see a long-running application detect and recover from runaway code. I think 
in general 5 minutes of backlog won't cause too many problems.

So I guess I fall more into the 5-minute camp, since it seems to me that 
it's likely to still help the 80% for whom 5 minutes is fine, without risking a 
lot of shenanigans if the poll loop takes a little longer than we expect.
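One rough way to make the 30s vs. 5-minute comparison concrete, assuming you can measure the wall-clock gaps between consecutive poll() calls in your own application (via logging or metrics; no such data is given here): a smaller timeout detects stuck threads sooner, but every healthy gap that exceeds it would mark the consumer as failed and trigger a rebalance. The class and method names below are hypothetical, not part of Kafka.

```java
import java.time.Duration;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical helper for weighing candidate max.poll.interval.ms values
// against measured gaps between consecutive poll() calls.
final class PollTimeoutTradeoff {
    private PollTimeoutTradeoff() {}

    // Counts measured gaps longer than the timeout: each one would have made
    // the consumer leave the group even though the application was healthy.
    static long falseExpirations(List<Duration> pollGaps, Duration timeout) {
        return pollGaps.stream().filter(gap -> gap.compareTo(timeout) > 0).count();
    }

    // Smallest candidate that none of the measured gaps exceed, i.e. the
    // fastest runaway detection that would not have caused spurious rebalances.
    static Optional<Duration> smallestSafeTimeout(List<Duration> pollGaps, List<Duration> candidates) {
        return candidates.stream()
                .filter(c -> falseExpirations(pollGaps, c) == 0)
                .min(Comparator.naturalOrder());
    }
}
```

How much headroom to leave above the largest observed gap is a judgment call this sketch does not capture; it only checks past measurements.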

> Consider reducing "max.poll.interval.ms" default for Kafka Streams
> ------------------------------------------------------------------
>
>                 Key: KAFKA-6399
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6399
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>    Affects Versions: 1.0.0
>            Reporter: Matthias J. Sax
>            Assignee: Khaireddine Rezgui
>            Priority: Minor
>
> In Kafka {{0.10.2.1}} we changed the default value of 
> {{max.poll.interval.ms}} for Kafka Streams to {{Integer.MAX_VALUE}}. The 
> reason was that long state restore phases during rebalance could yield 
> "rebalance storms" as consumers dropped out of a consumer group even though 
> they were healthy, because they didn't call {{poll()}} during the state restore phase.
> In versions {{0.11}} and {{1.0}} the state restore logic was improved a lot, 
> and Kafka Streams now calls {{poll()}} even during the restore phase. 
> Therefore, we might consider setting a smaller timeout for 
> {{max.poll.interval.ms}} to detect badly behaving Kafka Streams applications 
> (i.e., targeting user code) that no longer make progress during regular 
> operations.
> The open question is what a good default would be. Maybe the regular 
> consumer default of 30 seconds would be sufficient. During one {{poll()}} 
> roundtrip, we would only call {{restoreConsumer.poll()}} once and restore a 
> single batch of records. This should take far less time than 30 seconds.
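The interleaving described in the issue can be pictured with a toy model: each loop iteration polls the main consumer once and then restores at most one batch, so the time between main polls is bounded by the work of a single batch rather than the whole restore phase. This is a simplified sketch under that assumption, not the Kafka Streams implementation; the class is hypothetical, and the deque of batch sizes stands in for `restoreConsumer.poll()`.

```java
import java.util.Deque;

// Toy model (hypothetical, not Kafka Streams code) of a poll loop that
// interleaves normal polling with state restoration, one batch per iteration.
final class InterleavedRestoreLoop {
    private final Deque<Integer> pendingRestoreBatches; // batch sizes, in records
    int mainPolls = 0;
    long restoredRecords = 0;

    InterleavedRestoreLoop(Deque<Integer> pendingRestoreBatches) {
        this.pendingRestoreBatches = pendingRestoreBatches;
    }

    // One iteration: the main poll keeps the consumer within max.poll.interval.ms,
    // then a single batch (if any remain) is restored.
    void runOnce() {
        mainPolls++; // stands in for the main consumer's poll()
        Integer batch = pendingRestoreBatches.poll(); // stands in for one restoreConsumer.poll()
        if (batch != null) {
            restoredRecords += batch;
        }
    }

    boolean restoring() {
        return !pendingRestoreBatches.isEmpty();
    }
}
```

Driving it with `while (loop.restoring()) loop.runOnce();` performs one main poll per restored batch, which is why the gap between polls stays short even during a long restore.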



