[ https://issues.apache.org/jira/browse/KAFKA-6399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392493#comment-16392493 ]
John Roesler commented on KAFKA-6399:
-------------------------------------

I'm not sure, since I haven't had a lot of time so far to build up expectations, but here are a couple of thoughts...

I'm generally a fan of exercising your expectations, so if you think the loop should be faster than 30s, then I'd say go ahead and set it. If it turns out to be wrong, we'll learn something new.

The con to this viewpoint in this case is that potentially a lot of topologies are running with the default, and if 30s is too short, it could cause a lot of rebalancing. Then each affected person would have to investigate it, find out they need to set this config higher, and then tell us so we can adjust the default, so the OODA loop isn't very tight.

Plus, the reason to set it lower is to catch runaway applications and attempt to recover. So it seems reasonable to ask on what time scale you would be happy to see a long-running application detect and recover from runaway code. I think in general 5 minutes of backup won't cause too many problems.

So I guess I'm falling more in the 5 minute camp, since it seems to me that it's likely to still help the 80% for whom 5 minutes is fine, without risking a lot of shenanigans in case the poll loop takes a little longer than we expect.

> Consider reducing "max.poll.interval.ms" default for Kafka Streams
> ------------------------------------------------------------------
>
>                 Key: KAFKA-6399
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6399
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>    Affects Versions: 1.0.0
>            Reporter: Matthias J. Sax
>            Assignee: Khaireddine Rezgui
>            Priority: Minor
>
> In Kafka {{0.10.2.1}} we changed the default value of {{max.poll.interval.ms}} for Kafka Streams to {{Integer.MAX_VALUE}}.
> The reason was that long state restore phases during rebalance could yield "rebalance storms", as consumers drop out of a consumer group even if they are healthy because they didn't call {{poll()}} during the state restore phase.
> In versions {{0.11}} and {{1.0}} the state restore logic was improved a lot, and Kafka Streams now calls {{poll()}} even during the restore phase. Therefore, we might consider setting a smaller timeout for {{max.poll.interval.ms}} to detect badly behaving Kafka Streams applications (i.e., targeting user code) that don't make progress any more during regular operations.
> The open question is what a good default might be. Maybe the actual consumer default of 30 seconds would be sufficient. During one {{poll()}} roundtrip, we would only call {{restoreConsumer.poll()}} once and restore a single batch of records. This should take way less time than 30 seconds.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
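
Whatever default is chosen, an application can set {{max.poll.interval.ms}} explicitly rather than rely on it. Below is a minimal sketch of that, assuming the 5 minute value from the comment above. The class and method names, the application id, and the bootstrap address are hypothetical placeholders, and the config keys are written as plain strings so the snippet compiles without the Kafka jars.

```java
import java.time.Duration;
import java.util.Properties;

// Hypothetical helper, not part of Kafka: builds a Streams-style config
// with an explicit max.poll.interval.ms instead of relying on the default.
final class PollIntervalExample {

    // Plain string key for the consumer setting discussed in this issue.
    static final String MAX_POLL_INTERVAL_MS = "max.poll.interval.ms";

    static Properties streamsProps(String applicationId,
                                   String bootstrapServers,
                                   Duration maxPollInterval) {
        long millis = maxPollInterval.toMillis();
        // The setting is an int number of milliseconds, so reject values
        // that are non-positive or do not fit in an int.
        if (millis <= 0 || millis > Integer.MAX_VALUE) {
            throw new IllegalArgumentException(
                "max poll interval out of range: " + maxPollInterval);
        }
        Properties props = new Properties();
        props.setProperty("application.id", applicationId);
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty(MAX_POLL_INTERVAL_MS, Long.toString(millis));
        return props;
    }

    public static void main(String[] args) {
        // Placeholder application id and broker address.
        Properties props = streamsProps(
            "my-streams-app", "localhost:9092", Duration.ofMinutes(5));
        System.out.println(
            MAX_POLL_INTERVAL_MS + "=" + props.getProperty(MAX_POLL_INTERVAL_MS));
    }
}
```

In a real application the resulting {{Properties}} would be passed to the {{KafkaStreams}} constructor together with the topology. Passing {{Duration.ofSeconds(30)}} instead would try out the shorter timeout proposed in the issue description.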