[
https://issues.apache.org/jira/browse/KAFKA-14548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Greg Harris resolved KAFKA-14548.
---------------------------------
Resolution: Duplicate
> Stable streams applications stall due to infrequent restoreConsumer polls
> -------------------------------------------------------------------------
>
> Key: KAFKA-14548
> URL: https://issues.apache.org/jira/browse/KAFKA-14548
> Project: Kafka
> Issue Type: Bug
> Components: streams
> Reporter: Greg Harris
> Priority: Major
>
> We have observed behavior with Streams where otherwise healthy applications
> stall and become unable to process data after a rebalance
> (https://issues.apache.org/jira/browse/KAFKA-13405). The root cause is that a
> restoreConsumer can be partitioned from the Kafka cluster with stale
> metadata while the mainConsumer remains healthy with up-to-date metadata.
> This is due to both an issue in Streams and an issue in the consumer logic.
> In StoreChangelogReader, a long-lived restoreConsumer is kept instantiated
> while the streams app is running. This consumer is only `poll()`ed when the
> ChangelogReader::restore method is called and at least one changelog is in
> the RESTORING state. This may be very infrequent if the streams app is stable.
> This is an anti-pattern, as frequent poll()s are expected to keep Kafka
> consumers in contact with the Kafka cluster. Infrequent polls are considered
> failures from the perspective of the consumer API. From the [official Kafka
> Consumer
> documentation|https://kafka.apache.org/33/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html]:
> {noformat}
> The poll API is designed to ensure consumer liveness.
> ...
> So to stay in the group, you must continue to call poll.
> ...
> The recommended way to handle these cases [where the main thread is not ready
> for more data] is to move message processing to another thread, which allows
> the consumer to continue calling poll while the processor is still working.
> ...
> Note also that you will need to pause the partition so that no new records
> are received from poll until after thread has finished handling those
> previously returned.{noformat}
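The pause-then-poll pattern the documentation describes can be sketched with a toy, stdlib-only model (`FakeConsumer`, its `heartbeats` counter, and the worker thread are illustrative assumptions, not real Kafka APIs): processing moves to another thread, the partition is paused, and poll() keeps running as the liveness signal.

```python
import threading
import time
import queue

class FakeConsumer:
    """Illustrative stand-in for a Kafka consumer: poll() both fetches
    records and acts as the liveness signal; pause() stops record
    delivery without stopping heartbeats."""
    def __init__(self, records):
        self._records = list(records)
        self._paused = False
        self.heartbeats = 0  # how many times poll() kept us "in the group"

    def pause(self):
        self._paused = True

    def resume(self):
        self._paused = False

    def poll(self):
        self.heartbeats += 1           # every poll() counts as liveness
        if self._paused or not self._records:
            return []                  # paused: no new records, but still alive
        return [self._records.pop(0)]

def run(consumer, work_queue, processed):
    # Processing happens on another thread so the poll loop never blocks.
    def worker():
        while True:
            rec = work_queue.get()
            if rec is None:
                return
            time.sleep(0.01)           # simulate slow processing
            processed.append(rec)
            consumer.resume()          # ready for more records

    t = threading.Thread(target=worker)
    t.start()
    while len(processed) < 3:
        for rec in consumer.poll():
            consumer.pause()           # no new records until worker catches up
            work_queue.put(rec)
        time.sleep(0.001)              # keep polling regardless of progress
    work_queue.put(None)
    t.join()

consumer = FakeConsumer(["a", "b", "c"])
processed = []
run(consumer, queue.Queue(), processed)
print(processed, consumer.heartbeats)
```

Because the poll loop never waits on processing, the heartbeat count far exceeds the record count; a loop that only polls when there is restore work to do cannot make that guarantee.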
> With the current behavior, it is expected that the restoreConsumer will
> regularly fall out of the group and be considered failed, even when the rest
> of the application is running exactly as intended.
> This is not normally an issue, as falling out of the group is easily repaired
> by rejoining during the next poll. It does mean slightly higher latency when
> a restore is eventually needed, but that does not appear to be a major
> concern at this time.
> This does become an issue when other, deeper assumptions about the usage of
> Kafka clients are violated. Relevant to this issue, the client metadata
> management logic assumes that regular polling will take place, and
> piggy-backs metadata updates on the regular poll call. Without a regular
> poll, the periodic metadata update cannot be performed, and the consumer
> violates its own `metadata.max.age.ms` configuration. This leaves the
> restoreConsumer with much older metadata that may contain none of the
> currently live brokers, partitioning it from the cluster.
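The staleness mechanism can be illustrated with a toy model (pure stdlib; `IdleConsumer` and the intervals are assumptions for illustration, not the real client internals) in which the metadata-age check only ever runs inside poll():

```python
# Toy model: metadata refresh is piggy-backed on poll(), so a consumer
# that polls less often than metadata.max.age.ms can never honor it.
METADATA_MAX_AGE_MS = 5 * 60 * 1000   # default metadata.max.age.ms

class IdleConsumer:
    def __init__(self):
        self.metadata_updated_at_ms = 0

    def poll(self, now_ms):
        # The real client checks metadata age here and refreshes if due;
        # if poll() is never called, this code simply never runs.
        if now_ms - self.metadata_updated_at_ms >= METADATA_MAX_AGE_MS:
            self.metadata_updated_at_ms = now_ms

    def metadata_age_ms(self, now_ms):
        return now_ms - self.metadata_updated_at_ms

consumer = IdleConsumer()
poll_interval_ms = 60 * 60 * 1000     # restore() only drives a poll hourly
worst_age = 0
# Simulate six stable hours, checking metadata age every second.
for now in range(0, 6 * 60 * 60 * 1000, 1000):
    if now % poll_interval_ms == 0:
        consumer.poll(now)
    worst_age = max(worst_age, consumer.metadata_age_ms(now))

# The observed metadata age far exceeds the configured maximum.
print(worst_age, METADATA_MAX_AGE_MS)
```

With hourly polls, the metadata view is allowed to grow nearly an hour old, roughly twelve times the configured five-minute maximum; on a cluster that rolls its brokers in that window, the stale view may list no live broker at all.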
> Alleviating this failure mode does not _require_ changing Streams' polling
> behavior, as solutions covering all clients have been considered
> (https://issues.apache.org/jira/browse/KAFKA-3068 and its family of
> duplicate issues).
> However, as a tactical fix for this issue, and one which does not require a
> KIP changing the behavior of {_}every Kafka client{_}, we should consider
> changing the restoreConsumer poll behavior to bring it closer to the expected
> happy path of at least one poll() every max.poll.interval.ms.
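One shape such a tactical fix could take, sketched as a stdlib-only model (a sketch under assumptions, not the actual StoreChangelogReader change; `ChangelogReaderSketch` and `maybe_keep_alive` are hypothetical names): the restore path records when it last polled, and issues an empty keep-alive poll whenever the gap approaches the deadline, even with no changelog in RESTORING state.

```python
MAX_POLL_INTERVAL_MS = 300_000  # the restore consumer's max.poll.interval.ms

class ChangelogReaderSketch:
    """Hypothetical sketch: poll the restore consumer on a deadline even
    when there is nothing to restore, so it stays in contact with the
    cluster (liveness and piggy-backed metadata refresh)."""
    def __init__(self, consumer, now_ms):
        self.consumer = consumer
        self.last_poll_ms = now_ms

    def restore(self, restoring_changelogs, now_ms):
        if restoring_changelogs:
            self.consumer.poll()       # normal restore path
            self.last_poll_ms = now_ms
        else:
            self.maybe_keep_alive(now_ms)

    def maybe_keep_alive(self, now_ms):
        # Poll well before the max.poll.interval.ms deadline expires.
        if now_ms - self.last_poll_ms >= MAX_POLL_INTERVAL_MS // 2:
            self.consumer.poll()       # expected to return no records
            self.last_poll_ms = now_ms

class CountingConsumer:
    def __init__(self):
        self.polls = 0

    def poll(self):
        self.polls += 1
        return []

consumer = CountingConsumer()
reader = ChangelogReaderSketch(consumer, now_ms=0)
# Simulate one stable hour: restore() is driven every second, but no
# changelog is ever in RESTORING state.
for now in range(0, 3_600_000, 1_000):
    reader.restore(restoring_changelogs=[], now_ms=now)
print(consumer.polls)
```

Even with zero restore work, the consumer is polled every two and a half simulated minutes, keeping it comfortably inside both the liveness deadline and `metadata.max.age.ms`.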
> If there is another hidden assumption in the clients that relies on regular
> polling, this tactical fix may also shield users of the Streams library from
> it, reducing the impact of that hidden assumption through defense-in-depth.
> This would also be a backportable fix for Streams users, whereas a fix in
> the consumers would only apply to new consumer versions.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)