[ 
https://issues.apache.org/jira/browse/KAFKA-14548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17651700#comment-17651700
 ] 

Greg Harris commented on KAFKA-14548:
-------------------------------------

> Note that the JavaDoc you quote is about a consumer that is part of a 
> consumer group. However, the restore consumer is a "stand along" consumer and 
> not part of any group and thus periodic polling is not necessary. There is no 
> consumer group, group management, or heart beating etc.

Yes, I understand that the javadoc is referencing consumer group membership, 
and thanks for pointing out that the restoreConsumer does not make use of a 
consumer group. However, as many if not the majority of consumers do use 
consumer group membership, it is understandable that someone 'forgot' about the 
stand-alone consumers and their use-case when implementing something. I do not 
expect that this metadata refresh bug is the only assumption that will be ever 
made that disproportionately affects the stand-alone consumers. I think that 
the streams library can protect itself from those sorts of bugs that we don't 
even know about yet by conforming to the dominant pattern of use of the 
consumers.

> I am not an expert on the consumer, but I would expect that the restore 
> consumer would refresh its metadata when we use it again if it's cached 
> metadata aged out (for any API call, not just poll()) after a longer pause? 
> Thus, as long as its bootstrap servers are reachable, it should be able to 
> refresh its metadata.

This is not how the consumer has worked, as far as I know, since 2016, when the 
first tickets and discussions about falling back to bootstrap.servers were 
opened. What I described is the current behavior of the consumer, until a KIP 
is passed to change it. This has not received the attention it needed, maybe 
due to people viewing it as more of a failure scenario, rather than something 
that is normally happening on a happy-path execution. When I was reading those 
issues originally, I thought that only a network partition could cause this 
behavior, until I realized it could also be caused by infrequent polling.

> if it's a client issue, we should not put a workaround into Stream to mask 
> the client issue but rather fix the client.

If we were in an incident analysis and using the swiss-cheese-model, both 
clients and streams could be implicated, because either library could have 
taken steps to prevent the outage. Yes, it is sufficient to fix it only in the 
clients, and it is reasonable in the abstract for streams to say it's "not my 
job" to fix it. But in practical terms streams is particularly affected by this 
bug because of its usage patterns, and users of streams are disempowered to fix 
the poll behavior themselves short of forking the project.


I think that we should pursue fixes for both the clients in general and streams 
in particular. We've already requested the first (there's like 8 tickets doing 
so) and this ticket is requesting the latter.

> Stable streams applications stall due to infrequent restoreConsumer polls
> -------------------------------------------------------------------------
>
>                 Key: KAFKA-14548
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14548
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>            Reporter: Greg Harris
>            Priority: Major
>
> We have observed behavior with Streams where otherwise healthy applications 
> stall and become unable to process data after a rebalance 
> (https://issues.apache.org/jira/browse/KAFKA-13405.) The root cause of which 
> is that a restoreConsumer can be partitioned from a Kafka cluster with stale 
> metadata, while the mainConsumer is healthy with up-to-date metadata. This is 
> due to both an issue in streams and an issue in the consumer logic.
> In StoreChangelogReader, a long-lived restoreConsumer is kept instantiated 
> while the streams app is running. This consumer is only `poll()`ed when the 
> ChangelogReader::restore method is called and at least one changelog is in 
> the RESTORING state. This may be very infrequent if the streams app is stable.
> This is an anti-pattern, as frequent poll()s are expected to keep kafka 
> consumers in contact with the kafka cluster. Infrequent polls are considered 
> failures from the perspective of the consumer API. From the [official Kafka 
> Consumer 
> documentation|https://kafka.apache.org/33/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html]:
> {noformat}
> The poll API is designed to ensure consumer liveness.
> ...
> So to stay in the group, you must continue to call poll.
> ...
> The recommended way to handle these cases [where the main thread is not ready 
> for more data] is to move message processing to another thread, which allows 
> the consumer to continue calling poll while the processor is still working.
> ...
> Note also that you will need to pause the partition so that no new records 
> are received from poll until after thread has finished handling those 
> previously returned.{noformat}
> With the current behavior, it is expected that the restoreConsumer will fall 
> out of the group regularly and be considered failed, when the rest of the 
> application is running exactly as intended.
> This is not normally an issue, as falling out of the group is easily repaired 
> by joining the group during the next poll. It does mean that there is 
> slightly higher latency to performing a restore, but that does not appear to 
> be a major concern at this time.
> This does become an issue when other deeper assumptions about the usage of 
> Kafka clients are violated. Relevant to this issue, it is assumed by the 
> client metadata management logic that regular polling will take place, and 
> that the regular poll call can be piggy-backed to initiate a metadata update. 
> Without a regular poll, the regular metadata update cannot be performed, and 
> the consumer violates its own `metadata.max.age.ms` configuration. This leads 
> to the restoreConsumer having a much older metadata containing none of the 
> currently live brokers, partitioning it from the cluster.
> Alleviating this failure mode does not _require_ the streams' polling 
> behavior to change, as solutions for all clients have been considered 
> (https://issues.apache.org/jira/browse/KAFKA-3068 and that family of 
> duplicate issues).
> However, as a tactical fix for the issue, and one which does not require a 
> KIP changing the behavior of {_}every kafka client{_}, we should consider 
> changing the restoreConsumer poll behavior to bring it closer to the expected 
> happy-path of at least one poll() every poll.interval.ms.
> If there is another hidden assumption of the clients that relies on regular 
> polling, then this tactical fix may prevent users of the streams library from 
> being affected, reducing the impact of that hidden assumption through 
> defense-in-depth.
> This would also be a backport-able fix for streams users, instead of a fix to 
> the consumers which would only apply to new versions of the consumers.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to