[
https://issues.apache.org/jira/browse/KAFKA-20594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18081925#comment-18081925
]
Canxuan Wang commented on KAFKA-20594:
--------------------------------------
Hi Aayush, I read through the description and walked a few code paths in trunk.
A few notes, with the caveat that I cannot identify a root cause from the
evidence in the ticket:
On the idle close: this is by design and on its own it does not kill a
{{{}StreamThread{}}}. {{connections.max.idle.ms}} defaults to {{9 * 60 * 1000}}
on the client, and the code comment in
{{{}ConsumerConfig{}}}/{{{}ProducerConfig{}}} reads "default is set to be a bit
lower than the server default (10 min), to avoid both client and server closing
connection at same time". The {{About to close the idle connection}} log line
you may have seen is emitted at TRACE from
{{{}org.apache.kafka.common.network.Selector{}}}. The next poll re-establishes
the socket. If a {{StreamThread}} did move to {{DEAD}} and the client to
{{{}ERROR{}}}, something other than the idle close is the proximate cause, and
it is being swallowed somewhere in your current logging configuration.
On question 2 (close + re-instantiate): re-creating a {{KafkaStreams}} with the
same {{application.id}} triggers a rebalance, and duplicate-processing risk at
the failure boundary follows your {{{}processing.guarantee{}}}.
{{at_least_once}} (default) means some duplicates are possible across the
committed-but-replayed slice; {{exactly_once_v2}} reduces that to the topology
output. The cycle recovers, but it is heavier than necessary for a single
failed thread.
On question 3: yes, since 2.8 the supported path for a transient per-thread
failure is {{KafkaStreams#setUncaughtExceptionHandler}} returning
{{{}StreamThreadExceptionResponse.REPLACE_THREAD{}}}. The other two enum values
are {{SHUTDOWN_CLIENT}} and {{SHUTDOWN_APPLICATION}} (defined in
{{streams/.../errors/StreamsUncaughtExceptionHandler.java}}). When no handler
is registered, the default response is {{SHUTDOWN_CLIENT}} (see
{{KafkaStreams.java}} where the field is initialised with {{{}t ->
SHUTDOWN_CLIENT{}}}), which matches the symptom of a silent move to
{{{}ERROR{}}}. Registering the handler is additive and changes no behaviour
until a thread actually fails.
On question 4: there are no JVM flags specific to this. The loggers worth
turning up before reproducing are {{{}org.apache.kafka.streams=DEBUG{}}},
{{{}org.apache.kafka.clients.consumer=DEBUG{}}}, and
{{{}org.apache.kafka.clients.NetworkClient=DEBUG{}}}. The {{Selector}}
idle-close path is at TRACE if you want to confirm it directly.
On question 5: {{KafkaStreams#setStateListener}} gives you {{RUNNING →
REBALANCING → PENDING_ERROR → ERROR}} transitions in code, which is useful for
alerting before the client is fully dead. The client-level JMX gauges
{{kafka.streams:type=stream-metrics,client-id=<id>,name=alive-stream-threads}}
and {{failed-stream-threads}} give the same view externally.
One general note: 3.4.0 is past the supported window, and there have been a
number of fixes in the streams shutdown and error-handling paths since. If you
can reproduce on a recent 3.x or 4.x build with the handler and state listener
registered, the resulting JIRA would be considerably easier to act on, and the
captured exception would point at the real cause rather than the idle close.
> Kafka Streams issue
> -------------------
>
> Key: KAFKA-20594
> URL: https://issues.apache.org/jira/browse/KAFKA-20594
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 3.4.0
> Reporter: Aayush Gupta
> Priority: Blocker
>
> Looking for some Kafka Streams expertise on an issue we're investigating.
> Problem: A KafkaStreams-based consumer stops polling after ~15 minutes of
> topic inactivity. The adapter stays alive, but no messages are picked up
> until it's manually restarted. Reproduces on both IBM MQ-backed Kafka and
> Confluent Cloud.
> Suspected cause: Back-to-back expiry of
> [connections.max.idle.ms|http://connections.max.idle.ms/] (client, 9 min
> default) + broker idle timeout (~10 min). Stream thread dies on the next poll
> attempt, KafkaStreams goes to ERROR state silently — no StateListener or
> UncaughtExceptionHandler was registered, so nothing recovers.
> Questions:
> Is this a known pattern with KafkaStreams on idle topics? Any recommended
> approach?
> Is the close() + re-instantiate pattern safe? Any rebalance/duplicate risks?
> For Kafka 2.8+, should we prefer StreamsUncaughtExceptionHandler with
> REPLACE_THREAD instead of a full restart?
> Any input appreciated!
> Not observing any error stacks or exceptions in the logs when the issue
> occurs.
> As part of our investigation, we wanted to check if there are any JVM flags
> or framework-level configurations that can be enabled to extract more
> detailed Kafka framework debug logs, particularly around Kafka Streams and
> consumer lifecycle behavior.
> Given the absence of exceptions or diagnostic logs, it is unclear whether
> further tuning of Kafka consumer properties alone would meaningfully
> alleviate the issue without better visibility into the Kafka Streams
> internals.
> From your experience, do you have any recommendations beyond enabling more
> detailed Kafka/Streams logging—such as known patterns, specific stream-thread
> behaviors, or client-side recovery considerations—that we should be exploring
> in parallel?
> Any guidance would be greatly appreciated.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)