A. Sophie Blee-Goldman created KAFKA-13295:
----------------------------------------------
Summary: Long restoration times for new tasks can lead to
transaction timeouts
Key: KAFKA-13295
URL: https://issues.apache.org/jira/browse/KAFKA-13295
Project: Kafka
Issue Type: Bug
Components: streams
Reporter: A. Sophie Blee-Goldman
Fix For: 3.1.0
In some EOS applications with relatively long restoration times, we've noticed a
series of ProducerFencedExceptions occurring during or immediately after
restoration. The broker logs confirmed that these were caused by transactions
timing out.
In Streams, it turns out we automatically begin a new txn when calling {{send}}
(if there isn’t already one in flight). A {{send}} often occurs outside a
commit during active processing (e.g. writing to the changelog), which leaves
the txn open until the next commit. If a StreamThread has been actively
processing when a rebalance assigns it a new stateful task without revoking any
existing tasks, the thread won’t commit this open txn before it goes back into
the restoration phase to build up state for the new task. The in-flight
transaction is therefore left open throughout restoration, during which the
StreamThread only consumes from the changelog and never commits. The txn is
then vulnerable to timing out whenever restoration takes longer than the
{{transaction.timeout.ms}} configured for the producer client.
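The timing argument above can be sketched as a toy model. This is only an illustration and does not use the Kafka client API: the class name, method names, and the durations (including the 10-second timeout) are hypothetical values chosen for the example. The {{commitBeforeRestore}} flag exists purely to contrast with the case described above, where no commit happens; it is not a description of the eventual fix.

```java
import java.time.Duration;

// Toy model of the timeline described in this ticket. It does NOT call any
// Kafka API; every name and number here is a hypothetical illustration.
public class OpenTxnTimeline {

    // How long the txn stays open: from the first send() that implicitly
    // began it until restoration finishes, unless the thread committed
    // before entering restoration (in which case nothing is left open).
    static Duration openDuration(Duration sinceFirstSend,
                                 Duration restoration,
                                 boolean commitBeforeRestore) {
        return commitBeforeRestore ? Duration.ZERO : sinceFirstSend.plus(restoration);
    }

    // True if the txn has been open for longer than the configured timeout.
    static boolean exceedsTimeout(Duration open, Duration txnTimeout) {
        return open.compareTo(txnTimeout) > 0;
    }

    public static void main(String[] args) {
        Duration txnTimeout = Duration.ofSeconds(10);     // assumed transaction.timeout.ms
        Duration sinceFirstSend = Duration.ofMillis(50);  // assumed time since the changelog write
        Duration restoration = Duration.ofMinutes(2);     // assumed restoration time for the new task

        Duration leftOpen = openDuration(sinceFirstSend, restoration, false);
        Duration committed = openDuration(sinceFirstSend, restoration, true);

        System.out.println("txn left open, times out: " + exceedsTimeout(leftOpen, txnTimeout));
        System.out.println("txn committed first, times out: " + exceedsTimeout(committed, txnTimeout));
    }
}
```

In this model the txn can only outlive the timeout when it is still open on entry to restoration, which matches the sequence of events reported here.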
--
This message was sent by Atlassian Jira
(v8.3.4#803005)