Shekhar Prasad Rajak created KAFKA-20381:
--------------------------------------------
Summary: Producer Transaction Session Heartbeat
Key: KAFKA-20381
URL: https://issues.apache.org/jira/browse/KAFKA-20381
Project: Kafka
Issue Type: New Feature
Components: group-coordinator, offset manager, producer
Reporter: Shekhar Prasad Rajak
When a streaming processing engine (Flink, Spark Structured Streaming, Beam, or
any custom exactly-once Kafka producer) opens a Kafka transaction and then
crashes without recovering, the transaction remains in ONGOING state on the
broker. This blocks the Last Stable Offset (LSO) for all read_committed
consumers until the broker's transaction.timeout.ms expires.
The current timeout mechanism has a fundamental design flaw:
transaction.timeout.ms serves two conflicting purposes:
* Correctness bound — "abort if this transaction runs longer than X" (data
safety)
* Liveness signal — "abort if the producer is dead" (operational safety)
*
Operators set transaction.timeout.ms high (often 15+ minutes) to accommodate
long checkpoint intervals and GC pauses. This means a dead producer can block
all downstream consumers for up to 15 minutes.
Things can go worse when streaming engine is responsible to end transaction but
crashed:
{{Phase 1: WRITE
engine.beginTransaction()
engine.send(records) → AddPartitionsToTxn RPC → ONGOING on broker
txnStartTimestamp = T0
Phase 2: BARRIER (checkpoint/micro-batch boundary)
engine.prepareCommit() → NO RPC to Kafka (client-side flag only)
Kafka broker: still ONGOING, no signal
Phase 3: COMMIT
engine.commitTransaction() → EndTxn(COMMIT) → broker completes transaction
}}
[ENGINE CRASH between Phase 1 and Phase 3]
→ No EndTxn ever arrives
→ Broker sees silence, interprets it as... nothing
→ LSO blocked for up to transaction.timeout.ms
After AddPartitionsToTxn is sent and data writing completes, no more RPCs are
sent until EndTxn. The broker has no way to distinguish between:
* A healthy producer preparing to commit (checkpoint in progress — could take 5
min)
* A dead producer that will never send EndTxn.
and this situation will block all downstream consumers.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)