Matthias J. Sax created KAFKA-19873:
---------------------------------------
Summary: Add explicit liveness check for transactional producers
Key: KAFKA-19873
URL: https://issues.apache.org/jira/browse/KAFKA-19873
Project: Kafka
Issue Type: Improvement
Components: clients, producer
Reporter: Matthias J. Sax
The producer does not have an explicit liveness check like the consumer, which
sends periodic heartbeats if it's part of a consumer group. Because there is no
"producer group" this is fine in general.
However, for transactional producers, the missing liveness check has quite some
downsides (for example KAFKA-19853).
The problem is, that there is only an indirect liveness check via
`transaction.timeout.ms` config. The purpose of `transaction.timeout.ms` is to
avoid head-of-line blocking for read-committed consumers though, and it's just
a side effect that a crashed producer does also hit this timeout eventually,
too. The transaction timeout by itself, is not a liveness check.
For the Kafka Streams case in particular, to react to a failed producers more
quickly, we set an aggressive default transaction timeout of only 10 seconds,
allowing the broker to abort a transaction quickly, allowing some other
consumer to fetch offset quickly after a rebalance (otherwise, fetching offset
is blocked on an open TX – cf
[KIP-447|https://cwiki.apache.org/confluence/display/KAFKA/KIP-447%3A+Producer+scalability+for+exactly+once+semantics]).
However, in many cases (not limited to Kafka Streams), it is desirable to
actually allow transaction to take more time, but this implies that the
producer error detection and failover mechanism gets slowed down. For this
reason, users are hesitant to increase the transaction timeout, what may fire
back by getting TX aborted too aggressively causing unwanted errors (it's
particularly problematic for Kafka Streams, because we can't re-use previous
`transaction.id` to fence off a pending TX pro-actively, as we moved off EOSv1
to EOSv2 implementation).
Thus, for transactional producers, it would make sense to follow the consumer
model, which allows for aggressive hard failure detection via
`session.timeout.ms` plus longer processing loops via `max.poll.interval.ms`
decoupling liveness check and "max processing" time. – We propose to add a new
producer `session.timeout.ms` plus a new heartbeat RPC for transactional
producers. If a tx-producer has a hard failure and stops sending heartbeats to
the broker side transaction coordinator, the coordinator can abort the TX right
away without the need to wait for the TX timeout. This allows uses to configure
a low session timeout in combination with a larger transaction timeout,
providing swift hard error detection plus longer transaction times.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)