[
https://issues.apache.org/jira/browse/KAFKA-19873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18036676#comment-18036676
]
PoAn Yang commented on KAFKA-19873:
-----------------------------------
Hi [~mjsax], the proposal looks good. If you haven't started working on it, I
can help write a KIP for it. One question: would this producer heartbeat
mechanism only be enabled while the producer has an open transaction? If
heartbeats are sent when there is no transaction, they may waste bandwidth.
> Add explicit liveness check for transactional producers
> -------------------------------------------------------
>
> Key: KAFKA-19873
> URL: https://issues.apache.org/jira/browse/KAFKA-19873
> Project: Kafka
> Issue Type: Improvement
> Components: clients, producer
> Reporter: Matthias J. Sax
> Priority: Major
> Labels: needs-kip
>
> The producer does not have an explicit liveness check like the consumer,
> which sends periodic heartbeats if it's part of a consumer group. Because
> there is no "producer group" this is fine in general.
> However, for transactional producers, the missing liveness check has
> significant downsides (for example KAFKA-19853).
> The problem is that there is only an indirect liveness check via the
> `transaction.timeout.ms` config. The purpose of `transaction.timeout.ms` is
> to avoid head-of-line blocking for read-committed consumers, though, and it's
> just a side effect that a crashed producer eventually hits this timeout, too.
> The transaction timeout by itself is not a liveness check.
> For the Kafka Streams case in particular, to react to a failed producer more
> quickly, we set an aggressive default transaction timeout of only 10 seconds.
> This allows the broker to abort a transaction quickly, so that some other
> consumer can fetch offsets quickly after a rebalance (otherwise, fetching
> offsets is blocked on an open TX; cf.
> [KIP-447|https://cwiki.apache.org/confluence/display/KAFKA/KIP-447%3A+Producer+scalability+for+exactly+once+semantics]).
> However, in many cases (not limited to Kafka Streams), it is desirable to
> allow transactions to take more time, but this implies that producer error
> detection and failover get slowed down. For this reason, users are hesitant
> to increase the transaction timeout, which may backfire when TXs get aborted
> too aggressively, causing unwanted errors (it's particularly problematic for
> Kafka Streams, because we can't re-use the previous `transaction.id` to fence
> off a pending TX proactively, as we moved off the EOSv1 implementation to
> EOSv2).
> Thus, for transactional producers, it would make sense to follow the consumer
> model, which allows for aggressive hard failure detection via
> `session.timeout.ms` plus longer processing loops via `max.poll.interval.ms`,
> decoupling the liveness check from the "max processing" time. We propose to
> add a new producer `session.timeout.ms` plus a new heartbeat RPC for
> transactional producers. If a tx-producer has a hard failure and stops
> sending heartbeats to the broker-side transaction coordinator, the
> coordinator can abort the TX right away without needing to wait for the TX
> timeout. This allows users to configure a low session timeout in combination
> with a larger transaction timeout, providing swift hard error detection plus
> longer transaction times.
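
To make the two-deadline idea in the description concrete, here is a minimal
sketch of how a coordinator might track both timeouts. It is a toy model, not
Kafka code: the class `TxLivenessSketch`, its methods, and the 10 s / 60 s
values are hypothetical, and no heartbeat RPC exists yet. It also assumes, as
raised in the comment above, that heartbeats only matter while a transaction
is open.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy coordinator with two independent deadlines per open transaction. */
final class TxLivenessSketch {

    /** Hypothetical per-transaction state; field names are illustrative. */
    private static final class OpenTx {
        final long startedAtMs;
        long lastHeartbeatMs;

        OpenTx(long nowMs) {
            this.startedAtMs = nowMs;
            this.lastHeartbeatMs = nowMs;
        }
    }

    private final long sessionTimeoutMs;     // proposed producer session.timeout.ms
    private final long transactionTimeoutMs; // existing transaction.timeout.ms
    private final Map<String, OpenTx> openTxs = new HashMap<>();

    TxLivenessSketch(long sessionTimeoutMs, long transactionTimeoutMs) {
        this.sessionTimeoutMs = sessionTimeoutMs;
        this.transactionTimeoutMs = transactionTimeoutMs;
    }

    /** Liveness tracking starts only when a transaction is opened. */
    void beginTransaction(String transactionalId, long nowMs) {
        openTxs.put(transactionalId, new OpenTx(nowMs));
    }

    /** Returns false if there is no open transaction, i.e. the heartbeat is ignored. */
    boolean heartbeat(String transactionalId, long nowMs) {
        OpenTx tx = openTxs.get(transactionalId);
        if (tx == null) {
            return false;
        }
        tx.lastHeartbeatMs = nowMs;
        return true;
    }

    /** Commit or abort by the producer: nothing left to track. */
    void endTransaction(String transactionalId) {
        openTxs.remove(transactionalId);
    }

    /** Aborts and returns the ids whose session or transaction deadline has passed. */
    List<String> abortExpired(long nowMs) {
        List<String> aborted = new ArrayList<>();
        for (Map.Entry<String, OpenTx> entry : openTxs.entrySet()) {
            OpenTx tx = entry.getValue();
            boolean sessionExpired = nowMs - tx.lastHeartbeatMs >= sessionTimeoutMs;
            boolean txExpired = nowMs - tx.startedAtMs >= transactionTimeoutMs;
            if (sessionExpired || txExpired) {
                aborted.add(entry.getKey());
            }
        }
        for (String id : aborted) {
            openTxs.remove(id);
        }
        return aborted;
    }
}
```

With a 10 second session timeout and a 60 second transaction timeout, a
crashed producer's TX is aborted about 10 seconds after its last heartbeat,
while a live producer that keeps heartbeating can hold its TX open for up to
60 seconds.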
--
This message was sent by Atlassian Jira
(v8.20.10#820010)