[ 
https://issues.apache.org/jira/browse/KAFKA-19873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18036753#comment-18036753
 ] 

Andrew Schofield commented on KAFKA-19873:
------------------------------------------

This is interesting but it needs to be done very, very carefully. There's lots 
of scope for unintended consequences I think.

> Add explicit liveness check for transactional producers
> -------------------------------------------------------
>
>                 Key: KAFKA-19873
>                 URL: https://issues.apache.org/jira/browse/KAFKA-19873
>             Project: Kafka
>          Issue Type: Improvement
>          Components: clients, producer 
>            Reporter: Matthias J. Sax
>            Priority: Major
>              Labels: needs-kip
>
> The producer does not have an explicit liveness check like the consumer, 
> which sends periodic heartbeats if it's part of a consumer group. Because 
> there is no "producer group" this is fine in general.
> However, for transactional producers, the missing liveness check has 
> significant downsides (for example KAFKA-19853).
> The problem is that there is only an indirect liveness check via the 
> `transaction.timeout.ms` config. The purpose of `transaction.timeout.ms` is 
> to avoid head-of-line blocking for read-committed consumers, though, and it's 
> just a side effect that a crashed producer eventually hits this timeout, 
> too. The transaction timeout by itself is not a liveness check.
> For the Kafka Streams case in particular, to react to a failed producer more 
> quickly, we set an aggressive default transaction timeout of only 10 seconds. 
> This lets the broker abort a transaction quickly, so that some other 
> consumer can fetch offsets quickly after a rebalance (otherwise, fetching 
> offsets is blocked on an open TX – cf. 
> [KIP-447|https://cwiki.apache.org/confluence/display/KAFKA/KIP-447%3A+Producer+scalability+for+exactly+once+semantics]).
> However, in many cases (not limited to Kafka Streams), it is desirable to 
> allow transactions to take more time, but this implies that producer error 
> detection and failover are slowed down. For this reason, users are hesitant 
> to increase the transaction timeout, which may backfire when transactions 
> get aborted too aggressively, causing unwanted errors (this is 
> particularly problematic for Kafka Streams, because we can't re-use the 
> previous `transactional.id` to fence off a pending TX proactively, as we 
> moved from the EOSv1 to the EOSv2 implementation).
> Thus, for transactional producers, it would make sense to follow the consumer 
> model, which allows for aggressive hard-failure detection via 
> `session.timeout.ms` plus longer processing loops via `max.poll.interval.ms`, 
> decoupling the liveness check from the "max processing" time. We propose to 
> add a new producer `session.timeout.ms` plus a new heartbeat RPC for 
> transactional producers. If a tx-producer has a hard failure and stops 
> sending heartbeats to the broker-side transaction coordinator, the 
> coordinator can abort the TX right away without waiting for the TX timeout. 
> This allows users to configure a low session timeout in combination with a 
> larger transaction timeout, providing swift hard-error detection plus longer 
> transaction times.
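
To make the proposed decoupling concrete, here is a rough sketch of the 
decision a coordinator might make for an open transaction. This is only an 
illustration under assumptions: the class, enum, and method names are 
hypothetical, no such heartbeat RPC or producer `session.timeout.ms` exists in 
Kafka today, and the real design would be settled in a KIP.

```java
// Hypothetical sketch only: none of these names exist in Apache Kafka.
final class TxLivenessSketch {

    enum Decision { KEEP_OPEN, ABORT_SESSION_EXPIRED, ABORT_TX_TIMEOUT }

    /**
     * Decides what to do with an open transaction at time nowMs.
     * sessionTimeoutMs models the proposed producer session timeout;
     * txTimeoutMs models the existing transaction.timeout.ms.
     */
    static Decision check(long nowMs, long txStartMs, long lastHeartbeatMs,
                          long sessionTimeoutMs, long txTimeoutMs) {
        // Liveness: no heartbeat within the session timeout means the
        // producer is presumed dead, so abort without waiting further.
        if (nowMs - lastHeartbeatMs >= sessionTimeoutMs) {
            return Decision.ABORT_SESSION_EXPIRED;
        }
        // Max transaction duration: a live producer may still hold the
        // transaction open for too long and block read-committed consumers.
        if (nowMs - txStartMs >= txTimeoutMs) {
            return Decision.ABORT_TX_TIMEOUT;
        }
        return Decision.KEEP_OPEN;
    }
}
```

With a 3 second session timeout and a 60 second transaction timeout, a 
producer that stops heartbeating would be detected after about 3 seconds, 
while a healthy producer could keep its transaction open for up to 60 seconds.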



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
