[
https://issues.apache.org/jira/browse/KAFKA-20077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18053959#comment-18053959
]
Lianet Magrans commented on KAFKA-20077:
----------------------------------------
Hi [~chickenchickenlove], sure, we're not actively working on a fix. Just for
context, this is something that is fixed with the v2 tx protocol (this, and
many other transaction-related issues), so upgrading to use it is probably the
best path forward when dealing with this kind of situation. Beyond that, I'm
not sure what would be a safe fix for this edge case on v1, really. Any change
would have to consider all the timeouts already involved, which would have to
remain respected as they are now. It gets tricky given that some timeouts are
controlled/enforced by the client, others by the broker (where I expect we
don't want to change anything around this, since v2 is already the one
redefining this path). But it is surely interesting to look into in case we
can think of a fix that is worth pursuing. Thanks!
> Producer using transactions v1 could hang on flush upon retriable failures
> adding partitions to tx
> ---------------------------------------------------------------------------------------------------
>
> Key: KAFKA-20077
> URL: https://issues.apache.org/jira/browse/KAFKA-20077
> Project: Kafka
> Issue Type: Bug
> Components: clients, producer
> Reporter: Lianet Magrans
> Priority: Major
>
> We've seen some occurrences of producer.flush hanging indefinitely in
> situations where a topic is deleted and the producer is using transactions v1
> (not using 2pc).
> In the case where the producer has records in the buffer, and the topic
> deletion happens right before adding the first partition to the transaction,
> we could fall into a loop where AddPartitionsToTx fails with a retriable
> error and is continuously retried. In this case, none of the timeouts
> related to send, transactions or request seem to apply:
> * transaction.timeout.ms -> not applied because no partition has been added
> yet
> * delivery.timeout.ms -> not applied because the client does not attempt
> sending (where batch expiration applies) while it's in a transactional
> request (i.e. AddPartitionsToTx); see the early return here:
> [https://github.com/apache/kafka/blob/3d267d45369818c804ed49c56e9ae405e28b234c/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L333-L335]
> * request.timeout.ms -> not applied because it's not really a request
> failure; the high-level operation to add partitions is retried.
> * default.api.timeout.ms -> does not apply to the producer.flush API by
> design (or to any produce request, really)
> Client handling of retriable errors when adding partitions to tx (this would
> be the case of UNKNOWN_TOPIC_OR_PARTITION when a topic is deleted):
> [https://github.com/apache/kafka/blob/3d267d45369818c804ed49c56e9ae405e28b234c/clients/src/main/java/org/apache/kafka/clients/producer/internals/TransactionManager.java#L1603-L1605]
> This only affects producers using tx v1, and it's solved with tx v2
> (partitions are not added to the tx separately, so the delivery timeout is
> checked on send and applied, unblocking the flush operation).
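Since none of the timeouts listed above bound producer.flush in this scenario, an application that cannot yet move to tx v2 might choose to enforce its own deadline around the call. The following is only a minimal sketch of that idea and is not something the issue proposes: `BoundedFlush` and `runWithTimeout` are hypothetical names, the helper takes a plain `Runnable` so it does not depend on the Kafka client, and it assumes that interrupting the blocked thread is enough to unblock the call (KafkaProducer.flush is documented to throw InterruptException when interrupted, but that has not been verified against this specific hang).

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper for illustration; not part of the Kafka client API.
final class BoundedFlush {
    private BoundedFlush() {}

    /**
     * Runs a blocking operation on a helper thread and waits at most timeoutMs.
     * Returns true if the operation completed in time, false if it was still
     * blocked at the deadline (the helper thread is then interrupted).
     */
    static boolean runWithTimeout(Runnable blockingOp, long timeoutMs) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "bounded-flush");
            t.setDaemon(true); // do not keep the JVM alive if the operation never returns
            return t;
        });
        Future<?> future = executor.submit(blockingOp);
        try {
            future.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the helper thread
            return false;
        } catch (InterruptedException e) {
            future.cancel(true);
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            return false;
        } catch (ExecutionException e) {
            // The operation itself failed; surface its cause to the caller.
            throw new RuntimeException(e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }
}
```

With a real producer, the call would look like `BoundedFlush.runWithTimeout(producer::flush, 30_000)`. If it returns false, the application would most likely treat the producer as unusable for the current transaction and close it with a bounded `producer.close(Duration)`; how to recover safely from there is application-specific and outside what this sketch covers.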