[
https://issues.apache.org/jira/browse/KAFKA-20077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18052487#comment-18052487
]
sanghyeok An commented on KAFKA-20077:
--------------------------------------
[~lianetm] Hi!
Are you planning to handle this yourself, or ask someone specific to look into
it? If not, would it be okay if I take a look?
> Producer using transactions v1 could hang on flush upon retriable failures
> adding partitions to tx
> ---------------------------------------------------------------------------------------------------
>
> Key: KAFKA-20077
> URL: https://issues.apache.org/jira/browse/KAFKA-20077
> Project: Kafka
> Issue Type: Bug
> Components: clients, producer
> Reporter: Lianet Magrans
> Priority: Major
>
> We've seen some occurrences of producer.flush hanging indefinitely in
> situations where a topic is deleted and the producer is using transactions v1
> (not using 2pc).
> In the case where the producer has records in the buffer, and the topic
> deletion happens right before adding the first partition to the transaction,
> we could fall into a loop where AddPartitionsToTx fails with a retriable
> error and is continuously retried. In this case, none of the timeouts
> related to send, transactions or request seem to apply:
> * transaction.timeout.ms -> not applied because no partition has been added
> yet
> * delivery.timeout.ms -> not applied because the client does not attempt
> sending (where batch expiration applies) while it is handling a
> transactional request (i.e. AddPartitionsToTx); see the early return here:
> [https://github.com/apache/kafka/blob/3d267d45369818c804ed49c56e9ae405e28b234c/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L333-L335]
> * request.timeout.ms -> not applied because it's not really a request
> failure; the high-level operation to add partitions is retried.
> * default.api.timeout.ms -> does not apply to the producer.flush API by
> design (or to any produce request really)
> Client handling of retriable errors when adding partitions to tx (this would
> be the case of UNKNOWN_TOPIC_OR_PARTITION when a topic is deleted):
> [https://github.com/apache/kafka/blob/3d267d45369818c804ed49c56e9ae405e28b234c/clients/src/main/java/org/apache/kafka/clients/producer/internals/TransactionManager.java#L1603-L1605]
> This only affects producers using tx v1, and it's solved with tx v2
> (partitions are not added to the tx separately, so the delivery timeout is
> checked on send and applied, which unblocks the flush operation).
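
The loop described above can be illustrated with a toy model. This is only a sketch and not Kafka client code: the class name, the Result enum, both methods and the simulated clock are hypothetical, and the deadline variant only shows what a bound on the retries would look like in the model. It is not the fix adopted for this issue.

```java
import java.util.function.Supplier;

// Toy model of the retry behavior described in the issue. Nothing here is a
// Kafka API; all names are hypothetical and chosen for illustration only.
public class AddPartitionsRetrySketch {

    /** Stand-in for the outcome of one AddPartitionsToTx attempt. */
    enum Result { OK, RETRIABLE_ERROR }

    /**
     * Retries on every retriable error without consulting any timeout.
     * safetyCap exists only so this toy terminates; the scenario in the
     * issue has no such cap. Returns the attempt number that succeeded,
     * or -1 if the cap was reached.
     */
    static int retryWithoutDeadline(Supplier<Result> broker, int safetyCap) {
        for (int attempt = 1; attempt <= safetyCap; attempt++) {
            if (broker.get() == Result.OK) {
                return attempt;
            }
            // Retriable error: go around again, no deadline is checked.
        }
        return -1;
    }

    /**
     * Same loop, but bounded by a deadline on a simulated clock that
     * advances by backoffMs after each failed attempt. Returns the attempt
     * number that succeeded, or the negated attempt count on giving up.
     */
    static int retryWithDeadline(Supplier<Result> broker, long backoffMs, long deadlineMs) {
        long elapsedMs = 0;
        int attempts = 0;
        while (elapsedMs < deadlineMs) {
            attempts++;
            if (broker.get() == Result.OK) {
                return attempts;
            }
            elapsedMs += backoffMs; // simulated backoff, no real sleeping
        }
        return -attempts;
    }

    public static void main(String[] args) {
        // A deleted topic keeps producing a retriable error on every attempt.
        Supplier<Result> deletedTopic = () -> Result.RETRIABLE_ERROR;
        System.out.println(retryWithoutDeadline(deletedTopic, 1000));
        System.out.println(retryWithDeadline(deletedTopic, 100, 500));
    }
}
```

In the first method only the artificial cap ends the loop, which mirrors why flush never returns in the reported scenario; the second stops once the simulated deadline passes.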