[ https://issues.apache.org/jira/browse/KAFKA-20077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18053959#comment-18053959 ]

Lianet Magrans commented on KAFKA-20077:
----------------------------------------

Hi [~chickenchickenlove], sure, we're not actively working on a fix. Just for
context, this is something that is fixed with the v2 tx protocol (this, and
many other transaction-related issues), so upgrading to use it is probably the
best path forward when dealing with this kind of situation. That said, I'm not
sure what a safe fix for this edge case on v1 would really look like. Any
change would have to consider all the timeouts already involved, which would
have to remain respected as they are now. It gets tricky given that some
timeouts are controlled/enforced by the client and others by the broker (where
I expect we don't want to change anything, since it's v2 that already redefines
this path). But it's surely interesting to look into, in case we can think of a
fix that is worth pursuing. Thanks!
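
In the meantime, for applications stuck on v1, a minimal sketch of bounding the
wait externally (a hypothetical helper, not something the producer API
provides): run flush() on a separate thread and stop waiting after a deadline.
Note this only unblocks the caller; the producer itself may still be stuck
internally and typically needs to be abandoned or closed from another thread.

{code:java}
import org.apache.kafka.clients.producer.KafkaProducer;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch only: none of the client timeouts apply in this scenario, so the
// application bounds the wait itself. The stuck flush thread is not cancelled,
// it just stops being waited on.
public class BoundedFlush {
    public static boolean flushWithTimeout(KafkaProducer<?, ?> producer,
                                           long timeout, TimeUnit unit) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<?> flush = executor.submit(producer::flush);
            try {
                flush.get(timeout, unit);
                return true;   // flush completed within the deadline
            } catch (TimeoutException e) {
                return false;  // flush is still blocked (e.g. this issue)
            }
        } finally {
            executor.shutdownNow();
        }
    }
}
{code}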

> Producer using transactions v1 could hang on flush upon retriable failures 
> adding partitions to tx 
> ---------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-20077
>                 URL: https://issues.apache.org/jira/browse/KAFKA-20077
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, producer 
>            Reporter: Lianet Magrans
>            Priority: Major
>
> We've seen some occurrences of producer.flush hanging indefinitely in
> situations where a topic is deleted and the producer is using transactions v1
> (not using 2pc).
> In the case where the producer has records in the buffer, and the topic
> deletion happens right before adding the first partition to the transaction,
> we can fall into a loop where the AddPartitionsToTx request fails with a
> retriable error and is continuously retried. In this case, none of the
> timeouts related to send, transactions or requests seem to apply (see the
> sketch after this list):
>  * transaction.timeout.ms -> not applied because no partition has been added
> yet
>  * delivery.timeout.ms -> not applied because the client does not attempt
> sending (where batch expiration applies) while it's in a transactional
> request (i.e. AddPartitionsToTx); early return here:
> [https://github.com/apache/kafka/blob/3d267d45369818c804ed49c56e9ae405e28b234c/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L333-L335]
>  * request.timeout.ms -> not applied because it's not really a request
> failure; the high-level operation to add partitions is retried.
>  * default.api.timeout.ms -> does not apply to the producer.flush api by
> design (or to any produce request really)
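>
> A minimal reproduction sketch of the scenario above (illustrative topic name,
> bootstrap servers and timeout values; assumes the client/broker are on
> transactions v1, and timing such that the deletion lands before the first
> AddPartitionsToTx succeeds):
> {code:java}
> import org.apache.kafka.clients.admin.Admin;
> import org.apache.kafka.clients.admin.AdminClientConfig;
> import org.apache.kafka.clients.producer.KafkaProducer;
> import org.apache.kafka.clients.producer.ProducerConfig;
> import org.apache.kafka.clients.producer.ProducerRecord;
> import org.apache.kafka.common.serialization.StringSerializer;
>
> import java.util.List;
> import java.util.Map;
> import java.util.Properties;
>
> public class FlushHangRepro {
>     public static void main(String[] args) throws Exception {
>         Properties props = new Properties();
>         props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
>         props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
>         props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
>         props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "flush-hang-demo");
>         // None of these bound the hang, per the list above:
>         props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 30_000);
>         props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 10_000);
>         props.put(ProducerConfig.TRANSACTION_TIMEOUT_CONFIG, 60_000);
>
>         try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
>              Admin admin = Admin.create(Map.<String, Object>of(
>                      AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"))) {
>             producer.initTransactions();
>             producer.beginTransaction();
>             // Record sits in the accumulator; the sender needs to add the
>             // partition to the transaction before it can be sent.
>             producer.send(new ProducerRecord<>("doomed-topic", "k", "v"));
>             // Topic deleted before the first partition is added to the tx.
>             admin.deleteTopics(List.of("doomed-topic")).all().get();
>             // AddPartitionsToTx keeps getting UNKNOWN_TOPIC_OR_PARTITION and
>             // is retried indefinitely, so this never returns.
>             producer.flush();
>         }
>     }
> }
> {code}
>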
> Client handling of retriable errors when adding partitions to tx (this would
> be the case for UNKNOWN_TOPIC_OR_PARTITION when a topic is deleted):
> [https://github.com/apache/kafka/blob/3d267d45369818c804ed49c56e9ae405e28b234c/clients/src/main/java/org/apache/kafka/clients/producer/internals/TransactionManager.java#L1603-L1605]
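>
> Roughly, the shape of that handling is a re-enqueue on retriable errors. A
> simplified, self-contained model (hypothetical names, not the actual
> TransactionManager code) of why this never trips request.timeout.ms: each
> individual request completes promptly, it's the high-level retry loop that
> has no deadline.
> {code:java}
> import java.util.concurrent.TimeUnit;
>
> // Illustrative model only: once the topic is gone, the broker's answer never
> // changes, so the retriable-error path loops forever and flush() stays blocked.
> public class AddPartitionsRetryModel {
>     enum Result { OK, UNKNOWN_TOPIC_OR_PARTITION /* retriable */ }
>
>     // Stand-in for sending AddPartitionsToTx against a deleted topic.
>     static Result sendAddPartitionsToTxn() { return Result.UNKNOWN_TOPIC_OR_PARTITION; }
>
>     public static void main(String[] args) throws InterruptedException {
>         long retryBackoffMs = 100;  // retry.backoff.ms equivalent
>         while (true) {
>             Result result = sendAddPartitionsToTxn(); // each request returns promptly
>             if (result == Result.OK) {
>                 break;
>             }
>             // Retriable error: back off and try again. No transaction, delivery
>             // or request timeout is consulted at this level, so the loop is
>             // unbounded by design of this model.
>             TimeUnit.MILLISECONDS.sleep(retryBackoffMs);
>         }
>     }
> }
> {code}
>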
> This only affects producers using tx v1; it's solved with tx v2 (partitions
> are not added to the tx separately, so the delivery timeout is checked on
> send and applied, unblocking the flush operation).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
