[jira] [Commented] (KAFKA-13295) Long restoration times for new tasks can lead to transaction timeouts

Guozhang Wang (Jira) Tue, 14 Sep 2021 15:22:05 -0700


    [ 
https://issues.apache.org/jira/browse/KAFKA-13295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17415195#comment-17415195
 ]


Guozhang Wang commented on KAFKA-13295:
---------------------------------------

As we discussed for a potential near-term fix, one possibility is that in 
`onPartitionAssigned`, when new tasks are created, we always blindly do a 
commitOffsetOrTransaction as well, so that any ongoing transaction is 
committed before we potentially go on to restore the newly assigned tasks.
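
To make the ordering concrete, here is a rough sketch of the idea in plain 
Java. It is a toy model only, not the Streams code: `FakeTxnProducer`, 
`commitIfOpen` (standing in for commitOffsetOrTransaction) and 
`onNewTasksAssigned` are hypothetical names made up for illustration, and 
the real change would live in the rebalance handling of the StreamThread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CommitBeforeRestoreSketch {

    /** Hypothetical stand-in for a transactional producer; NOT the Kafka client API. */
    static final class FakeTxnProducer {
        private boolean txnOpen = false;
        final List<String> log = new ArrayList<>();

        /** Mirrors the behaviour described in the ticket: a send implicitly begins a txn. */
        void send(String record) {
            if (!txnOpen) {
                txnOpen = true;
                log.add("begin");
            }
            log.add("send:" + record);
        }

        /** Stand-in for commitOffsetOrTransaction; a no-op when no txn is open. */
        void commitIfOpen() {
            if (txnOpen) {
                txnOpen = false;
                log.add("commit");
            }
        }

        boolean hasOpenTransaction() {
            return txnOpen;
        }
    }

    /** Hypothetical hook run when a rebalance hands this thread new tasks. */
    static void onNewTasksAssigned(FakeTxnProducer producer, List<String> newTasks) {
        if (!newTasks.isEmpty()) {
            // Commit first, so no txn stays open for the length of the restoration.
            producer.commitIfOpen();
            producer.log.add("restore:" + String.join(",", newTasks));
        }
    }

    public static void main(String[] args) {
        FakeTxnProducer producer = new FakeTxnProducer();
        producer.send("changelog-update");                         // opens a txn implicitly
        onNewTasksAssigned(producer, Arrays.asList("new-task"));   // commit, then restore
        System.out.println(producer.log);
        System.out.println("txn open during restore: " + producer.hasOpenTransaction());
    }
}
```

The point of the sketch is only the ordering: the commit happens before the 
restore entry, so nothing is in flight while the thread is busy restoring.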

> Long restoration times for new tasks can lead to transaction timeouts
> ---------------------------------------------------------------------
>
>                 Key: KAFKA-13295
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13295
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>            Reporter: A. Sophie Blee-Goldman
>            Priority: Critical
>              Labels: eos
>             Fix For: 3.1.0
>
>
> In some EOS applications with relatively long restoration times we've noticed 
> a series of ProducerFencedExceptions occurring during/immediately after 
> restoration. The broker logs were able to confirm these were due to 
> transactions timing out.
> In Streams, it turns out we automatically begin a new txn when calling 
> {{send}} (if there isn’t already one in flight). A {{send}} often occurs 
> outside a commit during active processing (e.g. writing to the changelog), 
> leaving the txn open until the next commit. And if a StreamThread has been 
> actively processing when a rebalance results in a new stateful task without 
> revoking any existing tasks, the thread won’t actually commit this open txn 
> before it goes back into the restoration phase while it builds up state for 
> the new task. So the in-flight transaction is left open during restoration, 
> during which the StreamThread only consumes from the changelog without 
> committing, leaving it vulnerable to timing out when restoration times exceed 
> the configured transaction.timeout.ms for the producer client.
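
The timing condition in the description above can be written down as a small 
check. This is a back-of-the-envelope model in plain Java, not anything from 
the Kafka code base: the class, the method name and all the numbers are 
hypothetical and chosen only for illustration (they are not defaults). It 
assumes the open txn cannot be committed until restoration has finished.

```java
public class TxnTimeoutSketch {

    /**
     * Returns true if a txn begun at txnStartMs would still be open for longer
     * than txnTimeoutMs by the time restoration ends, i.e. by the earliest
     * moment the thread could commit it again in this simplified model.
     */
    static boolean timesOutDuringRestoration(long txnStartMs,
                                             long restoreStartMs,
                                             long restoreDurationMs,
                                             long txnTimeoutMs) {
        long firstPossibleCommitMs = restoreStartMs + restoreDurationMs;
        return firstPossibleCommitMs - txnStartMs > txnTimeoutMs;
    }

    public static void main(String[] args) {
        long txnTimeoutMs = 10_000L; // illustrative value for transaction.timeout.ms

        // A send opens the txn at t=0, the rebalance starts restoration at t=2s.
        boolean longRestore = timesOutDuringRestoration(0L, 2_000L, 30_000L, txnTimeoutMs);
        boolean shortRestore = timesOutDuringRestoration(0L, 2_000L, 5_000L, txnTimeoutMs);

        System.out.println("30s restoration exceeds timeout: " + longRestore);
        System.out.println("5s restoration exceeds timeout: " + shortRestore);
    }
}
```

In this model the 30s restoration leaves the txn open for 32s against a 10s 
timeout, while the 5s restoration stays within it.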


