[jira] [Commented] (KAFKA-13295) Long restoration times for new tasks can lead to transaction timeouts

A. Sophie Blee-Goldman (Jira) Tue, 28 Sep 2021 16:06:10 -0700


    [ 
https://issues.apache.org/jira/browse/KAFKA-13295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421822#comment-17421822
 ]


A. Sophie Blee-Goldman commented on KAFKA-13295:
------------------------------------------------

{quote}we always blindly do a commitOffsetOrTransaction as well so that if 
there's any ongoing transactions, it would be committed before we potentially 
go on to restore the newly assigned tasks.
{quote}
I still think we want to keep the scope of committing offsets as narrow as 
possible, since committing adds nontrivial overhead to the rebalance. The 
ideal fix would be to track how long the transaction has been open during 
restoration and commit if we find ourselves getting too close to the timeout. 
But that would introduce some unnecessary complexity, so a good middle ground 
might be to just check whether `Task#needsCommit` is true for any of the 
active tasks, and maybe also whether we even have any new tasks to restore. I 
think those two checks would be simple enough while still keeping the commit 
sufficiently narrow in scope.

[~sagarrao] does that make sense?
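
To make the middle-ground proposal concrete, here is a minimal sketch of the two checks. It is not the actual Streams code: `TaskView` is a hypothetical stand-in for the internal task type that models only `needsCommit()`, and `RebalanceCommitCheck` and its method name are made up for illustration.

```java
import java.util.Collection;

// Hypothetical stand-in for the Streams-internal task type; only the
// needsCommit() check named in the comment is modeled here.
interface TaskView {
    boolean needsCommit();
}

// Illustrative helper, not part of Kafka Streams.
final class RebalanceCommitCheck {
    private RebalanceCommitCheck() {}

    // Commit before going back into restoration only if
    // (a) there is at least one new task to restore, and
    // (b) at least one active task has uncommitted work.
    static boolean shouldCommitBeforeRestoration(
            Collection<? extends TaskView> activeTasks,
            Collection<? extends TaskView> newTasksToRestore) {
        if (newTasksToRestore.isEmpty()) {
            return false; // nothing to restore, so no long restoration phase follows
        }
        for (TaskView task : activeTasks) {
            if (task.needsCommit()) {
                return true; // an open transaction may exist for this task
            }
        }
        return false;
    }
}
```

Both checks are cheap, so the extra commit would only be paid on rebalances that actually hand the thread a task to restore while uncommitted work is pending.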

> Long restoration times for new tasks can lead to transaction timeouts
> ---------------------------------------------------------------------
>
>                 Key: KAFKA-13295
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13295
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>            Reporter: A. Sophie Blee-Goldman
>            Assignee: Sagar Rao
>            Priority: Critical
>              Labels: eos
>             Fix For: 3.1.0
>
>
> In some EOS applications with relatively long restoration times we've noticed 
> a series of ProducerFencedExceptions occurring during/immediately after 
> restoration. The broker logs were able to confirm these were due to 
> transactions timing out.
> In Streams, it turns out we automatically begin a new txn when calling 
> {{send}} (if there isn’t already one in flight). A {{send}} often occurs 
> outside a commit during active processing (e.g. writing to the changelog), 
> leaving the txn open until the next commit. And if a StreamThread has been 
> actively processing when a rebalance results in a new stateful task without 
> revoking any existing tasks, the thread won’t actually commit this open txn 
> before it goes back into the restoration phase while it builds up state for 
> the new task. So the in-flight transaction is left open during restoration, 
> during which the StreamThread only consumes from the changelog without 
> committing, leaving it vulnerable to timing out when restoration times exceed 
> the configured transaction.timeout.ms for the producer client.
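
The failure mode in the description comes down to a comparison between how long the transaction has been open and the configured transaction.timeout.ms. A minimal sketch of the time-based check mentioned in the comment, assuming the thread records when the transaction was begun, might look like the following. `RestorationTxnGuard`, its method, and the safety-margin parameter are hypothetical and do not exist in Kafka.

```java
// Illustrative helper, not part of Kafka Streams.
final class RestorationTxnGuard {
    private RestorationTxnGuard() {}

    // Returns true once the open transaction's age has reached the timeout
    // minus a safety margin, i.e. when a commit should happen before
    // restoration continues.
    static boolean closeToTimeout(long txnStartMs,
                                  long nowMs,
                                  long transactionTimeoutMs,
                                  long safetyMarginMs) {
        long elapsedMs = nowMs - txnStartMs;
        return elapsedMs >= transactionTimeoutMs - safetyMarginMs;
    }
}
```

A restoration loop could call this on each iteration with the current time. Any concrete timeout or margin values passed in are up to the caller and are not taken from the issue.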



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
