[
https://issues.apache.org/jira/browse/KAFKA-12679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17847115#comment-17847115
]
Colt McNealy edited comment on KAFKA-12679 at 5/17/24 12:08 AM:
----------------------------------------------------------------
We have pretty much the same issue when running with 5 stream threads and
recovering from a mildly unclean shutdown. We get it for both `ACTIVE` and
`STANDBY` tasks. The behavior is the same whether or not the State Updater is
enabled via the internal config.
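For context, here is a minimal sketch of the kind of setup we run (the
application id, topics, and the `__state.updater.enabled__` key are assumptions
for illustration; internal configs are unsupported and version-dependent):
{code:java}
import java.util.Properties;

import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class FiveThreadApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "repro-app");         // hypothetical
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 5);
        // Internal, unsupported toggle for the State Updater; the key name is
        // an assumption and may change between versions.
        props.put("__state.updater.enabled__", true);

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-topic").to("output-topic");                    // hypothetical topics
        new KafkaStreams(builder.build(), props).start();
    }
}
{code}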
We also notice that the application makes no forward progress at all;
restorations are simply stuck.
Also, [~lucasbru]'s comment about this being solved in `trunk` might be
outdated? If I recall correctly, the State Updater was planned to be GA in
3.7.0 at one point and was then backed out. Is that correct?
> Rebalancing a restoring or running task may cause directory livelocking with
> newly created task
> -----------------------------------------------------------------------------------------------
>
> Key: KAFKA-12679
> URL: https://issues.apache.org/jira/browse/KAFKA-12679
> Project: Kafka
> Issue Type: Bug
> Components: streams
> Affects Versions: 2.6.1
> Environment: Broker and client version 2.6.1
> Multi-node broker cluster
> Multi-node, auto scaling streams app instances
> Reporter: Peter Nahas
> Assignee: Lucas Brutschy
> Priority: Major
> Fix For: 3.7.0
>
> Attachments: Backoff-between-directory-lock-attempts.patch
>
>
> If a task that uses a state store is in the restoring or running state and
> the task gets rebalanced to a separate thread on the same instance, the
> newly created task will attempt to lock the state store directory while the
> first thread is still using it. This is totally normal and expected behavior
> when the first thread is not yet aware of the rebalance. However, that newly
> created task is effectively running a while loop with no backoff waiting to
> lock the directory (see the sketch after the steps below):
> # TaskManager tells the task to restore in `tryToCompleteRestoration`
> # The task attempts to lock the directory
> # The lock attempt fails and throws a
> `org.apache.kafka.streams.errors.LockException`
> # TaskManager catches the exception, stops further processing on the task
> and reports that not all tasks have restored
> # The StreamThread `runLoop` continues to run.
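> In concrete terms, the loop behaves like the toy below, which uses a
> `ReentrantLock` as a stand-in for the real file-based state directory lock
> (illustrative only, not the actual TaskManager/StreamThread code):
> {code:java}
> import java.util.concurrent.locks.ReentrantLock;
>
> // One thread holds the "state directory" lock (the old stream thread that
> // has not yet noticed the rebalance) while the main thread retries
> // tryLock() with no backoff, as the newly created task effectively does.
> public class LockSpinDemo {
>     public static void main(String[] args) throws InterruptedException {
>         ReentrantLock stateDirLock = new ReentrantLock();
>
>         Thread oldStreamThread = new Thread(() -> {
>             stateDirLock.lock();
>             try {
>                 Thread.sleep(2_000); // still working, unaware of the rebalance
>             } catch (InterruptedException ignored) {
>             } finally {
>                 stateDirLock.unlock();
>             }
>         });
>         oldStreamThread.start();
>         Thread.sleep(100); // let the old thread take the lock first
>
>         long attempts = 0;
>         while (!stateDirLock.tryLock()) {
>             attempts++; // analogue of catching LockException and retrying at once
>         }
>         stateDirLock.unlock();
>         oldStreamThread.join();
>
>         // Typically prints a count in the tens of millions: CPU burned
>         // while restoration makes no progress.
>         System.out.println("tryLock attempts before success: " + attempts);
>     }
> }
> {code}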
> I've seen some documentation indicating that there is supposed to be a
> backoff when this condition occurs, but there does not appear to be one in
> the code. The result is that, if this goes on long enough, the lock loop may
> dominate CPU usage in the process and starve out the old stream thread's
> task processing.
>
> When in this state, the DEBUG level logging for TaskManager will produce a
> steady stream of messages like the following:
> {noformat}
> 2021-03-30 20:59:51,098 DEBUG --- [StreamThread-10] o.a.k.s.p.i.TaskManager
> : stream-thread [StreamThread-10] Could not initialize 0_34 due
> to the following exception; will retry
> org.apache.kafka.streams.errors.LockException: stream-thread
> [StreamThread-10] standby-task [0_34] Failed to lock the state directory for
> task 0_34
> {noformat}
>
>
> I've attached a git-formatted patch to resolve the issue. It simply detects
> the scenario and sleeps for the backoff time in the appropriate StreamThread.
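> The shape of the fix is just a bounded sleep between attempts. Dropped into
> the toy above in place of the bare spin (illustrative only, not the attached
> patch verbatim; the 100 ms interval is my assumption, not a value taken from
> the patch):
> {code:java}
> // Back off between lock attempts instead of retrying immediately; the real
> // patch sleeps in the StreamThread rather than in a demo loop.
> final long backoffMs = 100L;
> long attempts = 0;
> while (!stateDirLock.tryLock()) {
>     attempts++;
>     Thread.sleep(backoffMs); // yield the CPU between attempts
> }
> {code}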
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)