[
https://issues.apache.org/jira/browse/KAFKA-12679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732025#comment-17732025
]
Andras Hatvani edited comment on KAFKA-12679 at 6/13/23 11:08 AM:
------------------------------------------------------------------
[~lbrutschy] [~mjsax] This is a pressing issue for my client, as I could only
work around it by reducing {{num.stream.threads}} to {{1}}, which shouldn't be
necessary.
was (Author: andras hatvani):
[~lbrutschy] [~mjsax] This is a pressing issue for my client and I've worked it
around by reducing `num.stream.threads` to `1` which shouldn't be necessary.
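
For reference, a minimal sketch of the single-thread workaround mentioned in the
comment above (application id, bootstrap servers, serdes, and topic names are
placeholders):
{code:java}
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class SingleThreadWorkaround {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");    // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // Workaround: a single stream thread per instance, so no second thread
        // on the same instance can spin on the state directory lock.
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 1);

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-topic").to("output-topic");                    // placeholder topology

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
{code}
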
> Rebalancing a restoring or running task may cause directory livelocking with
> newly created task
> -----------------------------------------------------------------------------------------------
>
> Key: KAFKA-12679
> URL: https://issues.apache.org/jira/browse/KAFKA-12679
> Project: Kafka
> Issue Type: Bug
> Components: streams
> Affects Versions: 2.6.1
> Environment: Broker and client version 2.6.1
> Multi-node broker cluster
> Multi-node, auto scaling streams app instances
> Reporter: Peter Nahas
> Priority: Major
> Attachments: Backoff-between-directory-lock-attempts.patch
>
>
> If a task that uses a state store is in the restoring state or in a running
> state and the task gets rebalanced to a separate thread on the same instance,
> the newly created task will attempt to lock the state store directory while
> the first thread is continuing to use it. This is totally normal and expected
> behavior when the first thread is not yet aware of the rebalance. However,
> that newly created task is effectively running a while loop with no backoff
> waiting to lock the directory:
> # TaskManager tells the task to restore in `tryToCompleteRestoration`
> # The task attempts to lock the directory
> # The lock attempt fails and throws an
> `org.apache.kafka.streams.errors.LockException`
> # TaskManager catches the exception, stops further processing on the task
> and reports that not all tasks have restored
> # The StreamThread `runLoop` continues to run.
> I've seen some documentation indicating that there is supposed to be a backoff
> when this condition occurs, but there does not appear to be one in the code.
> The result is that if this goes on for long enough, the lock loop may
> dominate CPU usage in the process and starve out the old stream thread's task
> processing.
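>
> For illustration only, here is a minimal stand-alone sketch of that busy-wait
> pattern (this is not Kafka code; the lock and the thread roles are just
> placeholders for the state directory lock and the old/new task owners):
> {code:java}
> import java.util.concurrent.TimeUnit;
> import java.util.concurrent.locks.ReentrantLock;
>
> // One thread (the "old" owner) holds the lock while another retries
> // tryLock() in a tight loop with no backoff, burning CPU until the
> // lock is finally released.
> public class LockSpinDemo {
>     public static void main(String[] args) throws InterruptedException {
>         ReentrantLock stateDirLock = new ReentrantLock();
>
>         Thread oldOwner = new Thread(() -> {
>             stateDirLock.lock();
>             try {
>                 TimeUnit.SECONDS.sleep(5); // still "using" the state store
>             } catch (InterruptedException ignored) {
>             } finally {
>                 stateDirLock.unlock();
>             }
>         });
>
>         Thread newOwner = new Thread(() -> {
>             long attempts = 0;
>             while (!stateDirLock.tryLock()) {
>                 attempts++;                // no sleep/backoff -> busy loop
>             }
>             stateDirLock.unlock();
>             System.out.println("Acquired after " + attempts + " attempts");
>         });
>
>         oldOwner.start();
>         newOwner.start();
>         oldOwner.join();
>         newOwner.join();
>     }
> }
> {code}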
>
> When in this state, the DEBUG level logging for TaskManager will produce a
> steady stream of messages like the following:
> {noformat}
> 2021-03-30 20:59:51,098 DEBUG --- [StreamThread-10] o.a.k.s.p.i.TaskManager
> : stream-thread [StreamThread-10] Could not initialize 0_34 due
> to the following exception; will retry
> org.apache.kafka.streams.errors.LockException: stream-thread
> [StreamThread-10] standby-task [0_34] Failed to lock the state directory for
> task 0_34
> {noformat}
>
>
> I've attached a git-formatted patch to resolve the issue. It simply detects
> the scenario and sleeps for the backoff time in the appropriate place in
> StreamThread.
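>
> The shape of that mitigation, reduced to a hedged sketch (placeholder helper,
> not the actual StreamThread change; see the attached patch for that):
> {code:java}
> import java.util.concurrent.locks.Lock;
>
> // Back off between failed lock attempts instead of retrying immediately,
> // so the retry loop no longer monopolises a CPU core.
> public final class BackoffLocking {
>     public static void lockWithBackoff(Lock lock, long backoffMs)
>             throws InterruptedException {
>         while (!lock.tryLock()) {
>             Thread.sleep(backoffMs); // the missing backoff between attempts
>         }
>     }
> }
> {code}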
>
>