[
https://issues.apache.org/jira/browse/KAFKA-19960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18045314#comment-18045314
]
Matthias J. Sax commented on KAFKA-19960:
-----------------------------------------
Thanks. The timeframe of the logs is rather short, but I'll spend some time
studying it.
In parallel, we identified another bug
(https://issues.apache.org/jira/browse/KAFKA-19994) which seems to have similar
symptoms, i.e., it eventually results in a LOCK exception from RocksDB. – The
root cause of K19994 is a `TimeoutException` from the clients, which Kafka
Streams retries until it hits `task.timeout.ms` and gives up, but without doing
a proper resource cleanup.
Not sure if it's the same thing you are hitting or not. K19994 should be
independent of standby tasks, though. Do you see any other exception before you
hit the LOCK exception from RocksDB?
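The K19994 failure mode described above can be sketched roughly as follows. This is a hypothetical illustration (all names are invented, this is not the actual Kafka Streams code): a client call keeps timing out, Streams retries until `task.timeout.ms` is exceeded and then gives up, but the failure path never releases the task's state-directory lock.

```java
// Hypothetical sketch of the K19994 failure mode (all names invented):
// a client call keeps throwing TimeoutException, Kafka Streams retries
// until task.timeout.ms is exceeded and then gives up -- but on that
// failure path the task's state-directory lock is never released.
import java.util.concurrent.locks.ReentrantLock;

public class TaskTimeoutSketch {
    static final ReentrantLock stateDirLock = new ReentrantLock();

    // Returns true if the lock was released; `cleanUpOnFailure` marks the
    // cleanup step that K19994 reportedly skips.
    static boolean runTask(boolean cleanUpOnFailure) {
        stateDirLock.lock(); // taken when the task is assigned
        try {
            // stand-in for: retry the client call until task.timeout.ms elapses
            throw new RuntimeException("TimeoutException: task.timeout.ms exceeded");
        } catch (RuntimeException timeout) {
            if (cleanUpOnFailure) {
                stateDirLock.unlock(); // the missing resource cleanup
                return true;
            }
            return false; // bug path: lock stays held
        }
    }

    public static void main(String[] args) {
        boolean released = runTask(false);
        // prints "lock released: false, still locked: true"
        System.out.println("lock released: " + released
                + ", still locked: " + stateDirLock.isLocked());
    }
}
```

With the lock still held, the next owner of that task directory hits exactly the kind of RocksDB LOCK error shown in the stack trace below.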
> Spurious failure to close StateDirectory due to some task directories still
> locked
> ----------------------------------------------------------------------------------
>
> Key: KAFKA-19960
> URL: https://issues.apache.org/jira/browse/KAFKA-19960
> Project: Kafka
> Issue Type: Bug
> Components: streams
> Affects Versions: 3.8.1
> Reporter: Mikael Carlstedt
> Priority: Critical
> Attachments: debug-1.log, debug.log
>
>
> We are seeing random failures to close state directories when closing a
> Kafka Streams application in a test environment.
> *Preconditions:*
> * Two state stores
> * Three input partitions
> * Stand-by replication enabled (NB: we have not been able to reproduce
> without stand-by replication)
> * Two instances running on a single host with different state directories.
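> The preconditions above correspond roughly to a configuration like the
> following sketch (the application id, bootstrap address, and paths are
> illustrative placeholders, not the values from the logs):

```java
// Illustrative Kafka Streams configuration matching the preconditions above:
// stand-by replication enabled, and two instances on one host, each with its
// own state directory. Values are placeholders, not the ones from the logs.
import java.util.Properties;

public class RepoConfigSketch {
    static Properties instanceConfig(String stateDir) {
        Properties p = new Properties();
        p.setProperty("application.id", "delay-queue-test");  // placeholder
        p.setProperty("bootstrap.servers", "localhost:9092"); // placeholder
        p.setProperty("num.standby.replicas", "1");  // stand-by replication on
        p.setProperty("state.dir", stateDir);        // distinct per instance
        return p;
    }

    public static void main(String[] args) {
        Properties node1 = instanceConfig("/tmp/node1");
        Properties node2 = instanceConfig("/tmp/node2");
        System.out.println(node1.getProperty("state.dir")); // prints "/tmp/node1"
        System.out.println(node2.getProperty("state.dir")); // prints "/tmp/node2"
    }
}
```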
> The application instances are started before each test case is executed, and
> then closed when the test case has completed. Most of the time it works well
> without any errors logged, but sometimes we see this error message when
> closing an application instance:
>
> {noformat}
> 25-12-04T13:01:18.711 ERROR o.a.k.s.p.i.StateDirectory:397 - Some task
> directories still locked while closing state, this indicates unclean
> shutdown: {0_2= StreamsThread threadId:
> messages.nobill.campaigns.delay-queue-78f61f55-4863-4c4f-913c-bd398eceed0e-StreamThread-1
> TaskManager
> MetadataState:
> Tasks:
> , 0_0= StreamsThread threadId:
> messages.nobill.campaigns.delay-queue-78f61f55-4863-4c4f-913c-bd398eceed0e-StreamThread-1
> TaskManager
> MetadataState:
> Tasks:
> }{noformat}
>
> * This has a knock-on effect on all of the following test cases, which fail
> with this error message:
>
> {noformat}
> 25-12-04T13:01:28.684 ERROR s.s.n.k.k.i.KafkaClient:179 - Unhandled exception
> in Kafka streams application messages.nobill.campaigns.delay-queue
> org.apache.kafka.streams.errors.ProcessorStateException: Error opening store
> rate-limiterpriority-chronological-store-0 at location
> /tmp/node2/messages.nobill.campaigns.delay-queue/0_2/rocksdb/rate-limiterpriority-chronological-store-0
> at
> org.apache.kafka.streams.state.internals.RocksDBStore.openRocksDB(RocksDBStore.java:330)
> ...
> Caused by: org.rocksdb.RocksDBException: lock hold by current process,
> acquire time 1764849665 acquiring thread 14274883584:
> /tmp/node2/messages.nobill.campaigns.delay-queue/0_2/rocksdb/rate-limiterpriority-chronological-store-0/LOCK:
> No locks available
> at org.rocksdb.RocksDB.open(Native Method)
> at org.rocksdb.RocksDB.open(RocksDB.java:307)
> at
> org.apache.kafka.streams.state.internals.RocksDBStore.openRocksDB(RocksDBStore.java:324)
> ... 19 common frames omitted{noformat}
> *Observation:*
>
> * Prior to the error, the two stand-by tasks that fail to release their
> locks are "closed and recycled":
> {noformat}
> 25-12-04T13:01:18.080 INFO o.a.k.s.p.i.StandbyTask:149 - stream-thread
> [messages.nobill.campaigns.delay-queue-78f61f55-4863-4c4f-913c-bd398eceed0e-StreamThread-1]
> standby-task [0_2] Suspended running
> 25-12-04T13:01:18.080 DEBUG o.a.k.s.p.i.ProcessorStateManager:633 -
> stream-thread
> [messages.nobill.campaigns.delay-queue-78f61f55-4863-4c4f-913c-bd398eceed0e-StreamThread-1]
> standby-task [0_2] Recycling state for STANDBY task 0_2.
> 25-12-04T13:01:18.080 DEBUG o.a.k.s.p.i.ProcessorStateManager:644 -
> stream-thread
> [messages.nobill.campaigns.delay-queue-78f61f55-4863-4c4f-913c-bd398eceed0e-StreamThread-1]
> standby-task [0_2] Clearing all store caches registered in the state
> manager: ...
> 25-12-04T13:01:18.080 INFO o.a.k.s.p.i.StandbyTask:254 - stream-thread
> [messages.nobill.campaigns.delay-queue-78f61f55-4863-4c4f-913c-bd398eceed0e-StreamThread-1]
> standby-task [0_2] Closed and recycled state{noformat}
> * By comparison, the third task is "closed clean":
> {noformat}
> 25-12-04T13:01:17.024 INFO o.a.k.s.p.i.StandbyTask:149 - stream-thread
> [messages.nobill.campaigns.delay-queue-bff50025-9296-45b7-ab1a-43480aee6f66-StreamThread-1]
> standby-task [0_1] Suspended running
> 25-12-04T13:01:17.024 DEBUG o.a.k.s.p.i.ProcessorStateManager:585 -
> stream-thread
> [messages.nobill.campaigns.delay-queue-bff50025-9296-45b7-ab1a-43480aee6f66-StreamThread-1]
> standby-task [0_1] Closing its state manager and all the registered state
> stores: ...
> 25-12-04T13:01:17.028 DEBUG o.a.k.s.p.i.StateDirectory:377 - stream-thread
> [messages.nobill.campaigns.delay-queue-bff50025-9296-45b7-ab1a-43480aee6f66-StreamThread-1]
> Released state dir lock for task 0_1
> 25-12-04T13:01:17.028 INFO o.a.k.s.p.i.StandbyTask:232 - stream-thread
> [messages.nobill.campaigns.delay-queue-bff50025-9296-45b7-ab1a-43480aee6f66-StreamThread-1]
> standby-task [0_1] Closed clean{noformat}
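> The difference between the two close paths can be modeled conceptually like
> this (this is NOT the real StateDirectory code, just a sketch of the observed
> behavior): "closed clean" releases the task directory lock, while "closed and
> recycled" appears to keep it held for a successor task, so a shutdown that
> happens before the successor releases it finds the directory still locked.

```java
// Conceptual model (not the actual StateDirectory code) of the two close
// paths seen in the logs: closeClean releases the task-directory lock,
// while closeAndRecycle keeps it held for the successor task. At shutdown,
// any still-locked directories trigger the "Some task directories still
// locked while closing state" error at the top of this report.
import java.util.HashSet;
import java.util.Set;

public class ClosePathSketch {
    final Set<String> lockedTaskDirs = new HashSet<>();

    void lock(String taskId)            { lockedTaskDirs.add(taskId); }
    void closeClean(String taskId)      { lockedTaskDirs.remove(taskId); }
    void closeAndRecycle(String taskId) { /* lock kept for the successor */ }

    Set<String> stillLockedAtShutdown() { return lockedTaskDirs; }

    public static void main(String[] args) {
        ClosePathSketch dirs = new ClosePathSketch();
        dirs.lock("0_0"); dirs.lock("0_1"); dirs.lock("0_2");
        dirs.closeClean("0_1");      // like standby 0_1 in the logs above
        dirs.closeAndRecycle("0_0"); // like standbys 0_0 and 0_2
        dirs.closeAndRecycle("0_2");
        System.out.println(dirs.stillLockedAtShutdown()); // 0_0 and 0_2 remain
    }
}
```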
> What is it that triggers this "recycling" of stand-by tasks?
--
This message was sent by Atlassian Jira
(v8.20.10#820010)