[jira] [Comment Edited] (KAFKA-19960) Spurious failure to close StateDirectory due to some task directories still locked

Matthias J. Sax (Jira) Thu, 04 Dec 2025 09:16:15 -0800


    [ 
https://issues.apache.org/jira/browse/KAFKA-19960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18042854#comment-18042854
 ]


Matthias J. Sax edited comment on KAFKA-19960 at 12/4/25 5:15 PM:
------------------------------------------------------------------

Thanks for reporting this issue. Can you reproduce this issue with newer 
version of Kafka Streams, like 4.1.1? We did fix a couple of lock related 
issues already. – Just want to make sure, we don't investigate something that 
was already fixed.

The only thing sticking out to me atm is:
{code:java}
location 
/tmp/node2/messages.nobill.campaigns.delay-queue/0_2/rocksdb/rate-limiterpriority-chronological-store-0
 {code}
Seems you did not set `state.dir` config but use the default `/tmp` directory. 
It's not recommended to use `/tmp` because the OS could mess with it.

Can you set `state.dir` to a different directory? Does the problem persist for 
this case?
{quote}What is it that triggers this "recycling" of stand-by tasks?
{quote}
This means, we convert the standby task into an active task. The lock won't be 
release for this case (the active task just take on the lock from the standby 
task), but the lock should be release when the created active task get's closed 
later. – It might be possible that there is some bug in transferring lock 
ownership.


was (Author: mjsax):
Thanks for reporting this issue. Can you reproduce this issue with newer 
version of Kafka Streams, like 4.1.1? We did fix a couple of lock related 
issues already. – Just want to make sure, we don't investigate something that 
was already fixed.

The only thing sticking out to me atm is:
{code:java}
location 
/tmp/node2/messages.nobill.campaigns.delay-queue/0_2/rocksdb/rate-limiterpriority-chronological-store-0
 {code}
Seems you did not set `state.dir` config but use the default `/tmp` directory. 
It's not recommended to use `/tmp` because the OS could mess with it.

Can you set `state.dir` to a different directory? Does the problem persist for 
this case?

> Spurious failure to close StateDirectory due to some task directories still 
> locked
> ----------------------------------------------------------------------------------
>
>                 Key: KAFKA-19960
>                 URL: https://issues.apache.org/jira/browse/KAFKA-19960
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 3.8.1
>            Reporter: Mikael Carlstedt
>            Priority: Major
>
> We are seeing random failures to close state directories while closing a 
> kafka-streams application in a test environment.
> *Preconditions:*
>  * Two state stores
>  * Three input partitions
>  * Stand-by replication enabled (NB: we have not been able to reproduce 
> without stand-by replication)
>  * Two instances running on a single host with different state directory.
> The application instances are started before each test case is executed, and 
> then closed when the test case has completed. Most of the time it works well 
> without any errors logged, but sometimes we see this error message when 
> closing an application instance:
>  
> {noformat}
> 25-12-04T13:01:18.711 ERROR o.a.k.s.p.i.StateDirectory:397 - Some task 
> directories still locked while closing state, this indicates unclean 
> shutdown: {0_2=   StreamsThread threadId: 
> messages.nobill.campaigns.delay-queue-78f61f55-4863-4c4f-913c-bd398eceed0e-StreamThread-1
> TaskManager
>       MetadataState:
>       Tasks:
> , 0_0=        StreamsThread threadId: 
> messages.nobill.campaigns.delay-queue-78f61f55-4863-4c4f-913c-bd398eceed0e-StreamThread-1
> TaskManager
>       MetadataState:
>       Tasks:
> }{noformat}
>  
>  * is has a knock-on effect on all of the following test cases, which fail 
> with this error message:
>  
> {noformat}
> 25-12-04T13:01:28.684 ERROR s.s.n.k.k.i.KafkaClient:179 - Unhandled exception 
> in Kafka streams application messages.nobill.campaigns.delay-queue
> org.apache.kafka.streams.errors.ProcessorStateException: Error opening store 
> rate-limiterpriority-chronological-store-0 at location 
> /tmp/node2/messages.nobill.campaigns.delay-queue/0_2/rocksdb/rate-limiterpriority-chronological-store-0
>       at 
> org.apache.kafka.streams.state.internals.RocksDBStore.openRocksDB(RocksDBStore.java:330)
> ...
> Caused by: org.rocksdb.RocksDBException: lock hold by current process, 
> acquire time 1764849665 acquiring thread 14274883584: 
> /tmp/node2/messages.nobill.campaigns.delay-queue/0_2/rocksdb/rate-limiterpriority-chronological-store-0/LOCK:
>  No locks available
>       at org.rocksdb.RocksDB.open(Native Method)
>       at org.rocksdb.RocksDB.open(RocksDB.java:307)
>       at 
> org.apache.kafka.streams.state.internals.RocksDBStore.openRocksDB(RocksDBStore.java:324)
>       ... 19 common frames omitted{noformat}
> *Observation:*
>  
>  * Prior to the error, the two stand-by tasks that fail to release their 
> locks are "closed and recycled":
> {noformat}
> 25-12-04T13:01:18.080  INFO o.a.k.s.p.i.StandbyTask:149 - stream-thread 
> [messages.nobill.campaigns.delay-queue-78f61f55-4863-4c4f-913c-bd398eceed0e-StreamThread-1]
>  standby-task [0_2] Suspended running
> 25-12-04T13:01:18.080 DEBUG o.a.k.s.p.i.ProcessorStateManager:633 - 
> stream-thread 
> [messages.nobill.campaigns.delay-queue-78f61f55-4863-4c4f-913c-bd398eceed0e-StreamThread-1]
>  standby-task [0_2] Recycling state for STANDBY task 0_2.
> 25-12-04T13:01:18.080 DEBUG o.a.k.s.p.i.ProcessorStateManager:644 - 
> stream-thread 
> [messages.nobill.campaigns.delay-queue-78f61f55-4863-4c4f-913c-bd398eceed0e-StreamThread-1]
>  standby-task [0_2] Clearing all store caches registered in the state 
> manager: ...
> 25-12-04T13:01:18.080  INFO o.a.k.s.p.i.StandbyTask:254 - stream-thread 
> [messages.nobill.campaigns.delay-queue-78f61f55-4863-4c4f-913c-bd398eceed0e-StreamThread-1]
>  standby-task [0_2] Closed and recycled state{noformat}
>  * By comparison, the third task is "closed clean":
> {noformat}
> 25-12-04T13:01:17.024  INFO o.a.k.s.p.i.StandbyTask:149 - stream-thread 
> [messages.nobill.campaigns.delay-queue-bff50025-9296-45b7-ab1a-43480aee6f66-StreamThread-1]
>  standby-task [0_1] Suspended running
> 25-12-04T13:01:17.024 DEBUG o.a.k.s.p.i.ProcessorStateManager:585 - 
> stream-thread 
> [messages.nobill.campaigns.delay-queue-bff50025-9296-45b7-ab1a-43480aee6f66-StreamThread-1]
>  standby-task [0_1] Closing its state manager and all the registered state 
> stores: ...
> 25-12-04T13:01:17.028 DEBUG o.a.k.s.p.i.StateDirectory:377 - stream-thread 
> [messages.nobill.campaigns.delay-queue-bff50025-9296-45b7-ab1a-43480aee6f66-StreamThread-1]
>  Released state dir lock for task 0_1
> 25-12-04T13:01:17.028  INFO o.a.k.s.p.i.StandbyTask:232 - stream-thread 
> [messages.nobill.campaigns.delay-queue-bff50025-9296-45b7-ab1a-43480aee6f66-StreamThread-1]
>  standby-task [0_1] Closed clean{noformat}
> What is it that triggers this "recycling" of stand-by tasks?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Comment Edited] (KAFKA-19960) Spurious failure to close StateDirectory due to some task directories still locked

Reply via email to