[ 
https://issues.apache.org/jira/browse/KAFKA-19960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18042854#comment-18042854
 ] 

Matthias J. Sax edited comment on KAFKA-19960 at 12/4/25 5:13 PM:
------------------------------------------------------------------

Thanks for reporting this issue. Can you reproduce it with a newer version of Kafka Streams, like 4.1.1? We did fix a couple of lock-related issues already – just want to make sure we don't investigate something that was already fixed.

The only thing sticking out to me atm is:
{code:java}
location /tmp/node2/messages.nobill.campaigns.delay-queue/0_2/rocksdb/rate-limiterpriority-chronological-store-0
{code}
Seems you did not set the `state.dir` config but use the default `/tmp` directory. Using `/tmp` is not recommended, because the OS may clean it up and delete files underneath the running application.

Can you set `state.dir` to a different directory? Does the problem persist in that case?
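
For illustration, a minimal sketch of pointing `state.dir` somewhere other than `/tmp` (the directory, application id, and bootstrap servers below are placeholders, not taken from your setup):
{code:java}
import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class StateDirExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // placeholder application id and bootstrap servers
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "messages.nobill.campaigns.delay-queue");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // dedicated, persistent state directory instead of the default under /tmp
        props.put(StreamsConfig.STATE_DIR_CONFIG, "/var/lib/kafka-streams/node2");

        StreamsBuilder builder = new StreamsBuilder();
        // ... topology definition elided ...
        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}
{code}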



> Spurious failure to close StateDirectory due to some task directories still 
> locked
> ----------------------------------------------------------------------------------
>
>                 Key: KAFKA-19960
>                 URL: https://issues.apache.org/jira/browse/KAFKA-19960
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 3.8.1
>            Reporter: Mikael Carlstedt
>            Priority: Major
>
> We are seeing random failures to close state directories when shutting down a Kafka Streams application in a test environment.
> *Preconditions:*
>  * Two state stores
>  * Three input partitions
>  * Stand-by replication enabled (NB: we have not been able to reproduce 
> without stand-by replication)
>  * Two instances running on a single host, each with its own state directory (a configuration sketch follows this list).
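> For reference, a rough sketch of the configuration behind these preconditions (values are illustrative, not the exact ones from our tests):
> {code:java}
> Properties props = new Properties();
> // placeholder application id and bootstrap servers
> props.put(StreamsConfig.APPLICATION_ID_CONFIG, "messages.nobill.campaigns.delay-queue");
> props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
> // stand-by replication enabled: one stand-by replica per active task
> props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
> // each of the two instances on the host gets its own state directory (/tmp/node1, /tmp/node2)
> props.put(StreamsConfig.STATE_DIR_CONFIG, "/tmp/node2");
> {code}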
> The application instances are started before each test case is executed, and 
> then closed when the test case has completed. Most of the time it works well 
> without any errors logged, but sometimes we see this error message when 
> closing an application instance:
>  
> {noformat}
> 25-12-04T13:01:18.711 ERROR o.a.k.s.p.i.StateDirectory:397 - Some task 
> directories still locked while closing state, this indicates unclean 
> shutdown: {0_2=   StreamsThread threadId: 
> messages.nobill.campaigns.delay-queue-78f61f55-4863-4c4f-913c-bd398eceed0e-StreamThread-1
> TaskManager
>       MetadataState:
>       Tasks:
> , 0_0=        StreamsThread threadId: 
> messages.nobill.campaigns.delay-queue-78f61f55-4863-4c4f-913c-bd398eceed0e-StreamThread-1
> TaskManager
>       MetadataState:
>       Tasks:
> }{noformat}
>  
>  * This has a knock-on effect on all of the following test cases, which fail with this error message:
>  
> {noformat}
> 25-12-04T13:01:28.684 ERROR s.s.n.k.k.i.KafkaClient:179 - Unhandled exception 
> in Kafka streams application messages.nobill.campaigns.delay-queue
> org.apache.kafka.streams.errors.ProcessorStateException: Error opening store 
> rate-limiterpriority-chronological-store-0 at location 
> /tmp/node2/messages.nobill.campaigns.delay-queue/0_2/rocksdb/rate-limiterpriority-chronological-store-0
>       at 
> org.apache.kafka.streams.state.internals.RocksDBStore.openRocksDB(RocksDBStore.java:330)
> ...
> Caused by: org.rocksdb.RocksDBException: lock hold by current process, 
> acquire time 1764849665 acquiring thread 14274883584: 
> /tmp/node2/messages.nobill.campaigns.delay-queue/0_2/rocksdb/rate-limiterpriority-chronological-store-0/LOCK:
>  No locks available
>       at org.rocksdb.RocksDB.open(Native Method)
>       at org.rocksdb.RocksDB.open(RocksDB.java:307)
>       at 
> org.apache.kafka.streams.state.internals.RocksDBStore.openRocksDB(RocksDBStore.java:324)
>       ... 19 common frames omitted{noformat}
> *Observation:*
>  
>  * Prior to the error, the two stand-by tasks that fail to release their 
> locks are "closed and recycled":
> {noformat}
> 25-12-04T13:01:18.080  INFO o.a.k.s.p.i.StandbyTask:149 - stream-thread 
> [messages.nobill.campaigns.delay-queue-78f61f55-4863-4c4f-913c-bd398eceed0e-StreamThread-1]
>  standby-task [0_2] Suspended running
> 25-12-04T13:01:18.080 DEBUG o.a.k.s.p.i.ProcessorStateManager:633 - 
> stream-thread 
> [messages.nobill.campaigns.delay-queue-78f61f55-4863-4c4f-913c-bd398eceed0e-StreamThread-1]
>  standby-task [0_2] Recycling state for STANDBY task 0_2.
> 25-12-04T13:01:18.080 DEBUG o.a.k.s.p.i.ProcessorStateManager:644 - 
> stream-thread 
> [messages.nobill.campaigns.delay-queue-78f61f55-4863-4c4f-913c-bd398eceed0e-StreamThread-1]
>  standby-task [0_2] Clearing all store caches registered in the state 
> manager: ...
> 25-12-04T13:01:18.080  INFO o.a.k.s.p.i.StandbyTask:254 - stream-thread 
> [messages.nobill.campaigns.delay-queue-78f61f55-4863-4c4f-913c-bd398eceed0e-StreamThread-1]
>  standby-task [0_2] Closed and recycled state{noformat}
>  * By comparison, the third task is "closed clean":
> {noformat}
> 25-12-04T13:01:17.024  INFO o.a.k.s.p.i.StandbyTask:149 - stream-thread 
> [messages.nobill.campaigns.delay-queue-bff50025-9296-45b7-ab1a-43480aee6f66-StreamThread-1]
>  standby-task [0_1] Suspended running
> 25-12-04T13:01:17.024 DEBUG o.a.k.s.p.i.ProcessorStateManager:585 - 
> stream-thread 
> [messages.nobill.campaigns.delay-queue-bff50025-9296-45b7-ab1a-43480aee6f66-StreamThread-1]
>  standby-task [0_1] Closing its state manager and all the registered state 
> stores: ...
> 25-12-04T13:01:17.028 DEBUG o.a.k.s.p.i.StateDirectory:377 - stream-thread 
> [messages.nobill.campaigns.delay-queue-bff50025-9296-45b7-ab1a-43480aee6f66-StreamThread-1]
>  Released state dir lock for task 0_1
> 25-12-04T13:01:17.028  INFO o.a.k.s.p.i.StandbyTask:232 - stream-thread 
> [messages.nobill.campaigns.delay-queue-bff50025-9296-45b7-ab1a-43480aee6f66-StreamThread-1]
>  standby-task [0_1] Closed clean{noformat}
> What is it that triggers this "recycling" of stand-by tasks?


