[
https://issues.apache.org/jira/browse/KAFKA-19960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18042854#comment-18042854
]
Matthias J. Sax edited comment on KAFKA-19960 at 12/4/25 5:13 PM:
------------------------------------------------------------------
Thanks for reporting this issue. Can you reproduce it with a newer
version of Kafka Streams, like 4.1.1? We have already fixed a couple of
lock-related issues. – Just want to make sure we don't investigate something
that was already fixed.
The only thing sticking out to me atm is:
{code:java}
location
/tmp/node2/messages.nobill.campaigns.delay-queue/0_2/rocksdb/rate-limiterpriority-chronological-store-0
{code}
It seems you did not set the `state.dir` config and are using the default
`/tmp` directory. Using `/tmp` is not recommended, because the OS may clean it
up and delete state files at any time.
Can you set `state.dir` to a different directory? Does the problem persist in
that case?
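For reference, `state.dir` is a regular Kafka Streams config entry. A minimal sketch of setting it, here via plain `Properties` so the snippet is self-contained (the bootstrap address and the target path are placeholders, not taken from this issue):

```java
import java.util.Properties;

public class StateDirConfig {

    // Builds a Streams config that keeps state outside of /tmp.
    static Properties streamsProps() {
        Properties props = new Properties();
        props.setProperty("application.id", "messages.nobill.campaigns.delay-queue");
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder broker address
        // Point state.dir away from /tmp so OS tmp-file cleanup cannot
        // delete RocksDB files (including the LOCK file) underneath Streams.
        props.setProperty("state.dir", "/var/lib/kafka-streams/node2"); // placeholder path
        return props;
    }

    public static void main(String[] args) {
        System.out.println(streamsProps().getProperty("state.dir"));
    }
}
```

When running two instances on one host, as in this setup, each instance must point `state.dir` at its own directory.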
was (Author: mjsax):
Thanks for reporting this issue. Can you reproduce it with a newer
version of Kafka Streams, like 4.1.1? We have already fixed a couple of
lock-related issues. – Just want to make sure we don't investigate something
that was already fixed.
> Spurious failure to close StateDirectory due to some task directories still
> locked
> ----------------------------------------------------------------------------------
>
> Key: KAFKA-19960
> URL: https://issues.apache.org/jira/browse/KAFKA-19960
> Project: Kafka
> Issue Type: Bug
> Components: streams
> Affects Versions: 3.8.1
> Reporter: Mikael Carlstedt
> Priority: Major
>
> We are seeing random failures to close state directories while closing a
> kafka-streams application in a test environment.
> *Preconditions:*
> * Two state stores
> * Three input partitions
> * Stand-by replication enabled (NB: we have not been able to reproduce
> without stand-by replication)
> * Two instances running on a single host, each with a different state
> directory.
> The application instances are started before each test case is executed, and
> then closed when the test case has completed. Most of the time it works well
> without any errors logged, but sometimes we see this error message when
> closing an application instance:
>
> {noformat}
> 25-12-04T13:01:18.711 ERROR o.a.k.s.p.i.StateDirectory:397 - Some task
> directories still locked while closing state, this indicates unclean
> shutdown: {0_2= StreamsThread threadId:
> messages.nobill.campaigns.delay-queue-78f61f55-4863-4c4f-913c-bd398eceed0e-StreamThread-1
> TaskManager
> MetadataState:
> Tasks:
> , 0_0= StreamsThread threadId:
> messages.nobill.campaigns.delay-queue-78f61f55-4863-4c4f-913c-bd398eceed0e-StreamThread-1
> TaskManager
> MetadataState:
> Tasks:
> }{noformat}
>
> * This has a knock-on effect on all of the following test cases, which fail
> with this error message:
>
> {noformat}
> 25-12-04T13:01:28.684 ERROR s.s.n.k.k.i.KafkaClient:179 - Unhandled exception
> in Kafka streams application messages.nobill.campaigns.delay-queue
> org.apache.kafka.streams.errors.ProcessorStateException: Error opening store
> rate-limiterpriority-chronological-store-0 at location
> /tmp/node2/messages.nobill.campaigns.delay-queue/0_2/rocksdb/rate-limiterpriority-chronological-store-0
> at
> org.apache.kafka.streams.state.internals.RocksDBStore.openRocksDB(RocksDBStore.java:330)
> ...
> Caused by: org.rocksdb.RocksDBException: lock hold by current process,
> acquire time 1764849665 acquiring thread 14274883584:
> /tmp/node2/messages.nobill.campaigns.delay-queue/0_2/rocksdb/rate-limiterpriority-chronological-store-0/LOCK:
> No locks available
> at org.rocksdb.RocksDB.open(Native Method)
> at org.rocksdb.RocksDB.open(RocksDB.java:307)
> at
> org.apache.kafka.streams.state.internals.RocksDBStore.openRocksDB(RocksDBStore.java:324)
> ... 19 common frames omitted{noformat}
> *Observation:*
>
> * Prior to the error, the two stand-by tasks that fail to release their
> locks are "closed and recycled":
> {noformat}
> 25-12-04T13:01:18.080 INFO o.a.k.s.p.i.StandbyTask:149 - stream-thread
> [messages.nobill.campaigns.delay-queue-78f61f55-4863-4c4f-913c-bd398eceed0e-StreamThread-1]
> standby-task [0_2] Suspended running
> 25-12-04T13:01:18.080 DEBUG o.a.k.s.p.i.ProcessorStateManager:633 -
> stream-thread
> [messages.nobill.campaigns.delay-queue-78f61f55-4863-4c4f-913c-bd398eceed0e-StreamThread-1]
> standby-task [0_2] Recycling state for STANDBY task 0_2.
> 25-12-04T13:01:18.080 DEBUG o.a.k.s.p.i.ProcessorStateManager:644 -
> stream-thread
> [messages.nobill.campaigns.delay-queue-78f61f55-4863-4c4f-913c-bd398eceed0e-StreamThread-1]
> standby-task [0_2] Clearing all store caches registered in the state
> manager: ...
> 25-12-04T13:01:18.080 INFO o.a.k.s.p.i.StandbyTask:254 - stream-thread
> [messages.nobill.campaigns.delay-queue-78f61f55-4863-4c4f-913c-bd398eceed0e-StreamThread-1]
> standby-task [0_2] Closed and recycled state{noformat}
> * By comparison, the third task is "closed clean":
> {noformat}
> 25-12-04T13:01:17.024 INFO o.a.k.s.p.i.StandbyTask:149 - stream-thread
> [messages.nobill.campaigns.delay-queue-bff50025-9296-45b7-ab1a-43480aee6f66-StreamThread-1]
> standby-task [0_1] Suspended running
> 25-12-04T13:01:17.024 DEBUG o.a.k.s.p.i.ProcessorStateManager:585 -
> stream-thread
> [messages.nobill.campaigns.delay-queue-bff50025-9296-45b7-ab1a-43480aee6f66-StreamThread-1]
> standby-task [0_1] Closing its state manager and all the registered state
> stores: ...
> 25-12-04T13:01:17.028 DEBUG o.a.k.s.p.i.StateDirectory:377 - stream-thread
> [messages.nobill.campaigns.delay-queue-bff50025-9296-45b7-ab1a-43480aee6f66-StreamThread-1]
> Released state dir lock for task 0_1
> 25-12-04T13:01:17.028 INFO o.a.k.s.p.i.StandbyTask:232 - stream-thread
> [messages.nobill.campaigns.delay-queue-bff50025-9296-45b7-ab1a-43480aee6f66-StreamThread-1]
> standby-task [0_1] Closed clean{noformat}
> What is it that triggers this "recycling" of stand-by tasks?
--
This message was sent by Atlassian Jira
(v8.20.10#820010)