[
https://issues.apache.org/jira/browse/KAFKA-19960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18045314#comment-18045314
]
Matthias J. Sax commented on KAFKA-19960:
-----------------------------------------
Thanks. The timeframe of the logs is rather short, but I'll spend some time
studying it.
In parallel, we identified another bug
(https://issues.apache.org/jira/browse/KAFKA-19994) which seems to have similar
symptoms, i.e., it eventually results in a LOCK exception from RocksDB. – The
root cause of K19994 is a `TimeoutException` from the clients, which Kafka
Streams retries until it hits `task.timeout.ms` and gives up, but without doing
a proper resource cleanup.
Not sure if it's the same thing you are hitting or not. K19994 should be
independent of standby tasks, though. Do you see any other exception before you
hit the LOCK exception from RocksDB?
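The K19994 failure mode described above can be sketched roughly as follows. This is a hypothetical illustration (all names are invented, this is not the actual Kafka Streams code): a client call keeps timing out, Streams retries until `task.timeout.ms` is exceeded and then gives up, but the failure path never releases the task's state-directory lock.

```java
// Hypothetical sketch of the K19994 failure mode (all names invented):
// a client call keeps throwing TimeoutException, Kafka Streams retries
// until task.timeout.ms is exceeded and then gives up -- but on that
// failure path the task's state-directory lock is never released.
import java.util.concurrent.locks.ReentrantLock;

public class TaskTimeoutSketch {
    static final ReentrantLock stateDirLock = new ReentrantLock();

    // Returns true if the lock was released; `cleanUpOnFailure` marks the
    // cleanup step that K19994 reportedly skips.
    static boolean runTask(boolean cleanUpOnFailure) {
        stateDirLock.lock(); // taken when the task is assigned
        try {
            // stand-in for: retry the client call until task.timeout.ms elapses
            throw new RuntimeException("TimeoutException: task.timeout.ms exceeded");
        } catch (RuntimeException timeout) {
            if (cleanUpOnFailure) {
                stateDirLock.unlock(); // the missing resource cleanup
                return true;
            }
            return false; // bug path: lock stays held
        }
    }

    public static void main(String[] args) {
        boolean released = runTask(false);
        // prints "lock released: false, still locked: true"
        System.out.println("lock released: " + released
                + ", still locked: " + stateDirLock.isLocked());
    }
}
```

With the lock still held, the next owner of that task directory hits exactly the kind of RocksDB LOCK error shown in the stack trace below.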
> Spurious failure to close StateDirectory due to some task directories still
> locked
> ----------------------------------------------------------------------------------
>
> Key: KAFKA-19960
> URL: https://issues.apache.org/jira/browse/KAFKA-19960
> Project: Kafka
> Issue Type: Bug
> Components: streams
> Affects Versions: 3.8.1
> Reporter: Mikael Carlstedt
> Priority: Critical
> Attachments: debug-1.log, debug.log
>
>
> We are seeing random failures to close state directories when closing a
> Kafka Streams application in a test environment.
> *Preconditions:*
> * Two state stores
> * Three input partitions
> * Stand-by replication enabled (NB: we have not been able to reproduce
> without stand-by replication)
> * Two instances running on a single host with different state directories.
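> The preconditions above correspond roughly to a configuration like the
> following sketch (the application id, bootstrap address, and paths are
> illustrative placeholders, not the values from the logs):

```java
// Illustrative Kafka Streams configuration matching the preconditions above:
// stand-by replication enabled, and two instances on one host, each with its
// own state directory. Values are placeholders, not the ones from the logs.
import java.util.Properties;

public class RepoConfigSketch {
    static Properties instanceConfig(String stateDir) {
        Properties p = new Properties();
        p.setProperty("application.id", "delay-queue-test");  // placeholder
        p.setProperty("bootstrap.servers", "localhost:9092"); // placeholder
        p.setProperty("num.standby.replicas", "1");  // stand-by replication on
        p.setProperty("state.dir", stateDir);        // distinct per instance
        return p;
    }

    public static void main(String[] args) {
        Properties node1 = instanceConfig("/tmp/node1");
        Properties node2 = instanceConfig("/tmp/node2");
        System.out.println(node1.getProperty("state.dir")); // prints "/tmp/node1"
        System.out.println(node2.getProperty("state.dir")); // prints "/tmp/node2"
    }
}
```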
> The application instances are started before each test case is executed, and
> then closed when the test case has completed. Most of the time it works well
> without any errors logged, but sometimes we see this error message when
> closing an application instance:
>
> {noformat}
> 25-12-04T13:01:18.711 ERROR o.a.k.s.p.i.StateDirectory:397 - Some task
> directories still locked while closing state, this indicates unclean
> shutdown: {0_2= StreamsThread threadId:
> messages.nobill.campaigns.delay-queue-78f61f55-4863-4c4f-913c-bd398eceed0e-StreamThread-1
> TaskManager
> MetadataState:
> Tasks:
> , 0_0= StreamsThread threadId:
> messages.nobill.campaigns.delay-queue-78f61f55-4863-4c4f-913c-bd398eceed0e-StreamThread-1
> TaskManager
> MetadataState:
> Tasks:
> }{noformat}
>
> * This has a knock-on effect on all of the following test cases, which fail
> with this error message:
>
> {noformat}
> 25-12-04T13:01:28.684 ERROR s.s.n.k.k.i.KafkaClient:179 - Unhandled exception
> in Kafka streams application messages.nobill.campaigns.delay-queue
> org.apache.kafka.streams.errors.ProcessorStateException: Error opening store
> rate-limiterpriority-chronological-store-0 at location
> /tmp/node2/messages.nobill.campaigns.delay-queue/0_2/rocksdb/rate-limiterpriority-chronological-store-0
> at
> org.apache.kafka.streams.state.internals.RocksDBStore.openRocksDB(RocksDBStore.java:330)
> ...
> Caused by: org.rocksdb.RocksDBException: lock hold by current process,
> acquire time 1764849665 acquiring thread 14274883584:
> /tmp/node2/messages.nobill.campaigns.delay-queue/0_2/rocksdb/rate-limiterpriority-chronological-store-0/LOCK:
> No locks available
> at org.rocksdb.RocksDB.open(Native Method)
> at org.rocksdb.RocksDB.open(RocksDB.java:307)
> at
> org.apache.kafka.streams.state.internals.RocksDBStore.openRocksDB(RocksDBStore.java:324)
> ... 19 common frames omitted{noformat}
> *Observation:*
>
> * Prior to the error, the two stand-by tasks that fail to release their
> locks are "closed and recycled":
> {noformat}
> 25-12-04T13:01:18.080 INFO o.a.k.s.p.i.StandbyTask:149 - stream-thread
> [messages.nobill.campaigns.delay-queue-78f61f55-4863-4c4f-913c-bd398eceed0e-StreamThread-1]
> standby-task [0_2] Suspended running
> 25-12-04T13:01:18.080 DEBUG o.a.k.s.p.i.ProcessorStateManager:633 -
> stream-thread
> [messages.nobill.campaigns.delay-queue-78f61f55-4863-4c4f-913c-bd398eceed0e-StreamThread-1]
> standby-task [0_2] Recycling state for STANDBY task 0_2.
> 25-12-04T13:01:18.080 DEBUG o.a.k.s.p.i.ProcessorStateManager:644 -
> stream-thread
> [messages.nobill.campaigns.delay-queue-78f61f55-4863-4c4f-913c-bd398eceed0e-StreamThread-1]
> standby-task [0_2] Clearing all store caches registered in the state
> manager: ...
> 25-12-04T13:01:18.080 INFO o.a.k.s.p.i.StandbyTask:254 - stream-thread
> [messages.nobill.campaigns.delay-queue-78f61f55-4863-4c4f-913c-bd398eceed0e-StreamThread-1]
> standby-task [0_2] Closed and recycled state{noformat}
> * By comparison, the third task is "closed clean":
> {noformat}
> 25-12-04T13:01:17.024 INFO o.a.k.s.p.i.StandbyTask:149 - stream-thread
> [messages.nobill.campaigns.delay-queue-bff50025-9296-45b7-ab1a-43480aee6f66-StreamThread-1]
> standby-task [0_1] Suspended running
> 25-12-04T13:01:17.024 DEBUG o.a.k.s.p.i.ProcessorStateManager:585 -
> stream-thread
> [messages.nobill.campaigns.delay-queue-bff50025-9296-45b7-ab1a-43480aee6f66-StreamThread-1]
> standby-task [0_1] Closing its state manager and all the registered state
> stores: ...
> 25-12-04T13:01:17.028 DEBUG o.a.k.s.p.i.StateDirectory:377 - stream-thread
> [messages.nobill.campaigns.delay-queue-bff50025-9296-45b7-ab1a-43480aee6f66-StreamThread-1]
> Released state dir lock for task 0_1
> 25-12-04T13:01:17.028 INFO o.a.k.s.p.i.StandbyTask:232 - stream-thread
> [messages.nobill.campaigns.delay-queue-bff50025-9296-45b7-ab1a-43480aee6f66-StreamThread-1]
> standby-task [0_1] Closed clean{noformat}
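> The difference between the two close paths can be modeled conceptually like
> this (this is NOT the real StateDirectory code, just a sketch of the observed
> behavior): "closed clean" releases the task directory lock, while "closed and
> recycled" appears to keep it held for a successor task, so a shutdown that
> happens before the successor releases it finds the directory still locked.

```java
// Conceptual model (not the actual StateDirectory code) of the two close
// paths seen in the logs: closeClean releases the task-directory lock,
// while closeAndRecycle keeps it held for the successor task. At shutdown,
// any still-locked directories trigger the "Some task directories still
// locked while closing state" error at the top of this report.
import java.util.HashSet;
import java.util.Set;

public class ClosePathSketch {
    final Set<String> lockedTaskDirs = new HashSet<>();

    void lock(String taskId)            { lockedTaskDirs.add(taskId); }
    void closeClean(String taskId)      { lockedTaskDirs.remove(taskId); }
    void closeAndRecycle(String taskId) { /* lock kept for the successor */ }

    Set<String> stillLockedAtShutdown() { return lockedTaskDirs; }

    public static void main(String[] args) {
        ClosePathSketch dirs = new ClosePathSketch();
        dirs.lock("0_0"); dirs.lock("0_1"); dirs.lock("0_2");
        dirs.closeClean("0_1");      // like standby 0_1 in the logs above
        dirs.closeAndRecycle("0_0"); // like standbys 0_0 and 0_2
        dirs.closeAndRecycle("0_2");
        System.out.println(dirs.stillLockedAtShutdown()); // 0_0 and 0_2 remain
    }
}
```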
> What is it that triggers this "recycling" of stand-by tasks?
--
This message was sent by Atlassian Jira
(v8.20.10#820010)