Mikael Carlstedt created KAFKA-19960:
----------------------------------------
Summary: Spurious failure to close StateDirectory due to some task
directories still locked
Key: KAFKA-19960
URL: https://issues.apache.org/jira/browse/KAFKA-19960
Project: Kafka
Issue Type: Bug
Components: streams
Affects Versions: 3.8.1
Reporter: Mikael Carlstedt
We are seeing random failures to close state directories while closing a
kafka-streams application in a test environment.
*Preconditions:*
* Two state stores
* Three input partitions
* Stand-by replication enabled (NB: we have not been able to reproduce without
stand-by replication)
* Two instances running on a single host with different state directory.
The application instances are started before each test case is executed, and
then closed when the test case has completed. Most of the time it works well
without any errors logged, but sometimes we see this error message when closing
an application instance:
{noformat}
25-12-04T13:01:18.711 ERROR o.a.k.s.p.i.StateDirectory:397 - Some task
directories still locked while closing state, this indicates unclean shutdown:
{0_2= StreamsThread threadId:
messages.nobill.campaigns.delay-queue-78f61f55-4863-4c4f-913c-bd398eceed0e-StreamThread-1
TaskManager
MetadataState:
Tasks:
, 0_0= StreamsThread threadId:
messages.nobill.campaigns.delay-queue-78f61f55-4863-4c4f-913c-bd398eceed0e-StreamThread-1
TaskManager
MetadataState:
Tasks:
}{noformat}
* is has a knock-on effect on all of the following test cases, which fail with
this error message:
{noformat}
25-12-04T13:01:28.684 ERROR s.s.n.k.k.i.KafkaClient:179 - Unhandled exception
in Kafka streams application messages.nobill.campaigns.delay-queue
org.apache.kafka.streams.errors.ProcessorStateException: Error opening store
rate-limiterpriority-chronological-store-0 at location
/tmp/node2/messages.nobill.campaigns.delay-queue/0_2/rocksdb/rate-limiterpriority-chronological-store-0
at
org.apache.kafka.streams.state.internals.RocksDBStore.openRocksDB(RocksDBStore.java:330)
...
Caused by: org.rocksdb.RocksDBException: lock hold by current process, acquire
time 1764849665 acquiring thread 14274883584:
/tmp/node2/messages.nobill.campaigns.delay-queue/0_2/rocksdb/rate-limiterpriority-chronological-store-0/LOCK:
No locks available
at org.rocksdb.RocksDB.open(Native Method)
at org.rocksdb.RocksDB.open(RocksDB.java:307)
at
org.apache.kafka.streams.state.internals.RocksDBStore.openRocksDB(RocksDBStore.java:324)
... 19 common frames omitted{noformat}
*Observation:*
* Prior to the error, the two stand-by tasks that fail to release their locks
are "closed and recycled":
{noformat}
25-12-04T13:01:18.080 INFO o.a.k.s.p.i.StandbyTask:149 - stream-thread
[messages.nobill.campaigns.delay-queue-78f61f55-4863-4c4f-913c-bd398eceed0e-StreamThread-1]
standby-task [0_2] Suspended running
25-12-04T13:01:18.080 DEBUG o.a.k.s.p.i.ProcessorStateManager:633 -
stream-thread
[messages.nobill.campaigns.delay-queue-78f61f55-4863-4c4f-913c-bd398eceed0e-StreamThread-1]
standby-task [0_2] Recycling state for STANDBY task 0_2.
25-12-04T13:01:18.080 DEBUG o.a.k.s.p.i.ProcessorStateManager:644 -
stream-thread
[messages.nobill.campaigns.delay-queue-78f61f55-4863-4c4f-913c-bd398eceed0e-StreamThread-1]
standby-task [0_2] Clearing all store caches registered in the state manager:
...
25-12-04T13:01:18.080 INFO o.a.k.s.p.i.StandbyTask:254 - stream-thread
[messages.nobill.campaigns.delay-queue-78f61f55-4863-4c4f-913c-bd398eceed0e-StreamThread-1]
standby-task [0_2] Closed and recycled state{noformat}
* By comparison, the third task is "closed clean":
{noformat}
25-12-04T13:01:17.024 INFO o.a.k.s.p.i.StandbyTask:149 - stream-thread
[messages.nobill.campaigns.delay-queue-bff50025-9296-45b7-ab1a-43480aee6f66-StreamThread-1]
standby-task [0_1] Suspended running
25-12-04T13:01:17.024 DEBUG o.a.k.s.p.i.ProcessorStateManager:585 -
stream-thread
[messages.nobill.campaigns.delay-queue-bff50025-9296-45b7-ab1a-43480aee6f66-StreamThread-1]
standby-task [0_1] Closing its state manager and all the registered state
stores: ...
25-12-04T13:01:17.028 DEBUG o.a.k.s.p.i.StateDirectory:377 - stream-thread
[messages.nobill.campaigns.delay-queue-bff50025-9296-45b7-ab1a-43480aee6f66-StreamThread-1]
Released state dir lock for task 0_1
25-12-04T13:01:17.028 INFO o.a.k.s.p.i.StandbyTask:232 - stream-thread
[messages.nobill.campaigns.delay-queue-bff50025-9296-45b7-ab1a-43480aee6f66-StreamThread-1]
standby-task [0_1] Closed clean{noformat}
What is it that triggers this "recycling" of stand-by tasks?
--
This message was sent by Atlassian Jira
(v8.20.10#820010)