Mikael Carlstedt created KAFKA-19960:
----------------------------------------

             Summary: Spurious failure to close StateDirectory due to some task 
directories still locked
                 Key: KAFKA-19960
                 URL: https://issues.apache.org/jira/browse/KAFKA-19960
             Project: Kafka
          Issue Type: Bug
          Components: streams
    Affects Versions: 3.8.1
            Reporter: Mikael Carlstedt


We are seeing random failures to close state directories while closing a 
kafka-streams application in a test environment.

*Preconditions:*
 * Two state stores
 * Three input partitions
 * Stand-by replication enabled (NB: we have not been able to reproduce without 
stand-by replication)
 * Two instances running on a single host with different state directory.

The application instances are started before each test case is executed, and 
then closed when the test case has completed. Most of the time it works well 
without any errors logged, but sometimes we see this error message when closing 
an application instance:

 
{noformat}
25-12-04T13:01:18.711 ERROR o.a.k.s.p.i.StateDirectory:397 - Some task 
directories still locked while closing state, this indicates unclean shutdown: 
{0_2=     StreamsThread threadId: 
messages.nobill.campaigns.delay-queue-78f61f55-4863-4c4f-913c-bd398eceed0e-StreamThread-1
TaskManager
        MetadataState:
        Tasks:
, 0_0=  StreamsThread threadId: 
messages.nobill.campaigns.delay-queue-78f61f55-4863-4c4f-913c-bd398eceed0e-StreamThread-1
TaskManager
        MetadataState:
        Tasks:
}{noformat}
 
 * is has a knock-on effect on all of the following test cases, which fail with 
this error message:

 
{noformat}
25-12-04T13:01:28.684 ERROR s.s.n.k.k.i.KafkaClient:179 - Unhandled exception 
in Kafka streams application messages.nobill.campaigns.delay-queue
org.apache.kafka.streams.errors.ProcessorStateException: Error opening store 
rate-limiterpriority-chronological-store-0 at location 
/tmp/node2/messages.nobill.campaigns.delay-queue/0_2/rocksdb/rate-limiterpriority-chronological-store-0
        at 
org.apache.kafka.streams.state.internals.RocksDBStore.openRocksDB(RocksDBStore.java:330)

...

Caused by: org.rocksdb.RocksDBException: lock hold by current process, acquire 
time 1764849665 acquiring thread 14274883584: 
/tmp/node2/messages.nobill.campaigns.delay-queue/0_2/rocksdb/rate-limiterpriority-chronological-store-0/LOCK:
 No locks available
        at org.rocksdb.RocksDB.open(Native Method)
        at org.rocksdb.RocksDB.open(RocksDB.java:307)
        at 
org.apache.kafka.streams.state.internals.RocksDBStore.openRocksDB(RocksDBStore.java:324)
        ... 19 common frames omitted{noformat}
*Observation:*

 
 * Prior to the error, the two stand-by tasks that fail to release their locks 
are "closed and recycled":

{noformat}
25-12-04T13:01:18.080  INFO o.a.k.s.p.i.StandbyTask:149 - stream-thread 
[messages.nobill.campaigns.delay-queue-78f61f55-4863-4c4f-913c-bd398eceed0e-StreamThread-1]
 standby-task [0_2] Suspended running
25-12-04T13:01:18.080 DEBUG o.a.k.s.p.i.ProcessorStateManager:633 - 
stream-thread 
[messages.nobill.campaigns.delay-queue-78f61f55-4863-4c4f-913c-bd398eceed0e-StreamThread-1]
 standby-task [0_2] Recycling state for STANDBY task 0_2.
25-12-04T13:01:18.080 DEBUG o.a.k.s.p.i.ProcessorStateManager:644 - 
stream-thread 
[messages.nobill.campaigns.delay-queue-78f61f55-4863-4c4f-913c-bd398eceed0e-StreamThread-1]
 standby-task [0_2] Clearing all store caches registered in the state manager: 
...
25-12-04T13:01:18.080  INFO o.a.k.s.p.i.StandbyTask:254 - stream-thread 
[messages.nobill.campaigns.delay-queue-78f61f55-4863-4c4f-913c-bd398eceed0e-StreamThread-1]
 standby-task [0_2] Closed and recycled state{noformat}

 * By comparison, the third task is "closed clean":

{noformat}
25-12-04T13:01:17.024  INFO o.a.k.s.p.i.StandbyTask:149 - stream-thread 
[messages.nobill.campaigns.delay-queue-bff50025-9296-45b7-ab1a-43480aee6f66-StreamThread-1]
 standby-task [0_1] Suspended running
25-12-04T13:01:17.024 DEBUG o.a.k.s.p.i.ProcessorStateManager:585 - 
stream-thread 
[messages.nobill.campaigns.delay-queue-bff50025-9296-45b7-ab1a-43480aee6f66-StreamThread-1]
 standby-task [0_1] Closing its state manager and all the registered state 
stores: ...
25-12-04T13:01:17.028 DEBUG o.a.k.s.p.i.StateDirectory:377 - stream-thread 
[messages.nobill.campaigns.delay-queue-bff50025-9296-45b7-ab1a-43480aee6f66-StreamThread-1]
 Released state dir lock for task 0_1
25-12-04T13:01:17.028  INFO o.a.k.s.p.i.StandbyTask:232 - stream-thread 
[messages.nobill.campaigns.delay-queue-bff50025-9296-45b7-ab1a-43480aee6f66-StreamThread-1]
 standby-task [0_1] Closed clean{noformat}

What is it that triggers this "recycling" of stand-by tasks?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to