Sophie Blee-Goldman created KAFKA-10563:
-------------------------------------------
Summary: Make sure task directories don't remain locked by dead threads
Key: KAFKA-10563
URL: https://issues.apache.org/jira/browse/KAFKA-10563
Project: Kafka
Issue Type: Bug
Components: streams
Reporter: Sophie Blee-Goldman
Fix For: 2.7.0
Most common/expected exceptions within Streams are handled gracefully, and the
thread will make sure to clean up all resources such as task locks during
shutdown. However, there are some instances where an unexpected exception such
as an IllegalStateException can leave some resources orphaned.

We have seen this happen to task directories after an IllegalStateException is
hit during the TaskManager's rebalance handling logic: the thread shuts down,
but loses track of some tasks before unlocking them. This blocks any further
work on those tasks by any other thread in the same instance.

Previously we decided that this was "ok" because an IllegalStateException means
all bets are off. But with the upcoming work of KIP-663 and KIP-671, users will
be able to react to dying threads and replace them with new ones, making it
more important than ever to ensure that the application can continue with no
lasting repercussions from a thread death. If we allow users to revive/replace
a thread that dies due to an IllegalStateException, the replacement thread
should not be blocked from doing any work by the ghost of its predecessor.

It might be easiest to just add some logic to the cleanup thread to verify all
the existing locks against the list of live threads, and remove any zombie
locks. But we probably want to do this purging more frequently than the cleanup
thread runs (every 10 minutes by default), so maybe we can leverage the work in
KIP-671 and have each thread purge any locks it still owns after the uncaught
exception handler runs, but before the thread dies.
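
To make the two options above concrete, here is a rough sketch of what the
lock bookkeeping could look like. It is only an illustration: the class
TaskLockRegistry and all of its method names are hypothetical and are not
taken from the Streams code base, and it tracks lock owners by thread name
in memory rather than touching any files on disk.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical, simplified stand-in for task-directory lock bookkeeping.
 * Not the actual Kafka Streams implementation.
 */
final class TaskLockRegistry {
    // task id (for example "0_1") -> name of the thread holding the lock
    private final Map<String, String> lockOwners = new ConcurrentHashMap<>();

    /** Returns true if threadName now holds (or already held) the lock. */
    boolean tryLock(String taskId, String threadName) {
        String owner = lockOwners.putIfAbsent(taskId, threadName);
        return owner == null || owner.equals(threadName);
    }

    /** Releases the lock only if threadName is the current owner. */
    void unlock(String taskId, String threadName) {
        lockOwners.remove(taskId, threadName);
    }

    /**
     * Option 1: called periodically (e.g. by a cleanup thread). Removes
     * every lock whose owner is not in the set of live thread names and
     * returns how many locks were removed.
     */
    int purgeZombieLocks(Set<String> liveThreadNames) {
        int removed = 0;
        for (Map.Entry<String, String> e : lockOwners.entrySet()) {
            if (!liveThreadNames.contains(e.getValue())
                    && lockOwners.remove(e.getKey(), e.getValue())) {
                removed++;
            }
        }
        return removed;
    }

    /**
     * Option 2: called by a dying thread as its last step, after any
     * exception handler has run. Releases every lock the thread still
     * owns and returns how many locks were released.
     */
    int releaseAllOwnedBy(String threadName) {
        int released = 0;
        for (Map.Entry<String, String> e : lockOwners.entrySet()) {
            if (threadName.equals(e.getValue())
                    && lockOwners.remove(e.getKey(), e.getValue())) {
                released++;
            }
        }
        return released;
    }
}
```

With something like this, a thread could call releaseAllOwnedBy with its own
name from a finally block at the very end of its run loop, while the cleanup
thread calls purgeZombieLocks as a slower safety net for anything that was
still missed.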
--
This message was sent by Atlassian Jira
(v8.3.4#803005)