[ https://issues.apache.org/jira/browse/KAFKA-10563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217736#comment-17217736 ]
Bill Bejeck commented on KAFKA-10563:
-------------------------------------

[~ableegoldman] I'm going to set the fix version field to 2.8 since this isn't a blocker and it looks like KIP-671 and KIP-663 will not make the release.

> Make sure task directories don't remain locked by dead threads
> --------------------------------------------------------------
>
>                 Key: KAFKA-10563
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10563
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>            Reporter: A. Sophie Blee-Goldman
>            Priority: Major
>             Fix For: 2.7.0
>
>
> Most common/expected exceptions within Streams are handled gracefully, and
> the thread will make sure to clean up all resources such as task locks during
> shutdown. However, there are some instances where an unexpected exception
> such as an IllegalStateException can leave some resources orphaned.
> We have seen this happen to task directories after an IllegalStateException
> is hit during the TaskManager's rebalance handling logic – the Thread shuts
> down, but loses track of some tasks before unlocking them. This blocks any
> further work on that task by any other thread in the same instance.
> Previously we decided that this was "ok" because an IllegalStateException
> means all bets are off. But with the upcoming work of KIP-663 and KIP-671,
> users will be able to react smartly on dying threads and replace them with
> new ones, making it more important than ever to ensure that the application
> can continue on with no lasting repercussions of a thread death. If we allow
> users to revive/replace a thread that dies due to IllegalStateException, that
> thread should not be blocked from doing any work by the ghost of its
> predecessor.
> It might be easiest to just add some logic to the cleanup thread to verify
> all the existing locks against the list of live threads, and remove any
> zombie locks. But we probably want to do this purging more frequently than
> the cleanup thread runs (10min by default) – so maybe we can leverage the
> work in KIP-671 and have each thread purge any locks still owned by it after
> the uncaught exception handler runs, but before the thread dies.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
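
For illustration only, the two purging strategies floated in the issue description could be sketched roughly as follows. This is not Kafka Streams code: the class TaskLockRegistry and all of its methods are hypothetical, thread names stand in for thread identity, and an in-memory map stands in for the on-disk task directory locks.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical lock registry; an assumption for illustration, not the real Streams implementation. */
public class TaskLockRegistry {
    // task id -> name of the thread that currently owns the lock
    private final Map<String, String> lockOwners = new ConcurrentHashMap<>();

    /** Returns true if threadName now holds (or already held) the lock for taskId. */
    public boolean lock(String taskId, String threadName) {
        String owner = lockOwners.putIfAbsent(taskId, threadName);
        return owner == null || owner.equals(threadName);
    }

    /** Releases the lock only if threadName is the current owner. */
    public void unlock(String taskId, String threadName) {
        lockOwners.remove(taskId, threadName);
    }

    /**
     * Strategy 1 (periodic cleanup): compare every lock against the set of
     * live threads and release any lock whose owner is no longer alive.
     * Returns the number of zombie locks released.
     */
    public int purgeZombieLocks(Set<String> liveThreadNames) {
        int released = 0;
        for (Map.Entry<String, String> e : lockOwners.entrySet()) {
            if (!liveThreadNames.contains(e.getValue())
                    && lockOwners.remove(e.getKey(), e.getValue())) {
                released++;
            }
        }
        return released;
    }

    /**
     * Strategy 2 (on thread death): the dying thread releases everything it
     * still owns, e.g. after its uncaught exception handler has run.
     * Returns the number of locks released.
     */
    public int releaseAllOwnedBy(String threadName) {
        int released = 0;
        for (Map.Entry<String, String> e : lockOwners.entrySet()) {
            if (threadName.equals(e.getValue())
                    && lockOwners.remove(e.getKey(), threadName)) {
                released++;
            }
        }
        return released;
    }
}
```

In this sketch, strategy 1 only takes effect as often as the cleanup pass runs, whereas strategy 2 frees the locks as soon as the owning thread goes down, which is the trade-off the description points at.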