[ https://issues.apache.org/jira/browse/KAFKA-8831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16916186#comment-16916186 ]
Sophie Blee-Goldman commented on KAFKA-8831:
--------------------------------------------

Ah, is this the one that's logged at trace because it will supposedly spam the logs? So the problem was that you had two non-isolated instances set to the same state dir? That seems to make sense, since one of them would grab the lock and the other would be stuck retrying forever.

I think Streams tends to assume only one instance per machine, so we don't even consider the deadlock of two instances trying for the same file lock. We definitely shouldn't just retry infinitely while logging only at the lowest level. Someone with more context ([~guozhang]) will have to chime in as to why we expect to see this LockException so often that it would spam the logs and necessitate indefinite retries.

At the very least, we should log at a higher level if we are retrying for the Nth time for some large N (again, I don't have enough context to know what a reasonable value would be). Personally, I feel we should just rethrow the exception once we have retried past some threshold rather than spin in deadlock.

> Joining a new instance sometimes does not cause rebalancing
> -----------------------------------------------------------
>
>                 Key: KAFKA-8831
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8831
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>            Reporter: Chris Pettitt
>            Assignee: Chris Pettitt
>            Priority: Major
>         Attachments: StandbyTaskTest.java, fail.log
>
>
> See attached log. The application is in a REBALANCING state. The second
> instance joins a bit after the first instance (~250ms). The group coordinator
> says it is going to rebalance but nothing happens. The first instance gets
> all partitions (2). The application transitions to RUNNING.
> See attached test, which starts one client and then starts another about
> 250ms later. This seems to consistently repro the issue for me.
> This is blocking my work on KAFKA-8755, so I'm inclined to pick it up



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
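
For illustration, the suggestion in the comment above (log at a higher level after the Nth retry, then rethrow once a threshold is passed) might be sketched roughly as follows. This is not the actual Streams code: the class name, the exception type, the `warnAfter` and `maxAttempts` parameters, and the list-backed "logger" are all hypothetical stand-ins, and the threshold values are arbitrary since the comment itself leaves a reasonable N open.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

public class BoundedLockRetry {

    /** Hypothetical stand-in for a lock failure; not Kafka's own exception class. */
    static class LockAcquisitionException extends RuntimeException {
        LockAcquisitionException(String message) {
            super(message);
        }
    }

    // Log lines are collected in a list so the sketch needs no logging library.
    static final List<String> LOG = new ArrayList<>();

    /**
     * Tries to take the lock up to maxAttempts times.
     * Failures before warnAfter are recorded at TRACE, later ones at WARN.
     * Returns the attempt number that succeeded, or throws once the limit is hit.
     */
    static int acquireWithRetries(BooleanSupplier tryLock, int warnAfter, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (tryLock.getAsBoolean()) {
                return attempt;
            }
            // Escalate the log level once the failure count stops looking transient.
            String level = attempt >= warnAfter ? "WARN" : "TRACE";
            LOG.add(level + " failed to lock state directory, attempt " + attempt);
        }
        // Surface the failure instead of spinning forever.
        throw new LockAcquisitionException(
            "could not lock state directory after " + maxAttempts + " attempts");
    }

    public static void main(String[] args) {
        // Simulates a lock held indefinitely by another instance on the same state dir.
        try {
            acquireWithRetries(() -> false, 3, 5);
        } catch (LockAcquisitionException e) {
            LOG.add("ERROR " + e.getMessage());
        }
        LOG.forEach(System.out::println);
    }
}
```

A real implementation would presumably also wait between attempts; the sketch leaves that out to keep the escalation and rethrow logic visible.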