[ 
https://issues.apache.org/jira/browse/KAFKA-8831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16916186#comment-16916186
 ] 

Sophie Blee-Goldman commented on KAFKA-8831:
--------------------------------------------

Ah, is this the one that's logged at TRACE because it would supposedly spam 
the logs? So the problem was that you had two non-isolated instances set to 
the same state dir? That makes sense, since one of them would grab the lock 
and the other would be stuck retrying forever. I think Streams tends to assume 
only one instance per machine, so we don't even consider the deadlock of two 
instances trying for the same file lock. 

We definitely shouldn't just retry infinitely while logging only at the lowest 
level. Someone with more context ([~guozhang]) will have to chime in as to why 
we expect to see this LockException so often that it would spam the logs and 
necessitate indefinite retries. At the very least, we should log at a higher 
level once we are retrying for the Nth time for some large N (again, I don't 
have enough context to know what a reasonable value would be). Personally, I 
feel we should rethrow the exception once we have retried past some threshold 
rather than spin in deadlock.
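
To make the threshold idea concrete, here is a rough sketch of what I have in 
mind. This is not the actual Streams code: the class name BoundedLockRetry, 
the acquire method, and the warnAfter / maxAttempts / backoffMs parameters are 
all made up for illustration, the lock attempt is abstracted as a 
BooleanSupplier, and it uses java.util.logging only to stay self-contained 
(Streams itself logs through slf4j). The thresholds are placeholders, since I 
don't know what reasonable values would be.

```java
import java.util.function.BooleanSupplier;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical helper for illustration only; not part of Kafka Streams. */
final class BoundedLockRetry {
    private static final Logger LOG = Logger.getLogger(BoundedLockRetry.class.getName());

    private BoundedLockRetry() { }

    /**
     * Calls tryLock until it succeeds or maxAttempts is reached.
     * Failures before warnAfter are logged at the lowest level; from
     * warnAfter onwards they are logged as warnings.
     *
     * @return the number of attempts that were needed
     * @throws IllegalStateException if the lock was never acquired
     */
    static int acquire(BooleanSupplier tryLock, int warnAfter, int maxAttempts, long backoffMs) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (tryLock.getAsBoolean()) {
                return attempt;
            }
            // Escalate the log level once we have failed often enough.
            Level level = attempt >= warnAfter ? Level.WARNING : Level.FINEST;
            LOG.log(level, "Could not acquire state dir lock (attempt {0} of {1})",
                    new Object[] {attempt, maxAttempts});
            try {
                Thread.sleep(backoffMs);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and stop retrying.
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for state dir lock", e);
            }
        }
        // Give up instead of spinning forever.
        throw new IllegalStateException(
                "Failed to acquire state dir lock after " + maxAttempts + " attempts");
    }
}
```

With something like this, a second instance pointed at the same state dir 
would fail with a visible error after maxAttempts tries instead of silently 
retrying at trace level.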

> Joining a new instance sometimes does not cause rebalancing
> -----------------------------------------------------------
>
>                 Key: KAFKA-8831
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8831
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>            Reporter: Chris Pettitt
>            Assignee: Chris Pettitt
>            Priority: Major
>         Attachments: StandbyTaskTest.java, fail.log
>
>
> See attached log. The application is in a REBALANCING state. The second 
> instance joins a bit after the first instance (~250ms). The group coordinator 
> says it is going to rebalance but nothing happens. The first instance gets 
> all partitions (2). The application transitions to RUNNING.
> See attached test, which starts one client and then starts another about 
> 250ms later. This seems to consistently repro the issue for me.
> This is blocking my work on KAFKA-8755, so I'm inclined to pick it up.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
