philipnee opened a new pull request, #13550: URL: https://github.com/apache/kafka/pull/13550
This is a long story, but the incident started in KAFKA-13419, where we observed a member reporting a topic partition it owned in the previous generation after it missed a rebalance cycle due to REBALANCE_IN_PROGRESS. Ideally, the member should keep holding onto its partition as long as there is no other owner with a younger generation; however, we need to be defensive here because we cannot be sure whether the partition has already been assigned to another member. Therefore, it is safest to honor only the members on the highest generation and the previous generation during the assignment phase.

In this PR, I made 2 major changes:

1. In the assignor: we now honor a partition owner that is on generation max - 1, as long as no other owner of that partition has a younger generation (younger = higher generationId).
2. After getting a REBALANCE_IN_PROGRESS sync group error, we immediately reset the member's generation, so that all of its owned partitions are guaranteed to be declared lost if the member does not rejoin in a timely manner.
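The assignor rule described above can be sketched for a single partition as follows. This is only an illustration under stated assumptions, not the code in this PR: `OwnerSelection`, `pickOwner`, and the `claims` map are hypothetical names, and treating two claimants in the same generation as a conflict (honoring neither) is an assumption made for the sketch.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical helper for illustration only; not the actual Kafka assignor code.
final class OwnerSelection {

    /**
     * Picks the member whose ownership claim on one partition should be honored.
     *
     * @param claims             memberId -> generation in which that member claims to own the partition
     * @param groupMaxGeneration highest generation reported by any member of the group
     * @return the honored owner, or empty if no claim can be trusted
     */
    static Optional<String> pickOwner(Map<String, Integer> claims, int groupMaxGeneration) {
        String best = null;
        int bestGeneration = Integer.MIN_VALUE;
        boolean tie = false;

        for (Map.Entry<String, Integer> claim : claims.entrySet()) {
            int generation = claim.getValue();
            // Only claims from the max or max - 1 generation are considered.
            if (generation < groupMaxGeneration - 1) {
                continue;
            }
            if (generation > bestGeneration) {
                // A younger (higher generationId) claim overrides an older one.
                best = claim.getKey();
                bestGeneration = generation;
                tie = false;
            } else if (generation == bestGeneration) {
                // Assumption: two claimants in the same generation is a conflict.
                tie = true;
            }
        }
        return (best == null || tie) ? Optional.empty() : Optional.of(best);
    }

    public static void main(String[] args) {
        // Member B missed a rebalance and is still on generation 4; the group is on 5.
        System.out.println(pickOwner(Map.of("B", 4), 5));
        // Member A already claims the same partition in generation 5, so A wins.
        System.out.println(pickOwner(Map.of("A", 5, "B", 4), 5));
    }
}
```

In this sketch a member that is one generation behind keeps its partition only when nobody on a higher generation claims it, and a claim that is two or more generations behind is ignored entirely.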