[
https://issues.apache.org/jira/browse/KAFKA-13419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17624866#comment-17624866
]
A. Sophie Blee-Goldman commented on KAFKA-13419:
------------------------------------------------
{quote} can we just treat the ownedPartition from a previous generation as legal if
no other member claims the same partition?
{quote}
Huh, I thought that's already what the cooperative assignor does? Maybe we
intentionally left it out of (or took it out of) the constrained case algorithm
for some reason? Or possibly we just discussed doing this and never did. Either
way, I definitely remember this specific handling logic coming up during the
recent(ish) optimizations that I worked on with [~showuon].
I'll check out the code, I guess.
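
For reference, a minimal sketch of the rule being discussed, so it is clear what behavior to look for. This is not the actual assignor code: the class OwnedPartitionSketch, the Claim type, and the accepted method are hypothetical names used only for illustration. It assumes each member reports a generation plus the partitions it believes it owns, and it accepts a claim only when no other member claims the same partition with an equal or newer generation.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model for illustration; not the real cooperative sticky assignor.
final class OwnedPartitionSketch {

    // What one member reports when joining: its generation and its owned partitions.
    static final class Claim {
        final int generation;
        final Set<String> partitions;

        Claim(int generation, Collection<String> partitions) {
            this.generation = generation;
            this.partitions = new HashSet<>(partitions);
        }
    }

    // A member's claim on a partition is accepted if the member actually reports
    // it and no other member claims it with an equal or newer generation. A stale
    // but uncontested claim is therefore still treated as legal.
    static boolean accepted(String member, String partition, Map<String, Claim> claims) {
        Claim mine = claims.get(member);
        if (mine == null || !mine.partitions.contains(partition)) {
            return false;
        }
        for (Map.Entry<String, Claim> entry : claims.entrySet()) {
            if (entry.getKey().equals(member)) {
                continue;
            }
            Claim other = entry.getValue();
            if (other.partitions.contains(partition) && other.generation >= mine.generation) {
                return false;
            }
        }
        return true;
    }
}
```

Under this sketch, if A reports t-0 and t-1 from generation 1 while B reports t-1 from generation 2, A keeps t-0 because nobody else claims it, and B's newer claim wins t-1.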
> sync group failed with rebalanceInProgress error might cause out-of-date
> ownedPartition in Cooperative protocol
> ---------------------------------------------------------------------------------------------------------------
>
> Key: KAFKA-13419
> URL: https://issues.apache.org/jira/browse/KAFKA-13419
> Project: Kafka
> Issue Type: Bug
> Components: clients
> Affects Versions: 3.0.0
> Reporter: Luke Chen
> Assignee: Luke Chen
> Priority: Major
> Fix For: 3.1.0
>
>
> In KAFKA-13406, we found that a user got stuck while rebalancing with the
> cooperative sticky assignor. The reason is that the "ownedPartition" was
> out-of-date, so it failed the cooperative assignment validation.
> Investigating deeper, I found the root cause is that we didn't reset the
> generation and state after sync group failed. In KAFKA-12983, we fixed the
> issue that onJoinPrepare was not called in the resetStateAndRejoin method,
> which caused the ownedPartition to not get cleared. But there's another case
> where the ownedPartition will be out-of-date. Here's the example:
> # Consumer A joined and synced the group successfully with generation 1.
> # A new rebalance started with generation 2. Consumer A joined successfully,
> but for some reason did not send out sync group immediately.
> # The other consumers completed sync group successfully in generation 2; only
> consumer A did not.
> # By the time consumer A sent out sync group, a new rebalance had started,
> with generation 3, so consumer A got a REBALANCE_IN_PROGRESS error in the
> sync group response.
> # On receiving REBALANCE_IN_PROGRESS, we re-join the group with
> generation 3, but with the assignment (ownedPartition) from generation 1.
> # So now an out-of-date ownedPartition has been sent, with unexpected
> results.
>
> We might want to do *resetStateAndRejoin* when a *RebalanceInProgressException*
> error happens in *sync group*. When we get a sync group error, it means that
> join group passed, and the other consumers (and the leader) might have
> already completed this round of rebalance. The assignment this consumer has
> is already out-of-date.
>
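
The proposal quoted above can be sketched with a toy model of a consumer's membership state. This is only an illustration under stated assumptions, not the real coordinator code: SyncErrorSketch and its methods are hypothetical names, and the sketch assumes that resetting state on a failed sync group ends up clearing both the generation and the owned partitions before the next join group.

```java
import java.util.Collection;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical, simplified model of a consumer's membership state.
final class SyncErrorSketch {
    // Placeholder meaning "not part of any generation" in this model.
    static final int NO_GENERATION = -1;

    int generation = NO_GENERATION;
    Set<String> ownedPartitions = new TreeSet<>();

    // A successful sync group installs the assignment for that generation.
    void onSyncSuccess(int newGeneration, Collection<String> assignment) {
        generation = newGeneration;
        ownedPartitions = new TreeSet<>(assignment);
    }

    // Sync group failed with REBALANCE_IN_PROGRESS. With resetState == false the
    // old state is kept as-is (the behavior described in the issue); with
    // resetState == true the state is cleared before rejoining (the proposal).
    void onSyncRebalanceInProgress(boolean resetState) {
        if (resetState) {
            generation = NO_GENERATION;
            ownedPartitions.clear();
        }
    }

    // The owned partitions that the next join group request would report.
    Set<String> ownedPartitionsForJoin() {
        return new TreeSet<>(ownedPartitions);
    }
}
```

Without the reset, a consumer that last synced in generation 1 would still report its generation 1 partitions when it rejoins; with the reset, it rejoins reporting none, so the stale claim never reaches the assignor.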