[ https://issues.apache.org/jira/browse/KAFKA-13891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Philip Nee resolved KAFKA-13891.
--------------------------------
Fix Version/s: 3.5.0, 3.4.1 (was: 3.6.0)
Resolution: Fixed
> sync group failed with rebalanceInProgress error cause rebalance many rounds
> in cooperative
> -------------------------------------------------------------------------------------------
>
> Key: KAFKA-13891
> URL: https://issues.apache.org/jira/browse/KAFKA-13891
> Project: Kafka
> Issue Type: Bug
> Components: clients
> Affects Versions: 3.0.0
> Reporter: Shawn Wang
> Assignee: Philip Nee
> Priority: Major
> Fix For: 3.5.0, 3.4.1
>
>
> This issue was first found in
> [KAFKA-13419|https://issues.apache.org/jira/browse/KAFKA-13419].
> However, the previous PR did not reset the generation when SyncGroup failed
> with a REBALANCE_IN_PROGRESS error, so the original bug still exists and may
> cause the consumer to go through many rounds of rebalancing before the group
> becomes stable.
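The following is a hypothetical, heavily simplified Java model of one member's state, not Kafka client code: `MemberSketch`, `replay` and the other names are invented for illustration. It replays steps 1 to 5 of the example for consumer A and compares leaving the generation untouched on a SyncGroup REBALANCE_IN_PROGRESS error with resetting it. It assumes a `-1` "no generation" sentinel, and it assumes that a member whose generation was reset drops its owned partitions before re-joining (loosely modeled on how the consumer treats a reset generation as lost partitions). It is a sketch of the idea, not a description of the actual patch.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical, heavily simplified model of a consumer group member.
// None of these names are Kafka client APIs.
class MemberSketch {
    static final int NO_GENERATION = -1; // assumed "no generation" sentinel

    int generation;
    final Set<String> ownedPartitions = new TreeSet<>();
    final boolean resetOnSyncRebalance; // true = behavior with the reset

    MemberSketch(boolean resetOnSyncRebalance) {
        this.resetOnSyncRebalance = resetOnSyncRebalance;
    }

    // JoinGroup succeeded: the member moves to the new generation.
    void onJoinGroupSuccess(int newGeneration) {
        generation = newGeneration;
    }

    // SyncGroup succeeded: the member now owns the assigned partitions.
    void onSyncGroupSuccess(Set<String> assignment) {
        ownedPartitions.clear();
        ownedPartitions.addAll(assignment);
    }

    // SyncGroup failed with REBALANCE_IN_PROGRESS.
    void onSyncGroupRebalanceInProgress() {
        if (resetOnSyncRebalance) {
            generation = NO_GENERATION; // the reset the ticket says was missing
        }
        // Without the reset, generation and ownedPartitions are left untouched.
    }

    // Before re-joining: a member whose generation was reset gives up its
    // partitions (an assumption of this model), so it claims nothing stale.
    Set<String> ownedPartitionsForRejoin() {
        if (generation == NO_GENERATION) {
            ownedPartitions.clear();
        }
        return new TreeSet<>(ownedPartitions);
    }

    // Replays steps 1-5 of the example for consumer A and returns the
    // owned partitions it would report in the generation-3 JoinGroup.
    static Set<String> replay(boolean withReset) {
        MemberSketch a = new MemberSketch(withReset);
        a.onJoinGroupSuccess(1);
        a.onSyncGroupSuccess(Set.of("P1", "P2")); // generation 1
        a.onJoinGroupSuccess(2);                   // joined generation 2, SyncGroup delayed
        a.onSyncGroupRebalanceInProgress();        // generation 3 has already started
        return a.ownedPartitionsForRejoin();
    }

    public static void main(String[] args) {
        System.out.println("without reset: " + replay(false)); // [P1, P2]
        System.out.println("with reset:    " + replay(true));  // []
    }
}
```

Without the reset, the model carries the generation-1 claim on P1/P2 into the generation-3 JoinGroup; with it, the member re-joins without a stale claim.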
> Here is the example ({*}bold text is newly added{*}):
> # Consumer A joins and syncs the group successfully in generation 1 *(with
> ownedPartition P1/P2)*
> # A new rebalance starts with generation 2. Consumer A joins successfully
> but, for some reason, does not send its SyncGroup request immediately.
> # The other consumers complete SyncGroup successfully in generation 2, but
> consumer A does not.
> # After consumer A sends its SyncGroup request, a new rebalance starts with
> generation 3, so consumer A receives a REBALANCE_IN_PROGRESS error in the
> SyncGroup response.
> # On receiving REBALANCE_IN_PROGRESS, consumer A re-joins the group for
> generation 3, still reporting the assignment (ownedPartition) from generation 1.
> # As a result, an out-of-date ownedPartition is sent, which leads to
> unexpected results.
> # *After the generation-3 rebalance, consumer A is assigned partitions P3/P4;
> its ownedPartition is ignored because it comes from an old generation.*
> # *Consumer A revokes P1/P2 and re-joins, starting a new round of rebalance.*
> # *If another consumer C fails SyncGroup before consumer A's JoinGroup, the
> same issue happens again, resulting in many rounds of rebalance before the
> group becomes stable.*
>
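Steps 7 and 8 can be sketched in the same hypothetical style. The Java below is not the `CooperativeStickyAssignor` source; `CooperativeRoundSketch` and its methods are invented, and the generation numbers passed in `main` are illustrative assumptions. It shows the two rules the example relies on: the leader ignores owned partitions reported with an older generation, and under the cooperative protocol a member that still owns partitions it was not assigned must revoke them and re-join, which starts another round.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of steps 7-8; not Kafka's assignor or coordinator code.
class CooperativeRoundSketch {

    // Leader side: owned partitions reported with a generation older than the
    // newest generation seen in the group are treated as stale and ignored.
    static Set<String> trustedOwned(int memberGeneration, int maxGeneration, Set<String> owned) {
        if (memberGeneration < maxGeneration) {
            return Set.of(); // stale claim
        }
        return owned;
    }

    // Consumer side: partitions still owned but no longer assigned must be revoked.
    static Set<String> toRevoke(Set<String> owned, Set<String> assigned) {
        Set<String> revoked = new TreeSet<>(owned);
        revoked.removeAll(assigned);
        return revoked;
    }

    // Cooperative protocol: a non-empty revocation means the member re-joins,
    // which triggers another rebalance round.
    static boolean needsAnotherRebalance(Set<String> owned, Set<String> assigned) {
        return !toRevoke(owned, assigned).isEmpty();
    }

    public static void main(String[] args) {
        Set<String> ownedByA = Set.of("P1", "P2");    // stale claim from generation 1
        Set<String> assignedToA = Set.of("P3", "P4"); // result of the generation-3 rebalance
        // Illustrative generations: A's claim is from 1, the others synced in 2.
        System.out.println(trustedOwned(1, 2, ownedByA));                 // []
        System.out.println(toRevoke(ownedByA, assignedToA));              // [P1, P2]
        System.out.println(needsAnotherRebalance(ownedByA, assignedToA)); // true
    }
}
```

Each stale claim therefore costs an extra revoke-and-rejoin round; if another member hits the same SyncGroup failure in the meantime (step 9), the cycle repeats.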
--
This message was sent by Atlassian Jira
(v8.20.10#820010)