[jira] [Updated] (KAFKA-13419) sync group failed with retriable error might cause out-of-date ownedPartition in Cooperative protocol

Luke Chen (Jira) Fri, 29 Oct 2021 05:34:06 -0700


     [ 
https://issues.apache.org/jira/browse/KAFKA-13419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Luke Chen updated KAFKA-13419:
------------------------------
    Description: 
In KAFKA-13406, we found that a user got stuck while rebalancing with the 
cooperative sticky assignor. The reason is that the "ownedPartition" was 
out-of-date, so it failed the cooperative assignment validation.

Investigating deeper, I found the root cause is that we don't reset the 
generation and state after a sync group failure. In KAFKA-12983, we fixed the 
issue that onJoinPrepare was not called in the resetStateAndRejoin method, 
which caused the ownedPartition not to be cleared. But there's another case in 
which the ownedPartition will be out-of-date. Here's an example:
 # Consumer A joined and synced the group successfully with generation 1.
 # A new rebalance started with generation 2. Consumer A joined successfully, 
but for some reason did not send out its sync group request immediately.
 # The other consumers completed sync group successfully in generation 2; 
consumer A did not.
 # After consumer A sent out its sync group request, a new rebalance started 
with generation 3, so consumer A got a REBALANCE_IN_PROGRESS error in the sync 
group response.
 # On receiving REBALANCE_IN_PROGRESS, consumer A re-joined the group in 
generation 3 with the assignment (ownedPartition) from generation 1.
 # As a result, an out-of-date ownedPartition was sent, which led to 
unexpected results.

 

We might want to call *resetStateAndRejoin* when a 
*RebalanceInProgressException* error happens in *sync group*. A sync group 
error means that join group passed, and the other consumers (and the leader) 
might have already completed this round of rebalance, so the assignment this 
consumer holds is already out-of-date.
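
The scenario above can be replayed with a small toy model. This is only a 
sketch under stated assumptions: the Member class, its methods, and the 
partition names are hypothetical stand-ins, not Kafka's AbstractCoordinator or 
ConsumerCoordinator, and the "reset" branch simply assumes that resetting 
state clears both the generation and the owned partitions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy replay of the scenario; hypothetical names, not Kafka client code. */
public class StaleOwnedPartitionsSketch {

    /** Hypothetical stand-in for the per-member state a coordinator tracks. */
    static final class Member {
        int generation = -1; // -1 stands for "no valid generation"
        List<String> ownedPartitions = new ArrayList<>();

        // A successful sync group: adopt the generation and its assignment.
        void onSyncSuccess(int newGeneration, List<String> assignment) {
            generation = newGeneration;
            ownedPartitions = new ArrayList<>(assignment);
        }

        // Sync group answered with REBALANCE_IN_PROGRESS.
        // Assumption: a reset drops the generation and the owned partitions.
        void onSyncRebalanceInProgress(boolean resetState) {
            if (resetState) {
                generation = -1;
                ownedPartitions = new ArrayList<>();
            }
            // Without a reset, the state from the last successful sync is kept.
        }

        // What the member would report as owned in its next join group.
        List<String> ownedPartitionsForJoin() {
            return new ArrayList<>(ownedPartitions);
        }
    }

    /** Replays steps 1-5 and returns what consumer A sends when re-joining. */
    static List<String> replay(boolean resetState) {
        Member a = new Member();
        // Step 1: A joins and syncs in generation 1.
        a.onSyncSuccess(1, Arrays.asList("topic-0", "topic-1"));
        // Steps 2-3: A joins generation 2 but its sync group is delayed, so A
        // never adopts the generation 2 assignment; the other members do.
        // Step 4: A's sync group fails with REBALANCE_IN_PROGRESS.
        a.onSyncRebalanceInProgress(resetState);
        // Step 5: A re-joins for generation 3.
        return a.ownedPartitionsForJoin();
    }

    public static void main(String[] args) {
        System.out.println("without reset: " + replay(false));
        System.out.println("with reset:    " + replay(true));
    }
}
```

In this model, the run without a reset re-joins with the generation 1 
partitions, while the run with a reset re-joins owning nothing, which is the 
behavior the proposal is after.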

 

  was:
In KAFKA-13406, we found that a user got stuck while rebalancing with the 
cooperative sticky assignor. The reason is that the "ownedPartition" was 
out-of-date, so it failed the cooperative assignment validation.

Investigating deeper, I found the root cause is that we don't reset the 
generation and state after a sync group failure. In KAFKA-12983, we fixed the 
issue that onJoinPrepare was not called in the resetStateAndRejoin method, 
which caused the ownedPartition not to be cleared. But there's another case in 
which the ownedPartition will be out-of-date. Here's an example:
 # Consumer A joined and synced the group successfully with generation 1.
 # A new rebalance started with generation 2. Consumer A joined successfully, 
but for some reason did not send out its sync group request immediately.
 # The other consumers completed sync group successfully in generation 2; 
consumer A did not.
 # After consumer A sent out its sync group request, a new rebalance started 
with generation 3, so consumer A got a REBALANCE_IN_PROGRESS error in the sync 
group response.
 # On receiving REBALANCE_IN_PROGRESS, consumer A re-joined the group in 
generation 3 with the assignment (ownedPartition) from generation 1.
 # As a result, an out-of-date ownedPartition was sent, which led to 
unexpected results.

 

We might want to call resetStateAndRejoin when retriable errors happen in *sync 
group*. A sync group error means that join group passed, and the other 
consumers (and the leader) might have already completed this round of 
rebalance, so the assignment this consumer holds is already out-of-date.

 

The errors that should trigger resetStateAndRejoin in sync group are:
{code:java}
if (exception instanceof UnknownMemberIdException ||
    exception instanceof IllegalGenerationException ||
    exception instanceof RebalanceInProgressException ||
    exception instanceof MemberIdRequiredException)
    continue;
else if (!future.isRetriable())
    throw exception;
{code}


> sync group failed with retriable error might cause out-of-date ownedPartition 
> in Cooperative protocol
> -----------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-13419
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13419
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients
>    Affects Versions: 3.0.0
>            Reporter: Luke Chen
>            Assignee: Luke Chen
>            Priority: Major
>
> In KAFKA-13406, we found that a user got stuck while rebalancing with the 
> cooperative sticky assignor. The reason is that the "ownedPartition" was 
> out-of-date, so it failed the cooperative assignment validation.
> Investigating deeper, I found the root cause is that we don't reset the 
> generation and state after a sync group failure. In KAFKA-12983, we fixed the 
> issue that onJoinPrepare was not called in the resetStateAndRejoin method, 
> which caused the ownedPartition not to be cleared. But there's another case 
> in which the ownedPartition will be out-of-date. Here's an example:
>  # Consumer A joined and synced the group successfully with generation 1.
>  # A new rebalance started with generation 2. Consumer A joined 
> successfully, but for some reason did not send out its sync group request 
> immediately.
>  # The other consumers completed sync group successfully in generation 2; 
> consumer A did not.
>  # After consumer A sent out its sync group request, a new rebalance started 
> with generation 3, so consumer A got a REBALANCE_IN_PROGRESS error in the 
> sync group response.
>  # On receiving REBALANCE_IN_PROGRESS, consumer A re-joined the group in 
> generation 3 with the assignment (ownedPartition) from generation 1.
>  # As a result, an out-of-date ownedPartition was sent, which led to 
> unexpected results.
>  
> We might want to call *resetStateAndRejoin* when a 
> *RebalanceInProgressException* error happens in *sync group*. A sync group 
> error means that join group passed, and the other consumers (and the leader) 
> might have already completed this round of rebalance, so the assignment this 
> consumer holds is already out-of-date.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
