[jira] [Updated] (KAFKA-19242) Fix commit bugs caused by race condition during rebalancing.

sanghyeok An (Jira) Mon, 12 May 2025 05:44:11 -0700


     [ 
https://issues.apache.org/jira/browse/KAFKA-19242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


sanghyeok An updated KAFKA-19242:
---------------------------------
     Attachment: image-2025-05-12-21-43-07-656.png
    Description: 
*### Motivation*
While investigating “events skipped in group rebalancing” 
(spring‑projects/spring‑kafka#3703) I discovered a race condition between
 - the main poll/commit thread, and
 - the consumer‑coordinator heartbeat thread.

If the main thread enters `ConsumerCoordinator.sendOffsetCommitRequest()` while 
the heartbeat thread is finishing a rebalance 
(`SyncGroupResponseHandler.handle()`), the group state transitions in the 
following order:

!image-2025-05-12-21-43-07-656.png|width=761,height=448!

Because we read the state twice without a lock:
1. `generationIfStable()` returns `null` (state still `COMPLETING_REBALANCE`),
2. the heartbeat thread flips the state to `STABLE`,
3. the main thread re‑checks with `rebalanceInProgress()` and wrongly decides 
that a rebalance is still active,
4. a spurious `CommitFailedException` is returned even though the commit could 
succeed.

For more details, please refer to sequence diagram below.  <img
width="1494" alt="image"
src="https://github.com/user-attachments/assets/90f19af5-5e2d-4566-aece-ef764df2d89c";
/>

 

*# Impact*
 - The exception is semantically wrong: the consumer is in a stable group, but 
reports failure.
 - Frameworks and applications that rely on the semantics of 
`CommitFailedException` and `RetryableCommitException` (for example `Spring 
Kafka`) take the wrong code path, which can ultimately skip the events and 
break “at‑most‑once” guarantees.

*# Fix*
We enlarge the synchronized block in 
`ConsumerCoordinator.sendOffsetCommitRequest()` so that the consumer group 
state is examined atomically with respect to the heartbeat thread:

 

  was:
### Motivation
While investigating “events skipped in group rebalancing” 
(spring‑projects/spring‑kafka#3703) I discovered a race condition between
- the main poll/commit thread, and
- the consumer‑coordinator heartbeat thread.

If the main thread enters `ConsumerCoordinator.sendOffsetCommitRequest()` while 
the heartbeat thread is finishing a rebalance 
(`SyncGroupResponseHandler.handle()`), the group state transitions in the 
following order:

```
COMPLETING_REBALANCE  →  (race window)  →  STABLE
```
Because we read the state twice without a lock:
1. `generationIfStable()` returns `null` (state still `COMPLETING_REBALANCE`),
2. the heartbeat thread flips the state to `STABLE`,
3. the main thread re‑checks with `rebalanceInProgress()` and wrongly decides 
that a rebalance is still active,
4. a spurious `CommitFailedException` is returned even though the commit could 
succeed.


For more details, please refer to sequence diagram below.  <img
width="1494" alt="image"
src="https://github.com/user-attachments/assets/90f19af5-5e2d-4566-aece-ef764df2d89c";
/>


### Impact
- The exception is semantically wrong: the consumer is in a stable group, but 
reports failure.
- Frameworks and applications that rely on the semantics of 
`CommitFailedException` and `RetryableCommitException` (for example `Spring 
Kafka`) take the wrong code path, which can ultimately skip the events and 
break “at‑most‑once” guarantees.


### Fix
We enlarge the synchronized block in 
`ConsumerCoordinator.sendOffsetCommitRequest()` so that the consumer group 
state is examined atomically with respect to the heartbeat thread:

 


> Fix commit bugs caused by race condition during rebalancing.
> ------------------------------------------------------------
>
>                 Key: KAFKA-19242
>                 URL: https://issues.apache.org/jira/browse/KAFKA-19242
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients
>            Reporter: sanghyeok An
>            Assignee: sanghyeok An
>            Priority: Major
>         Attachments: image-2025-05-05-19-19-06-147.png, 
> image-2025-05-12-21-43-07-656.png
>
>
> *### Motivation*
> While investigating “events skipped in group rebalancing” 
> (spring‑projects/spring‑kafka#3703) I discovered a race condition between
>  - the main poll/commit thread, and
>  - the consumer‑coordinator heartbeat thread.
> If the main thread enters `ConsumerCoordinator.sendOffsetCommitRequest()` 
> while the heartbeat thread is finishing a rebalance 
> (`SyncGroupResponseHandler.handle()`), the group state transitions in the 
> following order:
> !image-2025-05-12-21-43-07-656.png|width=761,height=448!
> Because we read the state twice without a lock:
> 1. `generationIfStable()` returns `null` (state still `COMPLETING_REBALANCE`),
> 2. the heartbeat thread flips the state to `STABLE`,
> 3. the main thread re‑checks with `rebalanceInProgress()` and wrongly decides 
> that a rebalance is still active,
> 4. a spurious `CommitFailedException` is returned even though the commit 
> could succeed.
> For more details, please refer to sequence diagram below.  <img
> width="1494" alt="image"
> src="https://github.com/user-attachments/assets/90f19af5-5e2d-4566-aece-ef764df2d89c";
> />
>  
> *# Impact*
>  - The exception is semantically wrong: the consumer is in a stable group, 
> but reports failure.
>  - Frameworks and applications that rely on the semantics of 
> `CommitFailedException` and `RetryableCommitException` (for example `Spring 
> Kafka`) take the wrong code path, which can ultimately skip the events and 
> break “at‑most‑once” guarantees.
> *# Fix*
> We enlarge the synchronized block in 
> `ConsumerCoordinator.sendOffsetCommitRequest()` so that the consumer group 
> state is examined atomically with respect to the heartbeat thread:
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (KAFKA-19242) Fix commit bugs caused by race condition during rebalancing.

Reply via email to