[ https://issues.apache.org/jira/browse/KAFKA-19242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
sanghyeok An updated KAFKA-19242: --------------------------------- Description: *Motivation* While investigating “events skipped in group rebalancing” (spring‑projects/spring‑kafka#3703) I discovered a race condition between - the main poll/commit thread, and - the consumer‑coordinator heartbeat thread. If the main thread enters `ConsumerCoordinator.sendOffsetCommitRequest()` while the heartbeat thread is finishing a rebalance (`SyncGroupResponseHandler.handle()`), the group state transitions in the following order: !image-2025-05-12-21-43-07-656.png|width=761,height=448! Because we read the state twice without a lock: 1. `generationIfStable()` returns `null` (state still `COMPLETING_REBALANCE`), 2. the heartbeat thread flips the state to `STABLE`, 3. the main thread re‑checks with `rebalanceInProgress()` and wrongly decides that a rebalance is still active, 4. a spurious `CommitFailedException` is returned even though the commit could succeed. *Impact* - The exception is semantically wrong: the consumer is in a stable group, but reports failure. - Frameworks and applications that rely on the semantics of `CommitFailedException` and `RetryableCommitException` (for example `Spring Kafka`) take the wrong code path, which can ultimately skip the events and break “at‑most‑once” guarantees. *Fix* We enlarge the synchronized block in `ConsumerCoordinator.sendOffsetCommitRequest()` so that the consumer group state is examined atomically with respect to the heartbeat thread: was: Motivation While investigating “events skipped in group rebalancing” (spring‑projects/spring‑kafka#3703) I discovered a race condition between - the main poll/commit thread, and - the consumer‑coordinator heartbeat thread. If the main thread enters `ConsumerCoordinator.sendOffsetCommitRequest()` while the heartbeat thread is finishing a rebalance (`SyncGroupResponseHandler.handle()`), the group state transitions in the following order: !image-2025-05-12-21-43-07-656.png|width=761,height=448! Because we read the state twice without a lock: 1. `generationIfStable()` returns `null` (state still `COMPLETING_REBALANCE`), 2. the heartbeat thread flips the state to `STABLE`, 3. the main thread re‑checks with `rebalanceInProgress()` and wrongly decides that a rebalance is still active, 4. a spurious `CommitFailedException` is returned even though the commit could succeed. For more details, please refer to sequence diagram below. <img width="1494" alt="image" src="https://github.com/user-attachments/assets/90f19af5-5e2d-4566-aece-ef764df2d89c" /> Impact - The exception is semantically wrong: the consumer is in a stable group, but reports failure. - Frameworks and applications that rely on the semantics of `CommitFailedException` and `RetryableCommitException` (for example `Spring Kafka`) take the wrong code path, which can ultimately skip the events and break “at‑most‑once” guarantees. Fix We enlarge the synchronized block in `ConsumerCoordinator.sendOffsetCommitRequest()` so that the consumer group state is examined atomically with respect to the heartbeat thread: > Fix commit bugs caused by race condition during rebalancing. > ------------------------------------------------------------ > > Key: KAFKA-19242 > URL: https://issues.apache.org/jira/browse/KAFKA-19242 > Project: Kafka > Issue Type: Bug > Components: clients > Reporter: sanghyeok An > Assignee: sanghyeok An > Priority: Major > Attachments: image-2025-05-05-19-19-06-147.png, > image-2025-05-12-21-43-07-656.png > > > *Motivation* > While investigating “events skipped in group rebalancing” > (spring‑projects/spring‑kafka#3703) I discovered a race condition between > - the main poll/commit thread, and > - the consumer‑coordinator heartbeat thread. > If the main thread enters `ConsumerCoordinator.sendOffsetCommitRequest()` > while the heartbeat thread is finishing a rebalance > (`SyncGroupResponseHandler.handle()`), the group state transitions in the > following order: > !image-2025-05-12-21-43-07-656.png|width=761,height=448! > Because we read the state twice without a lock: > 1. `generationIfStable()` returns `null` (state still `COMPLETING_REBALANCE`), > 2. the heartbeat thread flips the state to `STABLE`, > 3. the main thread re‑checks with `rebalanceInProgress()` and wrongly decides > that a rebalance is still active, > 4. a spurious `CommitFailedException` is returned even though the commit > could succeed. > > *Impact* > - The exception is semantically wrong: the consumer is in a stable group, > but reports failure. > - Frameworks and applications that rely on the semantics of > `CommitFailedException` and `RetryableCommitException` (for example `Spring > Kafka`) take the wrong code path, which can ultimately skip the events and > break “at‑most‑once” guarantees. > > *Fix* > We enlarge the synchronized block in > `ConsumerCoordinator.sendOffsetCommitRequest()` so that the consumer group > state is examined atomically with respect to the heartbeat thread: > -- This message was sent by Atlassian Jira (v8.20.10#820010)