[
https://issues.apache.org/jira/browse/KAFKA-19902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18041386#comment-18041386
]
Nicolae Marasoiu edited comment on KAFKA-19902 at 11/28/25 10:58 PM:
---------------------------------------------------------------------
h3. Primary Issues
# *KAFKA-19902* ✅ *EXACT MATCH* - The main bug we investigated
** Consumer commits offset with stale epoch after leader change
** OFFSET_OUT_OF_RANGE when epoch metadata deleted by retention
# *KAFKA-16248* ✅ *RELATED VARIANT* - Same bug in Kafka Streams
** Kafka Streams cannot populate leader epoch correctly in EOS mode
** Same root cause, different component
h3. Foundational Issues (Historical Context)
# *KAFKA-6880* ✅ *FOUNDATIONAL* - Original zombie replica problem (2018)
** The problem that motivated KIP-320 epoch fencing
** KAFKA-19902 is an unintended consequence of the solution
# *KAFKA-7395* ✅ *IMPLEMENTATION* - KIP-320 broker-side implementation (2018)
** Added epoch validation to replication protocol
** Introduced OffsetsForLeaderEpoch API that KAFKA-19902 bug affects
h3. Tangentially Related
# *KAFKA-19633* ⚠️ *TANGENTIAL* - Kafka Connect zombie records during rebalance
** Related to zombie consumer problem domain
** NOT the same bug (different issue about ordering during rebalance)
h2. The Bug Family Tree
{{KAFKA-6880 (2018): Zombie replicas}}
{{ ↓ KIP-320: Epoch fencing solution}}
{{ ↓ KAFKA-7395 (2018): Implementation}}
{{ ↓ Bug introduced: allConsumed() uses wrong epoch}}
{{ ↓ KAFKA-16248 (2023): Streams variant}}
{{ ↓ KAFKA-19902 (2025): Consumer epoch bug discovered}}
*Key Finding*: KAFKA-19902 is an *unintended consequence* of the KIP-320
epoch fencing mechanism that was designed to solve the zombie replica problem
(KAFKA-6880). The solution introduced epoch validation, but the consumer
implementation commits offsets with the wrong epoch ({{offsetEpoch}}
instead of {{currentLeader.epoch}}), creating a new failure mode when
retention deletes epoch metadata.
was (Author: nmarasoiu):
# KAFKA-19902: Epoch-Offset Mismatch Bug - Deep Technical Analysis
## Executive Summary
This document provides a comprehensive technical analysis of KAFKA-19902, a
critical bug in Apache Kafka's consumer offset commit mechanism that can lead
to massive message reprocessing or data loss when consumer offsets are
validated after log segment retention.
**Bug Status**: Open (filed recently)
**Severity**: High - Can cause OFFSET_OUT_OF_RANGE errors leading to consumer
resets
**Root Cause**: Incorrect epoch used in offset commits (offsetEpoch vs
currentLeader.epoch)
**Impact**: Consumer restart + segment deletion = massive reprocessing from
earliest
---
## The Core Bug Mechanism
### Problem Statement
The
[`allConsumed()`](clients/src/main/java/org/apache/kafka/clients/consumer/internals/SubscriptionState.java:775)
method in `SubscriptionState.java` commits offsets with the wrong epoch:
```java
public synchronized Map<TopicPartition, OffsetAndMetadata> allConsumed() {
    Map<TopicPartition, OffsetAndMetadata> allConsumed = new HashMap<>();
    assignment.forEach((topicPartition, partitionState) -> {
        if (partitionState.hasValidPosition())
            allConsumed.put(topicPartition, new OffsetAndMetadata(
                    partitionState.position.offset,
                    partitionState.position.offsetEpoch, // ❌ BUG: Wrong epoch!
                    ""));
    });
    return allConsumed;
}
```
**The bug**: Line 780 uses `position.offsetEpoch` instead of
`position.currentLeader.epoch`.
### Understanding the Two Epochs
The
[`FetchPosition`](clients/src/main/java/org/apache/kafka/clients/consumer/internals/SubscriptionState.java:1349)
class has two distinct epoch fields:
```java
public static class FetchPosition {
    public final long offset;
    final Optional<Integer> offsetEpoch;          // Epoch when the record was WRITTEN
    final Metadata.LeaderAndEpoch currentLeader;  // Leader epoch when the record was CONSUMED
}
```
#### 1. `offsetEpoch` - The Record Batch Epoch
- **Source**: `currentBatch.partitionLeaderEpoch()` from the fetch response
- **Meaning**: The leader epoch at the time the record was **written** to the
log
- **Set in**:
[`CompletedFetch.java:200`](clients/src/main/java/org/apache/kafka/clients/consumer/internals/CompletedFetch.java:200)
```java
lastEpoch = maybeLeaderEpoch(currentBatch.partitionLeaderEpoch());
```
- **Updated to position in**:
[`FetchCollector.java:180-183`](clients/src/main/java/org/apache/kafka/clients/consumer/internals/FetchCollector.java:180)
```java
SubscriptionState.FetchPosition nextPosition = new SubscriptionState.FetchPosition(
        nextInLineFetch.nextFetchOffset(),
        nextInLineFetch.lastEpoch(),  // This is offsetEpoch (from batch)
        position.currentLeader);      // This is currentLeader (from metadata)
```
#### 2. `currentLeader.epoch` - The Current Leader Epoch
- **Source**: Metadata update when consumer discovers current partition leader
- **Meaning**: The leader epoch at the time the consumer is **consuming** from
this partition
- **Updated**: During metadata updates and leader validation
### The Critical Difference
In a stable cluster, these two epochs are usually the same. However:
**Scenario: Leader change during consumption**
1. Consumer starts fetching from Leader A (epoch 5)
2. Fetch response contains records written during epoch 4
3. `offsetEpoch` = 4 (from the batch header)
4. `currentLeader.epoch` = 5 (from metadata)
5. Consumer commits offset with epoch **4** instead of **5**
**This creates a time bomb** 🕐💣
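To make the divergence concrete, here is a minimal, self-contained sketch; the `Position` class below is a hypothetical stand-in for Kafka's `FetchPosition` (same field roles, not the real class):

```java
import java.util.Optional;

// Hypothetical stand-in for Kafka's FetchPosition; the field names mirror
// the real class, but this is an illustration, not Kafka's implementation.
class Position {
    final long offset;
    final Optional<Integer> offsetEpoch;        // epoch the batch was WRITTEN under
    final Optional<Integer> currentLeaderEpoch; // epoch of the leader we FETCH from

    Position(long offset, Optional<Integer> offsetEpoch, Optional<Integer> currentLeaderEpoch) {
        this.offset = offset;
        this.offsetEpoch = offsetEpoch;
        this.currentLeaderEpoch = currentLeaderEpoch;
    }
}

public class EpochDivergence {
    public static void main(String[] args) {
        // Leader is at epoch 5 (from metadata), but the fetched batch was
        // written under epoch 4, before the last leader election.
        Position pos = new Position(1000L, Optional.of(4), Optional.of(5));

        // Buggy commit path: uses the batch epoch (stale after a leader change).
        System.out.println("committed epoch = " + pos.offsetEpoch.get());

        // Intended commit path: uses the current leader epoch from metadata.
        System.out.println("should commit   = " + pos.currentLeaderEpoch.get());
    }
}
```

In a stable cluster both fields hold the same value, which is why the bug hides until a leader election lands between the batch's write and the commit.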
---
## The Failure Chain
### Step 1: Normal Operation with Stale Epoch
```
Time T0: Leader election completes
├─ Partition Leader: Broker 1, Epoch 5
├─ Consumer metadata: Leader epoch = 5
└─ Consumer fetches batch written in epoch 4
└─ position.offsetEpoch = 4
└─ position.currentLeader.epoch = 5
Time T1: Consumer commits offset 1000
├─ Committed: (offset=1000, leaderEpoch=4) ❌ Should be epoch 5!
└─ Stored in __consumer_offsets topic
```
### Step 2: Log Retention Deletes Old Segments
```
Time T2: Retention policy triggers (e.g., 7 days)
├─ Broker deletes log segments containing epoch 4
├─ Remaining segments: epoch 5, 6, 7...
└─ Epoch 4 metadata is GONE
```
### Step 3: Consumer Restarts and Validation Fails
```
Time T3: Consumer restarts
├─ Fetches committed offset from __consumer_offsets
│ └─ Reads: (offset=1000, leaderEpoch=4)
│
├─ Attempts to validate via OffsetsForLeaderEpoch request
│  └─ Sends: { partition: X, leaderEpoch: 4 }
│
├─ Broker response: UNDEFINED_EPOCH_OFFSET
│ └─ "I don't have epoch 4 anymore"
│
└─ Consumer behavior (depends on reset policy):
├─ If auto.offset.reset=earliest: Reprocesses ALL messages
├─ If auto.offset.reset=latest: Skips to end (data loss)
└─ If auto.offset.reset=none: Throws OffsetOutOfRangeException
```
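The broker-side outcome in Step 3 reduces to a membership check against the epochs still on disk. A minimal sketch, assuming an illustrative `validate` helper and `Result` enum (not Kafka's actual API):

```java
import java.util.Set;

// Sketch of the check behind OffsetsForLeaderEpoch: if the requested epoch
// is no longer retained, the broker cannot vouch for the committed offset.
// Names are illustrative, not taken from the Kafka broker code.
public class EpochValidationSketch {
    enum Result { VALID, UNDEFINED_EPOCH_OFFSET }

    static Result validate(int committedEpoch, Set<Integer> retainedEpochs) {
        // Retention deleted the segments (and epoch entries) for old epochs.
        return retainedEpochs.contains(committedEpoch)
                ? Result.VALID
                : Result.UNDEFINED_EPOCH_OFFSET;
    }

    public static void main(String[] args) {
        Set<Integer> retained = Set.of(5, 6, 7); // epochs still on disk
        System.out.println(validate(5, retained)); // a recent epoch validates
        System.out.println(validate(4, retained)); // the deleted epoch does not
    }
}
```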
---
## Code Flow Analysis
### Commit Path
1. **User calls** `commitSync()` or `commitAsync()`
-
[`AsyncKafkaConsumer.java:1699`](clients/src/main/java/org/apache/kafka/clients/consumer/internals/AsyncKafkaConsumer.java:1699)
-
[`ClassicKafkaConsumer.java:731`](clients/src/main/java/org/apache/kafka/clients/consumer/internals/ClassicKafkaConsumer.java:731)
2. **Commit manager invokes** `subscriptions.allConsumed()`
-
[`CommitRequestManager.java:280`](clients/src/main/java/org/apache/kafka/clients/consumer/internals/CommitRequestManager.java:280)
-
[`ConsumerCoordinator.java:1182`](clients/src/main/java/org/apache/kafka/clients/consumer/internals/ConsumerCoordinator.java:1182)
3. **Bug occurs** in `allConsumed()` ❌
-
[`SubscriptionState.java:779-780`](clients/src/main/java/org/apache/kafka/clients/consumer/internals/SubscriptionState.java:779)
- Uses `position.offsetEpoch` instead of `position.currentLeader.epoch`
4. **Offset committed** with stale epoch to Kafka
### Validation Path
When consumer restarts:
1. **Position validation** triggered by metadata change
-
[`OffsetFetcherUtils.java:198-205`](clients/src/main/java/org/apache/kafka/clients/consumer/internals/OffsetFetcherUtils.java:198)
2. **OffsetsForLeaderEpoch request** built with committed epoch
-
[`OffsetsForLeaderEpochUtils.java:53-61`](clients/src/main/java/org/apache/kafka/clients/consumer/internals/OffsetsForLeaderEpochUtils.java:53)
- Uses `fetchPosition.offsetEpoch` from committed offset
3. **Broker validation** fails if epoch deleted
-
[`SubscriptionState.java:582-587`](clients/src/main/java/org/apache/kafka/clients/consumer/internals/SubscriptionState.java:582)
4. **Reset triggered** or exception thrown
-
[`FetchCollector.java:336-356`](clients/src/main/java/org/apache/kafka/clients/consumer/internals/FetchCollector.java:336)
---
## The "Dangling Reference" Problem
This bug is analogous to a memory management issue:
| Concept | Memory Management | Kafka Bug |
|---------|------------------|-----------|
| **Reference** | Pointer to memory | Committed offset + epoch |
| **Object** | Allocated memory block | Log segment with epoch metadata |
| **Garbage Collection** | Memory freed | Segment deleted by retention |
| **Dangling Reference** | Pointer to freed memory | Offset references deleted
epoch |
| **Consequence** | Segfault / undefined behavior | OFFSET_OUT_OF_RANGE /
consumer reset |
### Why This is Hard to Fix at the Retention Layer
From the conversation with Opus, we discussed making retention
"reference-aware":
```
❌ Naive Solution: Don't delete segments referenced by committed offsets
Problems:
1. Slow consumers become retention anchors (disk fills indefinitely)
2. Consumer groups are "soft state" (can exist forever even if inactive)
3. Cross-cutting concern (retention doesn't naturally see offset storage)
4. Abandoned consumer groups prevent cleanup forever
```
---
## The Correct Fix
### Change Location
File:
[`clients/src/main/java/org/apache/kafka/clients/consumer/internals/SubscriptionState.java`](clients/src/main/java/org/apache/kafka/clients/consumer/internals/SubscriptionState.java:775)
### Before (Buggy)
```java
public synchronized Map<TopicPartition, OffsetAndMetadata> allConsumed() {
    Map<TopicPartition, OffsetAndMetadata> allConsumed = new HashMap<>();
    assignment.forEach((topicPartition, partitionState) -> {
        if (partitionState.hasValidPosition())
            allConsumed.put(topicPartition, new OffsetAndMetadata(
                    partitionState.position.offset,
                    partitionState.position.offsetEpoch, // ❌ BUG
                    ""));
    });
    return allConsumed;
}
```
### After (Fixed)
```java
public synchronized Map<TopicPartition, OffsetAndMetadata> allConsumed() {
    Map<TopicPartition, OffsetAndMetadata> allConsumed = new HashMap<>();
    assignment.forEach((topicPartition, partitionState) -> {
        if (partitionState.hasValidPosition())
            allConsumed.put(topicPartition, new OffsetAndMetadata(
                    partitionState.position.offset,
                    partitionState.position.currentLeader.epoch, // ✅ FIXED
                    ""));
    });
    return allConsumed;
}
```
### Why This Works
1. **`currentLeader.epoch` is always recent** - It's from the current metadata
2. **Recent epochs are less likely to be deleted** - Retention targets old
segments
3. **Validation will succeed** - Broker still has recent epoch metadata
4. **No architectural coupling** - Fix is surgical, in the commit path only
---
## Impact Analysis
### When the Bug Manifests
The bug requires this specific sequence:
```
1. Leader change occurs (creates epoch N+1)
2. Consumer fetches records written in epoch N
3. Consumer commits offset with epoch N (bug triggers)
4. Retention deletes segments containing epoch N
5. Consumer restarts
6. Validation fails (epoch N not found)
7. Consumer resets to earliest/latest
```
**Frequency factors**:
- **Leader changes**: Common in production (rolling restarts, broker failures,
preferred leader election)
- **Retention duration**: Days to weeks typically
- **Consumer restart frequency**: Deployments, rebalances, crashes
**Likelihood**: **Medium-High** in production environments with:
- Active leader elections
- Long retention periods
- Frequent consumer restarts
### Real-World Impact
**Scenario 1: auto.offset.reset=earliest**
```
Consumer processes 1TB of data over 7 days
Offset committed: (offset=1B, epoch=5)
Retention deletes epoch 5 segments
Consumer restarts → resets to offset 0
Result: Reprocesses 1TB of data (massive load spike)
```
**Scenario 2: auto.offset.reset=latest**
```
Consumer at offset 1,000,000 (epoch 5)
Consumer goes down for maintenance
Retention deletes epoch 5 segments
Consumer restarts → skips to offset 10,000,000
Result: 9 million messages lost (data loss)
```
**Scenario 3: auto.offset.reset=none**
```
Consumer throws OffsetOutOfRangeException
Application fails to start
Result: Service outage
```
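None of the three reset outcomes above is benign, so until a fix ships a team may prefer to fail fast and handle the reset explicitly. A minimal configuration sketch; the property keys are the standard consumer configs, and the surrounding handling is left to the application:

```java
import java.util.Properties;

// Sketch: prefer a visible exception over silent reprocessing or data loss.
// With auto.offset.reset=none, a failed offset validation surfaces as an
// OffsetOutOfRangeException instead of a quiet reset to earliest/latest.
public class ResetPolicyConfig {
    public static Properties defensiveConsumerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // illustrative endpoint
        props.put("group.id", "my-group");                // illustrative group
        // "none" -> fail fast; operators decide whether to re-seek or investigate.
        props.put("auto.offset.reset", "none");
        // Manual commits keep the commit point under application control.
        props.put("enable.auto.commit", "false");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(defensiveConsumerProps().getProperty("auto.offset.reset"));
    }
}
```

The trade-off is an outage-style failure (Scenario 3) instead of silent reprocessing or loss, which is often preferable for correctness-sensitive pipelines.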
---
## Testing Gaps
Current tests in
[`SubscriptionStateTest.java`](clients/src/test/java/org/apache/kafka/clients/consumer/internals/SubscriptionStateTest.java)
and
[`CommitRequestManagerTest.java`](clients/src/test/java/org/apache/kafka/clients/consumer/internals/CommitRequestManagerTest.java)
do not cover:
1. **Epoch mismatch scenario**:
- Fetch from epoch N, commit with leader epoch N+1
2. **Validation after retention**:
- Committed epoch deleted, validation fails
3. **Integration test**:
- End-to-end scenario with retention + restart
### Suggested Test
```java
@Test
public void testAllConsumedUsesCurrentLeaderEpochNotOffsetEpoch() {
    // Setup: Leader at epoch 5, records from epoch 4
    Metadata.LeaderAndEpoch currentLeader = new Metadata.LeaderAndEpoch(
            Optional.of(broker1), Optional.of(5));  // Current leader epoch
    SubscriptionState.FetchPosition position = new SubscriptionState.FetchPosition(
            100L,            // offset
            Optional.of(4),  // offsetEpoch (from record batch)
            currentLeader);  // currentLeader.epoch = 5
    state.seekValidated(tp0, position);

    // Execute
    Map<TopicPartition, OffsetAndMetadata> consumed = state.allConsumed();

    // Verify: Should use currentLeader.epoch (5), not offsetEpoch (4)
    OffsetAndMetadata offsetMeta = consumed.get(tp0);
    assertEquals(100L, offsetMeta.offset());
    assertEquals(Optional.of(5), offsetMeta.leaderEpoch()); // NOT 4!
}
```
---
## Related Issues
From the JIRA search in the conversation:
- **KAFKA-16248**: Kafka Streams calls commitSync with explicit
OffsetAndMetadata but does not populate the offset leader epoch
- **KAFKA-19633**: Kafka Connect connectors send out zombie records during
rebalance
- **KAFKA-6880**: Zombie replicas must be fenced (KIP-320 epoch fencing)
- **KAFKA-7395**: Add fencing to replication protocol (KIP-320)
All relate to epoch fencing mechanisms introduced to prevent zombie
consumers/leaders.
---
## Architectural Context
### The Design Tension
Kafka's epoch fencing (KIP-320) was added to solve the zombie consumer problem:
**Without epochs**:
```
1. Consumer A is leader, commits offset 100
2. Consumer A is partitioned (still alive, doesn't know it's fenced)
3. Consumer B becomes leader, commits offset 200
4. Consumer A reconnects, commits offset 150 (overwrites!)
5. Result: Lost offsets 151-200
```
**With epochs** (intended):
```
1. Consumer A at epoch 5, commits (offset=100, epoch=5)
2. Leader election → epoch 6
3. Consumer B at epoch 6, commits (offset=200, epoch=6)
4. Consumer A tries to commit (offset=150, epoch=5)
5. Broker rejects: "Stale epoch, fenced!"
6. Result: Offsets protected ✅
```
**With epochs** (actual bug):
```
1. Consumer fetches from epoch 6, records written in epoch 5
2. Consumer commits (offset=100, epoch=5) ❌ Should be epoch 6
3. Retention deletes epoch 5
4. Consumer restarts, validation fails
5. Result: Massive reprocessing or data loss ❌
```
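The rejection in step 5 of the intended flow can be sketched as a stale-epoch comparison. `FencingSketch` is an illustrative model of the fencing idea, not the group coordinator's actual code:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative model of epoch fencing: a commit carrying an epoch older than
// the newest one seen for a partition is rejected as coming from a zombie.
public class FencingSketch {
    private final Map<String, Integer> lastSeenEpoch = new HashMap<>();

    // Returns true if the commit is accepted, false if fenced as stale.
    boolean tryCommit(String partition, int epoch, long offset) {
        int last = lastSeenEpoch.getOrDefault(partition, -1);
        if (epoch < last) {
            return false; // "Stale epoch, fenced!"
        }
        lastSeenEpoch.put(partition, epoch);
        return true;
    }

    public static void main(String[] args) {
        FencingSketch coordinator = new FencingSketch();
        System.out.println(coordinator.tryCommit("p0", 6, 200L)); // accepted
        System.out.println(coordinator.tryCommit("p0", 5, 150L)); // fenced
    }
}
```

The actual-bug flow defeats this protection from the other side: the commit carries an honestly stale epoch, so fencing passes, and the failure only surfaces later when retention deletes that epoch's segments.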
### The Root Cause Philosophy
The bug stems from conflating two concepts:
- **When was this data written?** → `offsetEpoch`
- **Who am I currently talking to?** → `currentLeader.epoch`
For validation purposes, we care about "who am I talking to", not "when was
this written".
---
## Recommendations
### Immediate (Fix the Bug)
1. **Change line 780** in `SubscriptionState.allConsumed()` to use
`currentLeader.epoch`
2. **Add test case** covering epoch mismatch scenario
3. **Validate fix** doesn't break existing epoch fencing guarantees
### Short-term (Prevent Recurrence)
1. **Code review guideline**: Any OffsetAndMetadata construction must use
currentLeader.epoch
2. **Linter rule**: Flag uses of `offsetEpoch` in commit paths
3. **Integration test**: Retention + restart + validation scenario
### Long-term (Architectural)
1. **Consider**: Should `FetchPosition` have two separate epoch fields?
- Rename `offsetEpoch` → `recordBatchEpoch` (clarify purpose)
- Make `currentLeader.epoch` the only validation epoch
2. **Consider**: Broker-side epoch caching
- Keep recent epoch metadata even after segment deletion
- Reduces validation failure window
3. **Documentation**: Clearly document epoch semantics in javadocs
---
## Conclusion
KAFKA-19902 is a **critical bug** with **high real-world impact**. The fix is
**surgical** (one line change) but the implications are **broad** (affects all
consumers using epoch-based validation).
The bug represents a classic distributed systems challenge: **coordinating
independent lifecycles** (offset commits vs log retention) with **partial
information** (epoch metadata). The fix moves the responsibility to the commit
path, using fresher epoch information that's less likely to be
garbage-collected.
**Priority**: **High** - Should be fixed in next patch release
**Risk**: **Low** - One-line change, clear semantics, testable
**Impact**: **High** - Prevents consumer reset storms and data loss scenarios
---
## References
- JIRA: KAFKA-19902 (recently filed)
- Related KIPs: KIP-320 (epoch fencing), KIP-74 (fetch protocol v4)
- Code locations:
- Bug:
[`SubscriptionState.java:779-780`](clients/src/main/java/org/apache/kafka/clients/consumer/internals/SubscriptionState.java:779)
- Epoch source:
[`CompletedFetch.java:200`](clients/src/main/java/org/apache/kafka/clients/consumer/internals/CompletedFetch.java:200)
- Position update:
[`FetchCollector.java:180-183`](clients/src/main/java/org/apache/kafka/clients/consumer/internals/FetchCollector.java:180)
- Validation:
[`OffsetsForLeaderEpochUtils.java:53-61`](clients/src/main/java/org/apache/kafka/clients/consumer/internals/OffsetsForLeaderEpochUtils.java:53)
---
*Analysis conducted: 2025-11-28*
*Kafka codebase version: trunk (latest)*
*Analysis methodology: Code tracing + semantic search + JIRA research*
h3. Primary Issues
# *KAFKA-19902* ✅ *EXACT MATCH* - The main bug we investigated
** Consumer commits offset with stale epoch after leader change
** OFFSET_OUT_OF_RANGE when epoch metadata deleted by retention
# *KAFKA-16248* ✅ *RELATED VARIANT* - Same bug in Kafka Streams
** Kafka Streams cannot populate leader epoch correctly in EOS mode
** Same root cause, different component
h3. Foundational Issues (Historical Context)
# *KAFKA-6880* ✅ *FOUNDATIONAL* - Original zombie replica problem (2018)
** The problem that motivated KIP-320 epoch fencing
** KAFKA-19902 is an unintended consequence of the solution
# *KAFKA-7395* ✅ *IMPLEMENTATION* - KIP-320 broker-side implementation (2018)
** Added epoch validation to replication protocol
** Introduced OffsetsForLeaderEpoch API that KAFKA-19902 bug affects
h3. Tangentially Related
# *KAFKA-19633* ⚠️ *TANGENTIAL* - Kafka Connect zombie records during rebalance
** Related to zombie consumer problem domain
** NOT the same bug (different issue about ordering during rebalance)
h2. The Bug Family Tree
{{KAFKA-6880 (2018): Zombie replicas}}
{{ ↓ KIP-320: Epoch fencing solution}}
{{ ↓ KAFKA-7395 (2018): Implementation}}
{{ ↓ Bug introduced: allConsumed() uses wrong epoch}}
{{ ↓ KAFKA-16248 (2023): Streams variant}}
{{ ↓ KAFKA-19902 (2025): Consumer epoch bug discovered}}
*Key Finding*: KAFKA-19902 is an *unintended consequence* of the KIP-320
epoch fencing mechanism that was designed to solve the zombie replica problem
(KAFKA-6880). The solution introduced epoch validation, but the consumer
implementation commits offsets with the wrong epoch ({{offsetEpoch}}
instead of {{currentLeader.epoch}}), creating a new failure mode when
retention deletes epoch metadata.
> Consumer triggers OFFSET_OUT_OF_RANGE when committed offset uses stale epoch
> after leader change
> ------------------------------------------------------------------------------------------------
>
> Key: KAFKA-19902
> URL: https://issues.apache.org/jira/browse/KAFKA-19902
> Project: Kafka
> Issue Type: Bug
> Components: clients, consumer
> Affects Versions: 3.9.0
> Reporter: RivenSun
> Priority: Major
> Attachments: image-2025-11-21-18-03-24-592.png,
> image-2025-11-21-18-11-24-316.png, image-2025-11-21-20-00-10-862.png
>
>
> h2. Summary
> When a partition leader changes and the consumer commits offsets with a stale
> epoch, if the log segments containing that epoch are subsequently deleted due
> to retention policy, the consumer will encounter OFFSET_OUT_OF_RANGE error
> and reset to earliest (if auto.offset.reset=earliest), causing massive
> message reprocessing. The root cause is that SubscriptionState.allConsumed()
> uses position.offsetEpoch instead of position.currentLeader.epoch when
> constructing OffsetAndMetadata for commit, which can become stale when leader
> changes occur.
> If `auto.offset.reset=latest`, *the consequences will be more severe,
> resulting in message loss.*
> ----
> h2. Environment
> Cluster Configuration:
> * Kafka Server Version: 3.9.0
> * Kafka Client Version: 3.9.0
> * Topic: 200 partitions, 7-day retention, no tiered storage
> * Consumer Group: 45 consumers (1 KafkaConsumer thread per machine)
> * No broker/controller restarts occurred
> * High throughput producer continuously writing messages
> Consumer Configuration:
> {code:java}
> auto.offset.reset=earliest
> enable.auto.commit=true {code}
>
> Consumer Code:
> * Registered ConsumerRebalanceListener
> * Calls kafkaConsumer.commitSync() in onPartitionsRevoked() method
> ----
> h2. Problem Description
> In a scenario where the consumer group has no lag, consumers suddenly
> consumed a massive amount of messages, far exceeding the recent few minutes
> of producer writes. Investigation revealed that multiple partitions reset to
> the earliest offset and reprocessed up to 7 days of historical data.
> ----
> h2. Observed Symptoms (Timeline)
> # Consumer group rebalance occurred (triggered by normal consumer group
> management)
> # Consumer logged OFFSET_OUT_OF_RANGE errors immediately after rebalance
> # Consumer reset to earliest offset due to auto.offset.reset=earliest
> configuration
> # Kafka broker logged NotLeaderOrFollowerException around the same
> timeframe, indicating partition leader changes
> # Consumer did not log any NOT_LEADER_OR_FOLLOWER errors (these are DEBUG
> level and not visible in production logs)
> The image below uses the partition
> asyncmq_local_us_us_marketplace-ssl-a_aws_us-east-1_2a7e053c-9d90-4efd-af2d-3a8bf9564715-153
> as an example to trace the problem log chain. *See attached log screenshot
> at the bottom*.
> ----
> h2. Root Cause Analysis
> h3. The Problem Chain
> 1. Leader change occurs (epoch changes from N to N+1)
> ↓
> 2. Consumer continues processing old batches (epoch=N)
> ↓
> 3. Consumer commits offset during/after rebalance
> ├─ Committed offset: 1000
> └─ Committed epoch: N (using position.offsetEpoch from old batch)
> ↓
> 4. High throughput + retention policy causes old segments (epoch=N) to be
> deleted
> ↓
> 5. Consumer restarts/rebalances and fetches committed offset
> ├─ Tries to validate offset 1000 with epoch=N
> └─ Broker cannot find epoch=N (*segments deleted or tp leader change*)
> ↓
> 6. Broker returns OFFSET_OUT_OF_RANGE
> ↓
> 7. Consumer resets to earliest offset
> h3. Code Analysis
> The problematic code in SubscriptionState.allConsumed():
> {code:java}
> // kafka/clients/src/main/java/org/apache/kafka/clients/consumer/internals/SubscriptionState.java
> public synchronized Map<TopicPartition, OffsetAndMetadata> allConsumed() {
>     Map<TopicPartition, OffsetAndMetadata> allConsumed = new HashMap<>();
>     assignment.forEach((topicPartition, partitionState) -> {
>         if (partitionState.hasValidPosition())
>             allConsumed.put(topicPartition, new OffsetAndMetadata(
>                     partitionState.position.offset,
>                     partitionState.position.offsetEpoch, // Problem: uses offsetEpoch from consumed batch
>                     ""));
>     });
>     return allConsumed;
> } {code}
>
> Why this is problematic: The FetchPosition class contains two different epoch
> values:
> * offsetEpoch: The epoch from the last consumed record's batch
> * currentLeader.epoch: The current partition leader's epoch from metadata
> When committing offsets, we should use currentLeader.epoch instead of
> offsetEpoch because:
> # offsetEpoch represents the epoch of already consumed data (historical)
> # currentLeader.epoch represents the current partition leader (up-to-date)
> h3. Scenarios Where These Epochs Diverge
> Scenario A: Leader changes while consumer is processing old batches
> * T1: Consumer fetches batch with epoch=5
> * T2: Leader changes to epoch=6
> * T3: Metadata updates with new leader epoch=6
> * T4: Consumer commits offset
> * offsetEpoch = 5 (from batch being processed)
> * currentLeader.epoch = 6 (from updated metadata)
> * Problem: Commits epoch=5, which may soon be deleted
> Scenario B: Recovery from committed offset after leader change
> * Consumer commits offset with old epoch=N
> * Leader changes to epoch=N+1
> * Old segments (epoch=N) are deleted by retention policy
> * Consumer rebalances and tries to restore from committed offset
> * offsetEpoch = N (from committed offset)
> * currentLeader.epoch = N+1 (from current metadata)
> * Problem: Validation fails because epoch=N no longer exists
> ----
> h2. Steps to Reproduce
> This is a timing-sensitive edge case. The following conditions increase the
> likelihood:
> # Setup:
> * High-throughput topic (to trigger faster log rotation)
> * Relatively short retention period (e.g., 7 days)
> * Consumer group with rebalance listener calling commitSync()
> * enable.auto.commit=true (or any manual commit)
> # Trigger:
> * Trigger a partition leader change (broker restart, controller election,
> etc.)
> * Simultaneously or shortly after, trigger a consumer group rebalance
> * Wait for retention policy to delete old log segments
> # Expected Result:
> Consumer should resume from committed offset
> # Actual Result:
> Consumer encounters OFFSET_OUT_OF_RANGE and resets to earliest
> ----
> h2. Impact
> * Data Reprocessing: Consumers may reprocess up to retention.ms worth of data
> * Service Degradation: Sudden spike in consumer throughput can overwhelm
> downstream systems
> * Resource Waste: Unnecessary CPU, memory, and network usage
> * Potential Duplicates: If using auto.offset.reset=earliest, duplicate
> message processing is guaranteed
> * If `auto.offset.reset=latest`, *the consequences will be more severe,
> resulting in message loss.*
> ----
> h2. Proposed Fix
> h3. Root Cause Analysis
> The issue is more fundamental than a simple field selection problem. The core
> issue is that both epoch values in FetchPosition can be stale at commit time:
> # offsetEpoch: Contains the epoch from the last consumed record's batch. If
> a leader change occurs after consumption but before commit, this epoch
> becomes stale and may reference log segments that have been deleted.
> # currentLeader.epoch: Inherited from the previous position during normal
> consumption and only updated when:
> * NOT_LEADER_OR_FOLLOWER or FENCED_LEADER_EPOCH errors are detected
> * Position is restored from committed offsets (fetches from metadata)
> * Explicit validation is triggered via
> maybeValidatePositionForCurrentLeader()
> During normal, error-free consumption, currentLeader is never updated and can
> also become stale.
> h3. Problem with Current Code
> Location: org.apache.kafka.clients.consumer.internals.FetchCollector
> {code:java}
> if (nextInLineFetch.nextFetchOffset() > position.offset) {
>     SubscriptionState.FetchPosition nextPosition = new SubscriptionState.FetchPosition(
>             nextInLineFetch.nextFetchOffset(),
>             nextInLineFetch.lastEpoch(),  // offsetEpoch: from consumed batch
>             position.currentLeader);      // ❌ currentLeader: inherited, NOT updated!
>     log.trace("Updating fetch position from {} to {} for partition {} and returning {} records from `poll()`",
>             position, nextPosition, tp, partRecords.size());
>     subscriptions.position(tp, nextPosition);
>     positionAdvanced = true;
> }
> {code}
> The inherited currentLeader means it can be as stale as offsetEpoch in
> certain scenarios.
> ----
> h3. Recommended Solution: Proactively Update currentLeader During Position
> Updates
> Option 1: Update currentLeader when advancing position (Primary Recommendation)
> Modify FetchCollector to fetch the latest leader information from metadata
> every time the position is updated:
> {code:java}
> if (nextInLineFetch.nextFetchOffset() > position.offset) {
>     // Fetch the latest leader information from metadata
>     Metadata.LeaderAndEpoch currentLeaderAndEpoch = metadata.currentLeader(tp);
>
>     SubscriptionState.FetchPosition nextPosition = new SubscriptionState.FetchPosition(
>             nextInLineFetch.nextFetchOffset(),
>             nextInLineFetch.lastEpoch(),
>             currentLeaderAndEpoch); // ✅ Use fresh leader info from metadata
>
>     log.trace("Updating fetch position from {} to {} for partition {} and returning {} records from `poll()`",
>             position, nextPosition, tp, partRecords.size());
>     subscriptions.position(tp, nextPosition);
>     positionAdvanced = true;
> } {code}
> Advantages:
> * Ensures currentLeader is always up-to-date
> * Makes allConsumed() safe to use *currentLeader.epoch* for commits
> Modify SubscriptionState.allConsumed() to {color:#de350b}use
> currentLeader.epoch instead of offsetEpoch{color}:
> {code:java}
> public synchronized Map<TopicPartition, OffsetAndMetadata> allConsumed() {
>     Map<TopicPartition, OffsetAndMetadata> allConsumed = new HashMap<>();
>     assignment.forEach((topicPartition, partitionState) -> {
>         if (partitionState.hasValidPosition())
>             allConsumed.put(topicPartition, new OffsetAndMetadata(
>                     partitionState.position.offset,
>                     partitionState.position.currentLeader.epoch, // ✅ Use current leader epoch
>                     ""));
>     });
>     return allConsumed;
> } {code}
>
> * Minimal performance impact (metadata lookup is O(1) from local cache)
> * Aligns with the existing pattern in refreshCommittedOffsets()
> Potential Concerns:
> * Adds one metadata lookup per position update
> * If metadata is stale, currentLeader.epoch could still lag slightly, but
> this is the same risk as today
> ----
> h3. Alternative Solutions
> Option 2: Fetch fresh leader info during commit
> Modify allConsumed() to fetch the latest leader information at commit time:
> {code:java}
> // Note: This would require passing metadata reference to allConsumed()
> public synchronized Map<TopicPartition, OffsetAndMetadata> allConsumed(ConsumerMetadata metadata) {
>     Map<TopicPartition, OffsetAndMetadata> allConsumed = new HashMap<>();
>     assignment.forEach((topicPartition, partitionState) -> {
>         if (partitionState.hasValidPosition()) {
>             // Fetch the latest leader epoch from metadata at commit time
>             Metadata.LeaderAndEpoch latestLeader = metadata.currentLeader(topicPartition);
>             Optional<Integer> epochToCommit = latestLeader.epoch.isPresent()
>                     ? latestLeader.epoch
>                     : partitionState.position.offsetEpoch; // Fallback to offsetEpoch
>
>             allConsumed.put(topicPartition, new OffsetAndMetadata(
>                     partitionState.position.offset,
>                     epochToCommit,
>                     ""));
>         }
>     });
>     return allConsumed;
> } {code}
> Advantages:
> * Only impacts commit path, not consumption hot path
> * Directly addresses the commit-time staleness issue
> Disadvantages:
> * Requires changing the signature of allConsumed() (API change)
> * May still have a race condition if leader changes between metadata fetch
> and commit
> * Metadata could be stale if update hasn't been processed yet
> ----
> Option 3: Use the maximum epoch value
> Use the larger of the two epoch values, assuming newer epochs have higher values:
> {code:java}
> public synchronized Map<TopicPartition, OffsetAndMetadata> allConsumed() {
>     Map<TopicPartition, OffsetAndMetadata> allConsumed = new HashMap<>();
>     assignment.forEach((topicPartition, partitionState) -> {
>         if (partitionState.hasValidPosition()) {
>             Optional<Integer> epochToCommit;
>
>             if (partitionState.position.offsetEpoch.isPresent() &&
>                     partitionState.position.currentLeader.epoch.isPresent()) {
>                 // Use the maximum of the two epochs
>                 int maxEpoch = Math.max(
>                         partitionState.position.offsetEpoch.get(),
>                         partitionState.position.currentLeader.epoch.get());
>                 epochToCommit = Optional.of(maxEpoch);
>             } else {
>                 // Fallback to whichever is present
>                 epochToCommit = partitionState.position.currentLeader.epoch.isPresent()
>                         ? partitionState.position.currentLeader.epoch
>                         : partitionState.position.offsetEpoch;
>             }
>
>             allConsumed.put(topicPartition, new OffsetAndMetadata(
>                     partitionState.position.offset,
>                     epochToCommit,
>                     ""));
>         }
>     });
>     return allConsumed;
> } {code}
> Advantages:
> * No API changes required
> * Simple to implement
> * Provides better protection than using only one epoch
> Disadvantages:
> * Heuristic-based; assumes epochs are monotonically increasing
> * Could still use a stale epoch if both values are old
> * Doesn't solve the root cause of stale currentLeader
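The max-epoch heuristic from Option 3 can be exercised in isolation. A minimal, self-contained sketch using plain java.util.Optional rather than the Kafka types; the helper name pickEpochToCommit is hypothetical, introduced here only for illustration:

```java
import java.util.Optional;

public class EpochSelection {
    // Hypothetical helper mirroring the Option 3 logic: prefer the larger of
    // offsetEpoch and currentLeader.epoch, falling back to whichever is present.
    static Optional<Integer> pickEpochToCommit(Optional<Integer> offsetEpoch,
                                               Optional<Integer> currentLeaderEpoch) {
        if (offsetEpoch.isPresent() && currentLeaderEpoch.isPresent()) {
            return Optional.of(Math.max(offsetEpoch.get(), currentLeaderEpoch.get()));
        }
        return currentLeaderEpoch.isPresent() ? currentLeaderEpoch : offsetEpoch;
    }

    public static void main(String[] args) {
        // Stale offsetEpoch (5) vs. fresh leader epoch (7): the heuristic picks 7
        System.out.println(pickEpochToCommit(Optional.of(5), Optional.of(7)));   // Optional[7]
        // Only offsetEpoch known: fall back to it
        System.out.println(pickEpochToCommit(Optional.of(5), Optional.empty())); // Optional[5]
        // Neither known: commit with no epoch
        System.out.println(pickEpochToCommit(Optional.empty(), Optional.empty()));
    }
}
```

This isolates why the heuristic is only a mitigation: if both inputs are stale, the maximum is still stale.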
> ----
> h3. Recommendation
> Primary recommendation: Implement Option 1 (update currentLeader during
> position updates). This is the most robust solution because:
> # It ensures currentLeader is always fresh
> # It fixes the root cause rather than working around symptoms
> # It has minimal performance impact
> # It makes the codebase more consistent and maintainable
> Secondary recommendation: Implement Option 3 as a defense-in-depth measure.
> Even with Option 1, using max(offsetEpoch, currentLeader.epoch) in
> allConsumed() provides additional safety against edge cases where one epoch
> is more up to date than the other.
> Combined approach (strongest protection):
> {code:java}
> // In FetchCollector.java
> if (nextInLineFetch.nextFetchOffset() > position.offset) {
>     Metadata.LeaderAndEpoch currentLeaderAndEpoch = metadata.currentLeader(tp);
>     SubscriptionState.FetchPosition nextPosition = new SubscriptionState.FetchPosition(
>         nextInLineFetch.nextFetchOffset(),
>         nextInLineFetch.lastEpoch(),
>         currentLeaderAndEpoch); // ✅ Keep currentLeader fresh
>     subscriptions.position(tp, nextPosition);
>     positionAdvanced = true;
> }
>
> // In SubscriptionState.java
> public synchronized Map<TopicPartition, OffsetAndMetadata> allConsumed() {
>     Map<TopicPartition, OffsetAndMetadata> allConsumed = new HashMap<>();
>     assignment.forEach((topicPartition, partitionState) -> {
>         if (partitionState.hasValidPosition()) {
>             // Use the maximum epoch as defense-in-depth
>             Optional<Integer> epochToCommit;
>             if (partitionState.position.offsetEpoch.isPresent() &&
>                     partitionState.position.currentLeader.epoch.isPresent()) {
>                 epochToCommit = Optional.of(Math.max(
>                     partitionState.position.offsetEpoch.get(),
>                     partitionState.position.currentLeader.epoch.get()));
>             } else {
>                 epochToCommit = partitionState.position.currentLeader.epoch.isPresent()
>                     ? partitionState.position.currentLeader.epoch
>                     : partitionState.position.offsetEpoch;
>             }
>
>             allConsumed.put(topicPartition, new OffsetAndMetadata(
>                 partitionState.position.offset,
>                 epochToCommit,
>                 ""));
>         }
>     });
>     return allConsumed;
> } {code}
> This combined approach provides:
> * Prevention: Keep currentLeader fresh during normal operation
> * Defense: Use the best available epoch value at commit time
> * Resilience: Minimize the window where a stale epoch can cause issues
> ----
> h2. Additional Notes
> Why consumers don't log NOT_LEADER_OR_FOLLOWER errors: all consumer-side
> handling of NOT_LEADER_OR_FOLLOWER errors uses DEBUG-level logging:
> {code:java}
> // FetchCollector.java line 325
> log.debug("Error in fetch for partition {}: {}", tp, error.exceptionName());
>
> // AbstractFetch.java line 207
> log.debug("For {}, received error {}, with leaderIdAndEpoch {}", partition, partitionError, ...);
>
> // OffsetsForLeaderEpochUtils.java line 102
> LOG.debug("Attempt to fetch offsets for partition {} failed due to {}, retrying.", ...); {code}
>
> This makes the issue difficult to diagnose in production environments.
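As a diagnostic stopgap, these loggers can be raised selectively. A sketch in log4j 1.x properties syntax (the package name is real; whether a given deployment uses log4j 1.x, log4j2, or another SLF4J backend is an assumption, and the equivalent setting differs per backend):

```properties
# Surface the consumer-side leader-change handling that is otherwise hidden at DEBUG.
# Scoping to the internals package avoids DEBUG noise from the whole client.
log4j.logger.org.apache.kafka.clients.consumer.internals=DEBUG
```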
> ----
> h2. Workarounds (Until Fixed)
> # Increase retention period to reduce likelihood of epoch deletion
> # Monitor consumer lag to ensure it stays low
> # Reduce rebalance frequency (increase max.poll.interval.ms,
> session.timeout.ms)
> # Use cooperative rebalance strategy to minimize rebalance impact
> # Consider using auto.offset.reset=latest if reprocessing is more costly
> than data loss (application-dependent)
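The tuning knobs above can be sketched as consumer configuration. The keys are standard Kafka consumer settings, but the concrete values are illustrative assumptions to be tuned per deployment, not recommendations:

```java
import java.util.Properties;

public class WorkaroundConfig {
    // Illustrative consumer settings for the workarounds listed above.
    static Properties workaroundProps() {
        Properties props = new Properties();
        props.setProperty("max.poll.interval.ms", "600000"); // fewer rebalances from slow polls (assumed value)
        props.setProperty("session.timeout.ms", "45000");    // tolerate longer pauses (assumed value)
        // Cooperative rebalancing keeps most partitions assigned across a rebalance
        props.setProperty("partition.assignment.strategy",
                "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");
        // Application-dependent trade-off: skip forward instead of reprocessing
        props.setProperty("auto.offset.reset", "latest");
        return props;
    }

    public static void main(String[] args) {
        workaroundProps().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```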
> ----
> h2. Related Code References
> h3. 1. The problematic method: SubscriptionState.allConsumed()
> Location: org.apache.kafka.clients.consumer.internals.SubscriptionState
> {code:java}
> public synchronized Map<TopicPartition, OffsetAndMetadata> allConsumed() {
>     Map<TopicPartition, OffsetAndMetadata> allConsumed = new HashMap<>();
>     assignment.forEach((topicPartition, partitionState) -> {
>         if (partitionState.hasValidPosition())
>             allConsumed.put(topicPartition, new OffsetAndMetadata(
>                 partitionState.position.offset,
>                 partitionState.position.offsetEpoch, // Uses offsetEpoch instead of currentLeader.epoch
>                 ""));
>     });
>     return allConsumed;
> } {code}
> h3. 2. How FetchPosition is updated during normal consumption
> Location: org.apache.kafka.clients.consumer.internals.FetchCollector
> {code:java}
> if (nextInLineFetch.nextFetchOffset() > position.offset) {
>     SubscriptionState.FetchPosition nextPosition = new SubscriptionState.FetchPosition(
>         nextInLineFetch.nextFetchOffset(),
>         nextInLineFetch.lastEpoch(),  // offsetEpoch: from the consumed batch
>         position.currentLeader);      // currentLeader: inherited from the old position, NOT updated!
>     log.trace("Updating fetch position from {} to {} for partition {} and returning {} records from `poll()`",
>         position, nextPosition, tp, partRecords.size());
>     subscriptions.position(tp, nextPosition);
>     positionAdvanced = true;
> } {code}
> Key Issue: The currentLeader field is inherited from the previous position
> and not automatically updated during normal consumption. It only gets updated
> when leader change errors are detected.
> h3. 3. How committed offsets are restored after rebalance
> Location:
> org.apache.kafka.clients.consumer.internals.ConsumerUtils.refreshCommittedOffsets()
> {code:java}
> public static void refreshCommittedOffsets(final Map<TopicPartition, OffsetAndMetadata> offsetsAndMetadata,
>                                            final ConsumerMetadata metadata,
>                                            final SubscriptionState subscriptions) {
>     for (final Map.Entry<TopicPartition, OffsetAndMetadata> entry : offsetsAndMetadata.entrySet()) {
>         final TopicPartition tp = entry.getKey();
>         final OffsetAndMetadata offsetAndMetadata = entry.getValue();
>         if (offsetAndMetadata != null) {
>             // first update the epoch if necessary
>             entry.getValue().leaderEpoch().ifPresent(epoch ->
>                 metadata.updateLastSeenEpochIfNewer(entry.getKey(), epoch));
>
>             // it's possible that the partition is no longer assigned when the response is received,
>             // so we need to ignore seeking if that's the case
>             if (subscriptions.isAssigned(tp)) {
>                 final ConsumerMetadata.LeaderAndEpoch leaderAndEpoch = metadata.currentLeader(tp);
>                 final SubscriptionState.FetchPosition position = new SubscriptionState.FetchPosition(
>                     offsetAndMetadata.offset(),
>                     offsetAndMetadata.leaderEpoch(), // offsetEpoch from the committed offset (may be old)
>                     leaderAndEpoch);                 // currentLeader from current metadata (may be new)
>
>                 subscriptions.seekUnvalidated(tp, position);
>
>                 log.info("Setting offset for partition {} to the committed offset {}", tp, position);
>             }
>         }
>     }
> } {code}
> The Divergence Point: When restoring from committed offsets, offsetEpoch
> comes from the stored offset (potentially old), while currentLeader comes
> from fresh metadata (potentially new after leader change).
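The divergence can be made concrete with plain types standing in for the real Kafka classes; the Position record below is a stand-in for SubscriptionState.FetchPosition, and the names and values are illustrative:

```java
import java.util.Optional;

public class DivergenceDemo {
    // Stand-in for SubscriptionState.FetchPosition: carries both epochs.
    record Position(long offset, Optional<Integer> offsetEpoch, int currentLeaderEpoch) {}

    public static void main(String[] args) {
        // Before the leader change: offset 100 was committed with epoch 5
        Optional<Integer> committedEpoch = Optional.of(5);

        // After a rebalance plus leader change, fresh metadata reports leader
        // epoch 7, but the restored position keeps the stored offsetEpoch (5)
        int freshLeaderEpoch = 7;
        Position restored = new Position(100L, committedEpoch, freshLeaderEpoch);

        // allConsumed() commits offsetEpoch, so the stale value 5 is re-committed
        // even though the position also carries the fresh leader epoch 7
        System.out.println("committed epoch = " + restored.offsetEpoch().get()
                + ", current leader epoch = " + restored.currentLeaderEpoch());
    }
}
```

If the broker has since deleted epoch 5's metadata through retention, the next validation against the stale epoch fails, which is the OFFSET_OUT_OF_RANGE path described above.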
> h3. 4. How OffsetsForLeaderEpoch validation request is constructed
> Location:
> org.apache.kafka.clients.consumer.internals.OffsetsForLeaderEpochUtils.prepareRequest()
> {code:java}
> static AbstractRequest.Builder<OffsetsForLeaderEpochRequest> prepareRequest(
>         Map<TopicPartition, SubscriptionState.FetchPosition> requestData) {
>     OffsetForLeaderTopicCollection topics = new OffsetForLeaderTopicCollection(requestData.size());
>     requestData.forEach((topicPartition, fetchPosition) ->
>         fetchPosition.offsetEpoch.ifPresent(fetchEpoch -> {
>             OffsetForLeaderTopic topic = topics.find(topicPartition.topic());
>             if (topic == null) {
>                 topic = new OffsetForLeaderTopic().setTopic(topicPartition.topic());
>                 topics.add(topic);
>             }
>             topic.partitions().add(new OffsetForLeaderPartition()
>                 .setPartition(topicPartition.partition())
>                 .setLeaderEpoch(fetchEpoch) // Uses offsetEpoch for validation
>                 .setCurrentLeaderEpoch(fetchPosition.currentLeader.epoch
>                     .orElse(RecordBatch.NO_PARTITION_LEADER_EPOCH))
>             );
>         })
>     );
>     return OffsetsForLeaderEpochRequest.Builder.forConsumer(topics);
> } {code}
> The Validation Problem: The validation request uses fetchEpoch (which is
> offsetEpoch) to validate against the broker. If this epoch no longer exists
> in the broker's log, validation fails and triggers OFFSET_OUT_OF_RANGE.
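The failure mode can be sketched with a toy leader-epoch cache: a sorted map of epoch to start offset standing in for the broker's leader-epoch checkpoint after retention has deleted old entries. All names here are illustrative, and the lookup only roughly mirrors OffsetsForLeaderEpoch semantics:

```java
import java.util.Optional;
import java.util.TreeMap;

public class EpochValidationDemo {
    // Toy stand-in for the broker's leader-epoch cache after old segments
    // (and their epoch entries) were deleted by retention.
    static final TreeMap<Integer, Long> EPOCH_CACHE = new TreeMap<>();

    // Returns the end offset for the requested epoch, or empty if the epoch
    // is no longer known to the broker.
    static Optional<Long> endOffsetFor(int requestedEpoch) {
        if (EPOCH_CACHE.isEmpty() || requestedEpoch < EPOCH_CACHE.firstKey()) {
            return Optional.empty(); // epoch deleted by retention -> validation fails
        }
        Integer next = EPOCH_CACHE.higherKey(requestedEpoch);
        // End offset of an epoch is the start offset of the next epoch (or log end)
        return Optional.of(next == null ? Long.MAX_VALUE : EPOCH_CACHE.get(next));
    }

    public static void main(String[] args) {
        EPOCH_CACHE.put(6, 500L); // epoch 6 starts at offset 500
        EPOCH_CACHE.put(7, 900L); // epoch 7 starts at offset 900

        // Consumer validates with stale offsetEpoch = 5: the broker no longer
        // knows it, which is the path that surfaces as OFFSET_OUT_OF_RANGE
        System.out.println("epoch 5 -> " + endOffsetFor(5));
        // Validating with the fresh leader epoch 7 would succeed
        System.out.println("epoch 7 -> " + endOffsetFor(7));
    }
}
```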
> h3. 5. FetchPosition class definition
> Location:
> org.apache.kafka.clients.consumer.internals.SubscriptionState.FetchPosition
>
> {code:java}
> /**
>  * Represents the position of a partition subscription.
>  *
>  * This includes the offset and epoch from the last record in the batch from
>  * a FetchResponse. It also includes the leader epoch at the time the batch was consumed.
>  */
> public static class FetchPosition {
>     public final long offset;
>     final Optional<Integer> offsetEpoch;         // Epoch from the last consumed record's batch
>     final Metadata.LeaderAndEpoch currentLeader; // Current partition leader info from metadata
>
>     FetchPosition(long offset) {
>         this(offset, Optional.empty(), Metadata.LeaderAndEpoch.noLeaderOrEpoch());
>     }
>
>     public FetchPosition(long offset, Optional<Integer> offsetEpoch, Metadata.LeaderAndEpoch currentLeader) {
>         this.offset = offset;
>         this.offsetEpoch = Objects.requireNonNull(offsetEpoch);
>         this.currentLeader = Objects.requireNonNull(currentLeader);
>     }
>
>     @Override
>     public String toString() {
>         return "FetchPosition{" +
>                 "offset=" + offset +
>                 ", offsetEpoch=" + offsetEpoch +
>                 ", currentLeader=" + currentLeader +
>                 '}';
>     }
> }{code}
> Class Design: The class contains both offsetEpoch (historical data epoch) and
> currentLeader.epoch (current metadata epoch), but allConsumed() only uses the
> former when committing.
> h1. Attached logs:
> h2. group rebalance log
> !image-2025-11-21-18-03-24-592.png!
> h2. consumer client log
> !image-2025-11-21-20-00-10-862.png!
> h2. broker server log
> !image-2025-11-21-18-11-24-316.png!
--
This message was sent by Atlassian Jira
(v8.20.10#820010)