Kyle Phelps created KAFKA-17249:
-----------------------------------
Summary: Failures when building remote log aux state can make the
leader epoch cache inconsistent
Key: KAFKA-17249
URL: https://issues.apache.org/jira/browse/KAFKA-17249
Project: Kafka
Issue Type: Bug
Components: Tiered-Storage
Affects Versions: 3.7.1, 3.8.0
Reporter: Kyle Phelps
When a follower has to build the remote log aux state
(`buildRemoteLogAuxState`), it first truncates the local log. It then
attempts to rebuild the leader epoch cache from the checkpoint in remote
storage. However, if this rebuild fails and the broker is restarted, the
cache is left missing the entries associated with remote segments.
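The sequence above can be sketched with a toy model. This is not Kafka's code: `ToyFollower`, `build_aux_state`, and `unavailable_checkpoint` are hypothetical names, and the assumption that truncation empties the epoch cache is made only for illustration. The point is that the truncation takes effect before the remote read, so a failed read leaves the advanced start offset in place with no epoch entries for the remote range.

```python
# Toy model only: hypothetical names, not Kafka's actual classes or methods.
from dataclasses import dataclass, field


class RemoteStorageError(Exception):
    """Raised when the remote checkpoint cannot be read."""


@dataclass
class ToyFollower:
    # Each entry is (epoch, start_offset), ordered by start_offset.
    epoch_cache: list = field(default_factory=list)
    local_log_start_offset: int = 0

    def build_aux_state(self, new_start_offset, fetch_remote_checkpoint):
        # Step 1 (assumption): truncating the local log also clears the
        # epoch cache and moves the local start offset forward.
        self.epoch_cache = []
        self.local_log_start_offset = new_start_offset
        # Step 2: rebuild the cache from the remote checkpoint. If this
        # raises, the effects of step 1 are not rolled back.
        self.epoch_cache = list(fetch_remote_checkpoint())


def unavailable_checkpoint():
    # Stands in for the index file having been moved out of remote storage.
    raise RemoteStorageError("index file missing from remote storage")


follower = ToyFollower(epoch_cache=[(0, 0), (1, 100)])
try:
    follower.build_aux_state(250, unavailable_checkpoint)
except RemoteStorageError:
    pass  # the follower would retry, or the broker would be restarted

print(follower.local_log_start_offset, follower.epoch_cache)
```

In this model the follower ends up with a start offset of 250 and an empty cache, so anything it fetches afterwards only adds epoch entries for offsets at or beyond 250.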
Reproduction steps:
# Take an existing tiered storage partition and move the latest index file
out of remote storage so that it is inaccessible.
# Stop one of the follower brokers and delete the partition's local data.
# Restart the follower. It should now fail to build the aux state.
# Restart the follower again. Since the log's offsets have been updated, it
can now successfully fetch and join the ISR.
# Promote the follower to the leader.
In this scenario, the new leader is unable to serve tiered fetch requests.
I _think_ the root of the problem here is that the leader epoch cache never
recovers the epoch data for remote segments.
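A minimal sketch of why missing entries would matter, assuming the leader has to resolve the epoch for a requested offset before it can locate the matching remote segment. The `epoch_for_offset` helper and the sample offsets are hypothetical and only illustrate the suspected cause. They are not taken from the broker.

```python
# Toy lookup only: hypothetical helper and data, not Kafka's implementation.
from bisect import bisect_right


def epoch_for_offset(epoch_cache, offset):
    """Return the epoch whose range covers `offset`, or None if no entry does.

    `epoch_cache` is a list of (epoch, start_offset) sorted by start_offset.
    """
    start_offsets = [start for _, start in epoch_cache]
    idx = bisect_right(start_offsets, offset) - 1
    return epoch_cache[idx][0] if idx >= 0 else None


# Cache as it would look if the remote checkpoint had been restored.
complete_cache = [(0, 0), (1, 100), (2, 250)]
# Cache holding only the entries learned after the failed rebuild.
cache_after_failed_rebuild = [(2, 250)]

offset_in_remote_segment = 120
print(epoch_for_offset(complete_cache, offset_in_remote_segment))
print(epoch_for_offset(cache_after_failed_rebuild, offset_in_remote_segment))
```

With the complete cache the lookup yields epoch 1. With the truncated cache it yields `None`, which in this model corresponds to a leader that cannot map the requested offset to a remote segment.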