[
https://issues.apache.org/jira/browse/KAFKA-15414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764143#comment-17764143
]
Francois Visconte commented on KAFKA-15414:
-------------------------------------------
Not sure it's the same issue happening again, but I see some strange behaviour
while trying to reassign my partitions while consuming from the past (and
hitting tiered storage).
It seems that at some point my consumer offset lag goes backward:
!Screenshot 2023-09-12 at 13.53.07.png|width=1355,height=191!
And I get a burst of errors like the following on a handful of partitions (3
partitions out of 32):
{code:java}
[ReplicaFetcher replicaId=10002, leaderId=10007, fetcherId=2] Error building remote log auxiliary state for loadtest14-21
org.apache.kafka.server.log.remote.storage.RemoteStorageException: Couldn't build the state from remote store for partition: loadtest14-21, currentLeaderEpoch: 13, leaderLocalLogStartOffset: 81012034, leaderLogStartOffset: 0, epoch: 12 as the previous remote log segment metadata was not found
	at kafka.server.ReplicaFetcherTierStateMachine.buildRemoteLogAuxState(ReplicaFetcherTierStateMachine.java:252)
	at kafka.server.ReplicaFetcherTierStateMachine.start(ReplicaFetcherTierStateMachine.java:102)
	at kafka.server.AbstractFetcherThread.handleOffsetsMovedToTieredStorage(AbstractFetcherThread.scala:761)
	at kafka.server.AbstractFetcherThread.$anonfun$processFetchRequest$7(AbstractFetcherThread.scala:412)
	at scala.Option.foreach(Option.scala:437)
	at kafka.server.AbstractFetcherThread.$anonfun$processFetchRequest$6(AbstractFetcherThread.scala:332)
	at kafka.server.AbstractFetcherThread.$anonfun$processFetchRequest$6$adapted(AbstractFetcherThread.scala:331)
	at kafka.utils.Implicits$MapExtensionMethods$.$anonfun$forKeyValue$1(Implicits.scala:62)
	at scala.collection.convert.JavaCollectionWrappers$JMapWrapperLike.foreachEntry(JavaCollectionWrappers.scala:407)
	at scala.collection.convert.JavaCollectionWrappers$JMapWrapperLike.foreachEntry$(JavaCollectionWrappers.scala:403)
	at scala.collection.convert.JavaCollectionWrappers$AbstractJMapWrapper.foreachEntry(JavaCollectionWrappers.scala:321)
	at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:331)
	at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3(AbstractFetcherThread.scala:130)
	at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3$adapted(AbstractFetcherThread.scala:129)
	at scala.Option.foreach(Option.scala:437)
	at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:129)
	at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:112)
	at kafka.server.ReplicaFetcherThread.doWork(ReplicaFetcherThread.scala:98)
	at org.apache.kafka.server.util.ShutdownableThread.run(ShutdownableThread.java:130)
{code}
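
For reference, here is a minimal sketch (standard Java AdminClient; the broker
address is a placeholder) of one way to check which start offsets the leader
reports for an affected partition, to compare against the
leaderLogStartOffset/leaderLocalLogStartOffset values in the error above:
{code:java}
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.common.TopicPartition;

import java.util.Map;
import java.util.Properties;

public class CheckOffsets {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        try (Admin admin = Admin.create(props)) {
            TopicPartition tp = new TopicPartition("loadtest14", 21);
            ListOffsetsResult.ListOffsetsResultInfo earliest =
                    admin.listOffsets(Map.of(tp, OffsetSpec.earliest())).partitionResult(tp).get();
            ListOffsetsResult.ListOffsetsResultInfo latest =
                    admin.listOffsets(Map.of(tp, OffsetSpec.latest())).partitionResult(tp).get();
            // The earliest offset is the leader's log start offset (0 in the error above);
            // everything below the local log start offset (81012034) should only exist in tiered storage.
            System.out.printf("log start offset = %d, log end offset = %d%n",
                    earliest.offset(), latest.offset());
        }
    }
}
{code}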
> remote logs get deleted after partition reassignment
> ----------------------------------------------------
>
> Key: KAFKA-15414
> URL: https://issues.apache.org/jira/browse/KAFKA-15414
> Project: Kafka
> Issue Type: Bug
> Reporter: Luke Chen
> Assignee: Kamal Chandraprakash
> Priority: Blocker
> Fix For: 3.6.0
>
> Attachments: Screenshot 2023-09-12 at 13.53.07.png,
> image-2023-08-29-11-12-58-875.png
>
>
> It seems I'm reaching that codepath when running reassignments on my cluster,
> and segments are deleted from the remote store despite a huge retention (the
> topic was created a few hours ago with 1000h retention).
> It seems to happen consistently on some partitions when reassigning, but not
> on all partitions.
> My test:
> I have a test topic with 30 partitions, configured with 1000h global retention
> and 2 minutes of local retention.
> I have a load tester producing to all partitions evenly.
> I have a consumer load tester consuming that topic.
> I regularly reset offsets to earliest on my consumer to test backfilling from
> tiered storage.
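>
> For reference, a rough sketch of that topic setup (a minimal example using the
> standard Java AdminClient; the topic name, broker address, and replication
> factor are placeholders, not the exact values from my test):
> {code:java}
> import org.apache.kafka.clients.admin.Admin;
> import org.apache.kafka.clients.admin.AdminClientConfig;
> import org.apache.kafka.clients.admin.NewTopic;
>
> import java.util.List;
> import java.util.Map;
> import java.util.Properties;
>
> public class CreateLoadtestTopic {
>     public static void main(String[] args) throws Exception {
>         Properties props = new Properties();
>         props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
>         try (Admin admin = Admin.create(props)) {
>             // 30 partitions, tiered storage enabled, 1000h global retention, 2 min local retention
>             NewTopic topic = new NewTopic("loadtest", 30, (short) 3)
>                     .configs(Map.of(
>                             "remote.storage.enable", "true",
>                             "retention.ms", "3600000000",     // 1000h
>                             "local.retention.ms", "120000")); // 2 minutes
>             admin.createTopics(List.of(topic)).all().get();
>         }
>     }
> }
> {code}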
> My consumer was catching up on the backlog and I wanted to upscale my cluster
> to speed up recovery: I scaled my cluster from 3 to 12 brokers and reassigned
> my test topic across all available brokers to get an even leader/follower
> count per broker.
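>
> A rough sketch of the reassignment step itself (again the Java AdminClient;
> broker ids, topic name, and the round-robin placement are placeholders for
> whatever plan the reassignment tool generates):
> {code:java}
> import org.apache.kafka.clients.admin.Admin;
> import org.apache.kafka.clients.admin.AdminClientConfig;
> import org.apache.kafka.clients.admin.NewPartitionReassignment;
> import org.apache.kafka.common.TopicPartition;
>
> import java.util.ArrayList;
> import java.util.HashMap;
> import java.util.List;
> import java.util.Map;
> import java.util.Optional;
> import java.util.Properties;
>
> public class ReassignLoadtestTopic {
>     public static void main(String[] args) throws Exception {
>         List<Integer> brokers = List.of(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11); // placeholder ids
>         Map<TopicPartition, Optional<NewPartitionReassignment>> plan = new HashMap<>();
>         for (int p = 0; p < 30; p++) {
>             List<Integer> replicas = new ArrayList<>();
>             for (int r = 0; r < 3; r++) {
>                 replicas.add(brokers.get((p + r) % brokers.size())); // simple round-robin spread
>             }
>             plan.put(new TopicPartition("loadtest", p),
>                     Optional.of(new NewPartitionReassignment(replicas)));
>         }
>         Properties props = new Properties();
>         props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
>         try (Admin admin = Admin.create(props)) {
>             admin.alterPartitionReassignments(plan).all().get();
>         }
>     }
> }
> {code}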
> When I triggered the reassignment, the consumer lag dropped on some of my
> topic partitions:
> !image-2023-08-29-11-12-58-875.png|width=800,height=79!
> Later I tried to reassign back my topic to 3 brokers and the issue happened
> again.
> Both times, I've seen a bunch of log lines like:
> [RemoteLogManager=10005 partition=uR3O_hk3QRqsn4mPXGFoOw:loadtest11-17]
> Deleted remote log segment
> RemoteLogSegmentId{topicIdPartition=uR3O_hk3QRqsn4mPXGFoOw:loadtest11-17, id=Mk0chBQrTyKETTawIulQog}
> due to leader epoch cache truncation. Current earliest epoch:
> EpochEntry(epoch=14, startOffset=46776780), segmentEndOffset: 46437796 and
> segmentEpochs: [10]
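>
> To make the numbers in that log line concrete, here is how I read the check
> (a sketch of my interpretation only, an assumption about the cleanup logic,
> not code taken from the broker):
> {code:java}
> import java.util.List;
>
> // Hedged sketch: names and the exact condition are assumptions, not Kafka's actual code.
> public class EpochTruncationCheck {
>     public static void main(String[] args) {
>         int earliestEpochInCache = 14;              // EpochEntry(epoch=14, startOffset=46776780)
>         List<Integer> segmentEpochs = List.of(10);  // segmentEpochs: [10] from the log line
>
>         // Every epoch the segment covers is older than the earliest epoch the leader still
>         // tracks, so the segment looks like it is outside the current leader-epoch lineage
>         // and gets deleted, regardless of the 1000h retention.ms.
>         boolean outsideEpochHistory =
>                 segmentEpochs.stream().allMatch(e -> e < earliestEpochInCache);
>         System.out.println("deleted due to leader epoch cache truncation? " + outsideEpochHistory);
>     }
> }
> {code}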
> Looking at my S3 bucket, the segments prior to my reassignment have indeed
> been deleted.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)