hudeqi commented on code in PR #14652: URL: https://github.com/apache/kafka/pull/14652#discussion_r1398774988
########## core/src/main/java/kafka/server/ReplicaFetcherTierStateMachine.java: ########## @@ -238,12 +238,8 @@ private Long buildRemoteLogAuxState(TopicPartition topicPartition, log.debug("Updated the epoch cache from remote tier till offset: {} with size: {} for {}", leaderLocalLogStartOffset, epochs.size(), partition); - // Restore producer snapshot - File snapshotFile = LogFileUtils.producerSnapshotFile(unifiedLog.dir(), nextOffset); - buildProducerSnapshotFile(snapshotFile, remoteLogSegmentMetadata, rlm); - // Reload producer snapshots. - unifiedLog.producerStateManager().truncateFullyAndReloadSnapshots(); + truncateFullyAndReloadRestoredSnapshots(unifiedLog, nextOffset, remoteLogSegmentMetadata, rlm); Review Comment: > @hudeqi > > > In producerStateManager, first call truncateFullyAndStartAt to clean up the snapshot files, and then pull the snapshot file from RemoteLogManager. If the order of calls is reversed in original logic, it may cause the newly built snapshot file to be cleaned up again by truncateFullyAndStartAt. > > The `truncateFullyAndReloadSnapshots` only removes the snapshot files that was already loaded into the ProducerStateManager, so it doesn't remove the `snapshotFile` that was downloaded/built from the remote storage. > > AFAIR, we don't want to expose reloading the snapshots from ProducerStateManager without clearing it's internal states (`ProducerStateManager#reloadSnapshots` method). We can add a comment int this method to capture this information for future readers. Hi, @kamalcph , the reason why such changes are made in ReplicaFetcherTierStateMachine here is because of a relatively coincidental boundary case: if the name of the file pulled from the remote storage is the same as the name of the local file before truncate (that is, the file name pulled from the remote storage is in the keySet of the `snapshots` map), the snapshot file built first will be cleared by the later `truncateFullyAndStartAt`. I was adding the current unit test and found that the test was failed, so I discovered this issue. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org