ZuebeyirEser commented on issue #2526:
URL: https://github.com/apache/fluss/issues/2526#issuecomment-3836596606
Hi, I have successfully reproduced this issue and identified the root cause.
### Root Cause
The issue is caused by the `ReplicaManager` incorrectly deleting remote
snapshots during a local replica stop operation (which occurs during
rebalancing).
I traced the log `Delete table's remote bucket snapshot dir... success` to
`ReplicaManager#stopReplica`. The code explicitly deletes the remote snapshot
whenever `delete=true` and the replica is the leader:
```java
// Original code in ReplicaManager.java
remoteLogManager.stopReplica(replicaToDelete, delete && replicaToDelete.isLeader());
if (delete && replicaToDelete.isLeader()) {
    // This incorrectly deletes shared remote files during local cleanup
    kvManager.deleteRemoteKvSnapshot(
            replicaToDelete.getPhysicalTablePath(),
            replicaToDelete.getTableBucket());
}
```
When a rebalance occurs, the primary key table's leader replica is stopped on
the old node with `delete=true` to clean up local resources. However, the code
above also deletes the shared remote snapshot, so the new leader (on a
different node) fails with `KvSnapshotNotExistException` when it tries to load
that data.
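For illustration, the failure ordering looks roughly like this (the method
names below are hypothetical shorthand, not the actual Fluss call chain):
```java
// Illustrative ordering only; method names are hypothetical shorthand.
// Step 1: rebalance asks the old leader to stop its replica with delete=true.
replicaManager.stopReplica(replicaToDelete, /* delete = */ true);
//   -> kvManager.deleteRemoteKvSnapshot(...) removes the SHARED remote snapshot.

// Step 2: the new leader on another node tries to restore from that snapshot.
newLeader.initKvFromRemoteSnapshot(tableBucket);
//   -> throws KvSnapshotNotExistException, because step 1 already deleted it.
```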
### Fix
I have verified that removing the `kvManager.deleteRemoteKvSnapshot(...)` call
fixes the issue. Remote snapshots should only be deleted when the table is
explicitly dropped, not during replica migration.
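Concretely, the patched `stopReplica` path would look roughly like this (a
minimal sketch derived from the snippet above, not the exact upstream diff):
```java
// Sketch of the fix in ReplicaManager.java, derived from the snippet above.
// Local log/KV resources are still cleaned up when delete=true, but the
// remote snapshot is left untouched: it is shared across nodes and must
// survive leader migration. It should only be removed when the table
// itself is dropped.
remoteLogManager.stopReplica(replicaToDelete, delete && replicaToDelete.isLeader());
// The kvManager.deleteRemoteKvSnapshot(...) call that used to follow here
// is removed.
```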
I have written a regression test, `KvSnapshotDeletionBugReplicationTest`, that
verifies `dropKv` (used during rebalance) now correctly cleans up local files
while preserving the remote snapshot.
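A simplified outline of that test looks like the following (the helpers
`createPrimaryKeyTableWithSnapshot`, `stopAndDeleteReplica`,
`localKvDirExists`, and `remoteSnapshotExists` are placeholders for the actual
Fluss test utilities, and the JUnit 5/AssertJ setup is an assumption):
```java
import static org.assertj.core.api.Assertions.assertThat;

import org.junit.jupiter.api.Test;

class KvSnapshotDeletionBugReplicationTest {

    @Test
    void dropKvPreservesRemoteSnapshot() throws Exception {
        // Placeholder helper: create a primary key table and force a
        // completed remote KV snapshot for one of its buckets.
        TableBucket bucket = createPrimaryKeyTableWithSnapshot();

        // Simulate the rebalance path: stop the leader replica with delete=true.
        stopAndDeleteReplica(bucket);

        // Local KV files must be gone after dropKv...
        assertThat(localKvDirExists(bucket)).isFalse();
        // ...but the shared remote snapshot must still exist for the new leader.
        assertThat(remoteSnapshotExists(bucket)).isTrue();
    }
}
```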
Could you please assign this issue to me? I will submit a PR with the fix
shortly.