yungkei opened a new issue, #6479: URL: https://github.com/apache/paimon/issues/6479
### Search before asking

- [x] I searched in the [issues](https://github.com/apache/paimon/issues) and found nothing similar.

### Paimon version

I found this bug in version 1.9.0, and it still exists on the master branch.

### Compute Engine

Flink 1.16, Spark 3.3.1

### Minimal reproduce step

If the baseManifestList or deltaManifestList associated with a tag is deleted in advance, data files are mistakenly deleted during tag cleaning. This can cause data corruption, especially since the data file is associated with the earliest snapshot.

1. Delete the baseManifestList or deltaManifestList associated with the tag. The premise is that the tag expiration time is greater than the snapshot expiration time.
2. Run the tag expiration program.
3. Query the data of the current snapshot or the earliest snapshot. The query fails with a FileNotFoundException for an ORC file.

### What doesn't meet your expectations?

This issue results in data file loss and makes Paimon unavailable.

### Anything else?

When a tag expires, the left neighbor tag and the nearest right neighbor tag are collected into the skipping set to prevent data files from being mistakenly deleted. If the baseManifestList of the nearest right neighbor tag does not exist, the relevant data files are deleted by accident. I therefore suggest that the skipping set collect the earliest snapshot in addition to the left neighbor tag and the nearest right neighbor tag.

### Are you willing to submit a PR?

- [x] I'm willing to submit a PR!
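
To make the skipping-set argument from "Anything else?" concrete, here is a minimal toy model in Python. It is not Paimon code: the names `read_data_files`, `build_skip_set`, `files_to_delete`, the manifest-list names, and the file names are all hypothetical, and treating a missing manifest list as "references nothing" is an assumption made only for illustration.

```python
from typing import Dict, Iterable, Optional, Set

# Hypothetical model: manifest-list name -> data files it references.
ManifestStore = Dict[str, Set[str]]


def read_data_files(store: ManifestStore, manifest_list: str) -> Set[str]:
    """Return the referenced data files; a missing manifest list yields an
    empty set (assumption: a missing list is silently treated as empty)."""
    return set(store.get(manifest_list, set()))


def build_skip_set(
    store: ManifestStore,
    neighbor_manifest_lists: Iterable[str],
    earliest_snapshot_manifest_list: Optional[str] = None,
) -> Set[str]:
    """Collect data files that must survive tag expiration."""
    skip: Set[str] = set()
    for name in neighbor_manifest_lists:
        skip |= read_data_files(store, name)
    # Suggested change: also protect files used by the earliest snapshot.
    if earliest_snapshot_manifest_list is not None:
        skip |= read_data_files(store, earliest_snapshot_manifest_list)
    return skip


def files_to_delete(store: ManifestStore, expired: str, skip: Set[str]) -> Set[str]:
    """Data files of the expired tag that are not protected by the skip set."""
    return read_data_files(store, expired) - skip


# The right neighbor's manifest list ("tag-right-base") is missing from the store.
store: ManifestStore = {
    "tag-left-base": {"a.orc"},
    "tag-expired-base": {"a.orc", "b.orc"},
    "snapshot-earliest-base": {"b.orc", "c.orc"},
}
neighbors = ["tag-left-base", "tag-right-base"]

current_skip = build_skip_set(store, neighbors)
proposed_skip = build_skip_set(store, neighbors, "snapshot-earliest-base")

deleted_now = files_to_delete(store, "tag-expired-base", current_skip)
deleted_proposed = files_to_delete(store, "tag-expired-base", proposed_skip)

print(sorted(deleted_now))       # b.orc is deleted although the earliest snapshot needs it
print(sorted(deleted_proposed))  # nothing is deleted
```

In this model, the unreadable right neighbor contributes nothing to the skipping set, so `b.orc` is removed even though the earliest snapshot still references it. Adding the earliest snapshot to the skipping set keeps that file.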
