[
https://issues.apache.org/jira/browse/HUDI-4515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
zxy updated HUDI-4515:
----------------------
Description:
When I tested the interaction between clean and savepoint, I found that when
clean runs with the keep-latest-versions policy, savepointed files are deleted.
After reading the code, I believe this is a bug.
For example, using "HoodieCleaningPolicy.KEEP_LATEST_FILE_VERSIONS" with
"hoodie.cleaner.fileversions.retained" set to 2, I do the following:
1. insert, get xxxx_001.parquet
2. savepoint
3. insert, get xxxx_002.parquet
4. insert, get xxxx_003.parquet
After the fourth step, xxxx_001.parquet is deleted even though it belongs to a
savepoint!
The relevant code is getFilesToCleanKeepingLatestVersions in
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java:
* According to the following code, on the one hand, savepointed files among the
most recent versions are skipped without being counted toward keepVersions,
which seems unreasonable.
* On the other hand, if a savepointed file is among the remaining versions, it
is deleted, which does not match the design philosophy of savepoints.
{code:java}
while (fileSliceIterator.hasNext() && keepVersions > 0) {
  // Skip this most recent version
  FileSlice nextSlice = fileSliceIterator.next();
  Option<HoodieBaseFile> dataFile = nextSlice.getBaseFile();
  if (dataFile.isPresent() && savepointedFiles.contains(dataFile.get().getFileName())) {
    // do not clean up a savepoint data file
    continue;
  }
  keepVersions--;
}
// Delete the remaining files
while (fileSliceIterator.hasNext()) {
  FileSlice nextSlice = fileSliceIterator.next();
  deletePaths.addAll(getCleanFileInfoForSlice(nextSlice));
}
{code}
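To make the failure mode concrete, here is a minimal, self-contained Java sketch of the loop above. Plain strings stand in for FileSlice/HoodieBaseFile, the file names and savepoint set are hypothetical, and it assumes the iterator yields file slices newest first (as the "Skip this most recent version" comment suggests):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

public class CleanSimulation {

    // Mirrors the buggy loop: the savepoint check only runs while
    // keepVersions is being counted down, never in the delete loop.
    static List<String> buggyClean(List<String> newestFirst,
                                   Set<String> savepointedFiles,
                                   int keepVersions) {
        List<String> deletePaths = new ArrayList<>();
        Iterator<String> it = newestFirst.iterator();
        while (it.hasNext() && keepVersions > 0) {
            String file = it.next();
            if (savepointedFiles.contains(file)) {
                continue; // skipped, but not counted toward keepVersions
            }
            keepVersions--;
        }
        // Delete the remaining files -- with no savepoint check
        while (it.hasNext()) {
            deletePaths.add(it.next());
        }
        return deletePaths;
    }

    public static void main(String[] args) {
        List<String> deleted = buggyClean(
            List.of("xxxx_003.parquet", "xxxx_002.parquet", "xxxx_001.parquet"),
            Set.of("xxxx_001.parquet"),
            2);
        System.out.println(deleted); // prints [xxxx_001.parquet]
    }
}
```

With fileversions.retained = 2, the first loop consumes xxxx_003 and xxxx_002, so the savepointed xxxx_001 only ever reaches the unconditional delete loop.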
So I think the savepoint check should be moved down into the second loop; it
could be fixed like this:
{code:java}
while (fileSliceIterator.hasNext() && keepVersions > 0) {
  // Skip this most recent version
  fileSliceIterator.next();
  keepVersions--;
}
// Delete the remaining files
while (fileSliceIterator.hasNext()) {
  FileSlice nextSlice = fileSliceIterator.next();
  Option<HoodieBaseFile> dataFile = nextSlice.getBaseFile();
  if (dataFile.isPresent() && savepointedFiles.contains(dataFile.get().getFileName())) {
    // do not clean up a savepoint data file
    continue;
  }
  deletePaths.addAll(getCleanFileInfoForSlice(nextSlice));
}
{code}
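Running the same hypothetical scenario through the proposed ordering (again a plain-string sketch, not the real Hudi types) shows the savepointed file surviving while the keepVersions newest files are retained:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

public class FixedCleanSimulation {

    // Mirrors the proposed fix: retain the keepVersions newest slices
    // unconditionally, then apply the savepoint check while deleting.
    static List<String> fixedClean(List<String> newestFirst,
                                   Set<String> savepointedFiles,
                                   int keepVersions) {
        List<String> deletePaths = new ArrayList<>();
        Iterator<String> it = newestFirst.iterator();
        while (it.hasNext() && keepVersions > 0) {
            it.next(); // skip this most recent version
            keepVersions--;
        }
        // Delete the remaining files, except savepointed ones
        while (it.hasNext()) {
            String file = it.next();
            if (savepointedFiles.contains(file)) {
                continue; // do not clean up a savepointed data file
            }
            deletePaths.add(file);
        }
        return deletePaths;
    }

    public static void main(String[] args) {
        List<String> deleted = fixedClean(
            List.of("xxxx_003.parquet", "xxxx_002.parquet", "xxxx_001.parquet"),
            Set.of("xxxx_001.parquet"),
            2);
        System.out.println(deleted); // prints []
    }
}
```

In the example from the report, xxxx_003 and xxxx_002 are retained by the first loop, and xxxx_001 is spared by the savepoint check, so nothing is deleted.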
Thanks.
was:
When I tested the interaction between clean and savepoint, I found that when
clean runs with the keep-latest-versions policy, savepointed files are deleted.
After reading the code, I believe this is a bug.
For example, if I set the cleaner's retained file versions to 2, I do the following:
1. insert, get xxxx_001.parquet
2. savepoint
3. insert, get xxxx_002.parquet
4. insert, get xxxx_003.parquet
After the fourth step, xxxx_001.parquet is deleted even though it belongs to a
savepoint.
The relevant code is getFilesToCleanKeepingLatestVersions in
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java
* According to the following code, on the one hand, savepointed files among the
most recent versions are skipped without being counted toward keepVersions,
which seems unreasonable.
* On the other hand, if a savepointed file is among the remaining versions, it
is deleted, which does not match the design philosophy of savepoints.
!image-2022-08-01-16-48-16-901.png|width=572,height=446!
So I think the savepoint check should be moved down into the second loop; it
could be fixed like this:
{code:java}
while (fileSliceIterator.hasNext() && keepVersions > 0) {
  // Skip this most recent version
  fileSliceIterator.next();
  keepVersions--;
}
// Delete the remaining files
while (fileSliceIterator.hasNext()) {
  FileSlice nextSlice = fileSliceIterator.next();
  Option<HoodieBaseFile> dataFile = nextSlice.getBaseFile();
  if (dataFile.isPresent() && savepointedFiles.contains(dataFile.get().getFileName())) {
    // do not clean up a savepoint data file
    continue;
  }
  deletePaths.addAll(getCleanFileInfoForSlice(nextSlice));
}
{code}
> savepoints will be cleaned under the keep-latest-versions policy
> -----------------------------------------------------------------
>
> Key: HUDI-4515
> URL: https://issues.apache.org/jira/browse/HUDI-4515
> Project: Apache Hudi
> Issue Type: Bug
> Components: cleaning
> Affects Versions: 0.11.1
> Reporter: zxy
> Priority: Blocker
> Labels: bug, clean, pull-request-available
> Attachments: image-2022-08-01-16-48-16-901.png
>
>
> When I tested the interaction between clean and savepoint, I found that when
> clean runs with the keep-latest-versions policy, savepointed files are
> deleted. After reading the code, I believe this is a bug.
>
> For example, using "HoodieCleaningPolicy.KEEP_LATEST_FILE_VERSIONS" with
> "hoodie.cleaner.fileversions.retained" set to 2, I do the following:
> 1. insert, get xxxx_001.parquet
> 2. savepoint
> 3. insert, get xxxx_002.parquet
> 4. insert, get xxxx_003.parquet
> After the fourth step, xxxx_001.parquet is deleted even though it belongs to
> a savepoint!
>
> The relevant code is getFilesToCleanKeepingLatestVersions in
> hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java:
> * According to the following code, on the one hand, savepointed files among
> the most recent versions are skipped without being counted toward
> keepVersions, which seems unreasonable.
> * On the other hand, if a savepointed file is among the remaining versions,
> it is deleted, which does not match the design philosophy of savepoints.
> {code:java}
> while (fileSliceIterator.hasNext() && keepVersions > 0) {
>   // Skip this most recent version
>   FileSlice nextSlice = fileSliceIterator.next();
>   Option<HoodieBaseFile> dataFile = nextSlice.getBaseFile();
>   if (dataFile.isPresent() && savepointedFiles.contains(dataFile.get().getFileName())) {
>     // do not clean up a savepoint data file
>     continue;
>   }
>   keepVersions--;
> }
> // Delete the remaining files
> while (fileSliceIterator.hasNext()) {
>   FileSlice nextSlice = fileSliceIterator.next();
>   deletePaths.addAll(getCleanFileInfoForSlice(nextSlice));
> }
> {code}
>
> So I think the savepoint check should be moved down into the second loop; it
> could be fixed like this:
> {code:java}
> while (fileSliceIterator.hasNext() && keepVersions > 0) {
>   // Skip this most recent version
>   fileSliceIterator.next();
>   keepVersions--;
> }
> // Delete the remaining files
> while (fileSliceIterator.hasNext()) {
>   FileSlice nextSlice = fileSliceIterator.next();
>   Option<HoodieBaseFile> dataFile = nextSlice.getBaseFile();
>   if (dataFile.isPresent() && savepointedFiles.contains(dataFile.get().getFileName())) {
>     // do not clean up a savepoint data file
>     continue;
>   }
>   deletePaths.addAll(getCleanFileInfoForSlice(nextSlice));
> }
> {code}
>
> Thanks.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)