[ 
https://issues.apache.org/jira/browse/HUDI-4515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zxy updated HUDI-4515:
----------------------
    Description: 
While testing how clean interacts with savepoints, I found that when the 
cleaning policy is keeping latest versions, files that belong to a savepoint 
are deleted. After reading the code, I believe this is a bug.

 

For example, with the cleaner set to keep 2 versions, I do the following:
1. insert, which produces xxxx_001.parquet
2. savepoint
3. insert, which produces xxxx_002.parquet
4. insert, which produces xxxx_003.parquet
After the fourth step, xxxx_001.parquet is deleted even though it belongs to 
the savepoint.

 

The relevant code is the getFilesToCleanKeepingLatestVersions method in 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java
 * According to the code in the screenshot below, a savepointed file that falls 
within the newest keepversion versions is skipped and is not counted toward 
keepversion, which I think is unreasonable.
 * On the other hand, a savepointed file among the remaining older versions is 
deleted, which I don't think is in line with the design philosophy of 
savepoints.

!image-2022-08-01-16-48-16-901.png|width=572,height=446!
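
The screenshot is not reproduced in this message. As a rough illustration of the behavior the two bullets describe, here is a minimal, self-contained Java sketch. It is reconstructed from the description only and is not copied from CleanPlanner: file slices are reduced to plain file-name strings, and the class name DescribedCleanSketch and its filesToDelete helper are hypothetical.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Hypothetical model of the behavior described in the two bullets.
// A "file slice" is reduced to its base file name (assumption).
class DescribedCleanSketch {

    // slicesNewestFirst: versions of one file group, most recent first.
    static List<String> filesToDelete(List<String> slicesNewestFirst,
                                      Set<String> savepointedFiles,
                                      int keepVersions) {
        List<String> deletePaths = new ArrayList<>();
        Iterator<String> it = slicesNewestFirst.iterator();
        // Retention loop: the savepoint check sits here, so a savepointed
        // file is skipped without being counted toward keepVersions.
        while (it.hasNext() && keepVersions > 0) {
            String file = it.next();
            if (savepointedFiles.contains(file)) {
                continue;
            }
            keepVersions--;
        }
        // Deletion loop: no savepoint check, so every remaining file,
        // savepointed or not, is scheduled for deletion.
        while (it.hasNext()) {
            deletePaths.add(it.next());
        }
        return deletePaths;
    }

    public static void main(String[] args) {
        // Scenario from this report: keep 2 versions, savepoint on _001.
        System.out.println(filesToDelete(
            List.of("xxxx_003.parquet", "xxxx_002.parquet", "xxxx_001.parquet"),
            Set.of("xxxx_001.parquet"),
            2)); // prints [xxxx_001.parquet]
    }
}
```

Under these assumptions the sketch reproduces the report: the savepointed xxxx_001.parquet ends up in the delete list.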

So I think the savepoint check should be moved down into the deletion loop. 
It can be fixed like this:

 
{code:java}
while (fileSliceIterator.hasNext() && keepVersions > 0) {
  // Skip this most recent version
  fileSliceIterator.next();
  keepVersions--;
}
// Delete the remaining files
while (fileSliceIterator.hasNext()) {
  FileSlice nextSlice = fileSliceIterator.next();
  Option<HoodieBaseFile> dataFile = nextSlice.getBaseFile();
  if (dataFile.isPresent() && savepointedFiles.contains(dataFile.get().getFileName())) {
    // do not clean up a savepoint data file
    continue;
  }
  deletePaths.addAll(getCleanFileInfoForSlice(nextSlice));
}
{code}
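
To check the proposed ordering against the example in this report, the same two loops can be modeled on plain file names. This is only a sketch under stated assumptions: the class name ProposedCleanSketch and its filesToDelete helper are hypothetical, a file slice is reduced to its base file name, and log files and whatever else getCleanFileInfoForSlice handles are ignored.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Hypothetical model of the proposed fix; not the actual CleanPlanner code.
class ProposedCleanSketch {

    // slicesNewestFirst: versions of one file group, most recent first.
    static List<String> filesToDelete(List<String> slicesNewestFirst,
                                      Set<String> savepointedFiles,
                                      int keepVersions) {
        List<String> deletePaths = new ArrayList<>();
        Iterator<String> it = slicesNewestFirst.iterator();
        // Retention loop: every version counts toward keepVersions,
        // whether or not it is savepointed.
        while (it.hasNext() && keepVersions > 0) {
            it.next();
            keepVersions--;
        }
        // Deletion loop: older versions are deleted unless savepointed.
        while (it.hasNext()) {
            String file = it.next();
            if (savepointedFiles.contains(file)) {
                continue; // do not clean up a savepointed data file
            }
            deletePaths.add(file);
        }
        return deletePaths;
    }

    public static void main(String[] args) {
        // Scenario from this report: keep 2 versions, savepoint on _001.
        System.out.println(filesToDelete(
            List.of("xxxx_003.parquet", "xxxx_002.parquet", "xxxx_001.parquet"),
            Set.of("xxxx_001.parquet"),
            2)); // prints []
    }
}
```

In this model the savepointed xxxx_001.parquet survives, and a savepointed file among the newest versions is counted toward keepVersions like any other version.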
 

 

  was:
While testing how clean interacts with savepoints, I found that when the 
cleaning policy is keeping latest versions, files that belong to a savepoint 
are deleted. After reading the code, I believe this is a bug.

The relevant code is the getFilesToCleanKeepingLatestVersions method in 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java

!image-2022-08-01-16-48-16-901.png|width=572,height=446!

It can be fixed like this:

 
{code:java}
while (fileSliceIterator.hasNext() && keepVersions > 0) {
  // Skip this most recent version
  fileSliceIterator.next();
  keepVersions--;
}
// Delete the remaining files
while (fileSliceIterator.hasNext()) {
  FileSlice nextSlice = fileSliceIterator.next();
  Option<HoodieBaseFile> dataFile = nextSlice.getBaseFile();
  if (dataFile.isPresent() && savepointedFiles.contains(dataFile.get().getFileName())) {
    // do not clean up a savepoint data file
    continue;
  }
  deletePaths.addAll(getCleanFileInfoForSlice(nextSlice));
}
{code}
 

 


> savepoints will be clean in keeping latest versions policy
> ----------------------------------------------------------
>
>                 Key: HUDI-4515
>                 URL: https://issues.apache.org/jira/browse/HUDI-4515
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: cleaning
>    Affects Versions: 0.11.1
>            Reporter: zxy
>            Priority: Blocker
>              Labels: bug, clean, pull-request-available
>         Attachments: image-2022-08-01-16-48-16-901.png
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
