hussein-awala opened a new issue, #6953:
URL: https://github.com/apache/hudi/issues/6953

   **Describe the problem you faced**
   
   As I understand it, the 
[CleanPlanner](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java)
 prepares the list of files to delete by listing the files in each partition and 
checking whether any of them can be deleted under the configured cleaner policy. If 
the incremental cleaner mode is enabled and metadata from a previous clean 
operation is present in the timeline, it reads the starting instant of the previous 
clean from the avro file and checks only [the partitions that have changed since 
the last 
clean](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java#L171).
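
The partition selection described above can be sketched roughly as follows. This is a simplified model and not Hudi's actual code: the `Commit` type, the `partitionsToCheck` method and the "strictly after" comparison are assumptions made for illustration, and the real planner may compare instants differently.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.TreeSet;

/** Simplified model of incremental partition selection; not Hudi's actual code. */
final class IncrementalPartitionSelection {

    /** Hypothetical timeline entry: an instant time and the partitions it wrote to. */
    static final class Commit {
        final String instant;
        final Set<String> partitions;

        Commit(String instant, Set<String> partitions) {
            this.instant = instant;
            this.partitions = partitions;
        }
    }

    /**
     * Returns the partitions a clean would have to list.
     * Without metadata from a previous clean, every partition is listed (brute force);
     * otherwise only partitions written after the previous clean's starting instant.
     */
    static Set<String> partitionsToCheck(List<Commit> timeline,
                                         Set<String> allPartitions,
                                         Optional<String> previousCleanInstant) {
        if (!previousCleanInstant.isPresent()) {
            return new TreeSet<>(allPartitions);
        }
        Set<String> changed = new TreeSet<>();
        for (Commit commit : timeline) {
            // Assumes fixed-width instant strings, so lexicographic order is time order.
            if (commit.instant.compareTo(previousCleanInstant.get()) > 0) {
                changed.addAll(commit.partitions);
            }
        }
        return changed;
    }
}
```

With an empty `previousCleanInstant` the method returns every partition, which is the situation this issue is about.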
   
   But if, after listing all the files in the partitions (every partition by brute 
force the first time, or only the partitions that have changed since a previous 
clean), there is 
no file to delete, the CleanPlanActionExecutor [will not create a clean 
request](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java#L149-L156)
 avro file, so the state is never transitioned to inflight or completed by 
the 
[CleanActionExecutor](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java#L195).
 As a result, the next clean checks the same partitions that were already checked 
in this clean, even if they haven't changed since, and we do not 
benefit from the incremental cleaner mode.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Enable cleaning with the incremental cleaner mode and use the 
KEEP_LATEST_COMMITS policy with 1 retained commit
   2. Write data to 10 different partitions in 10 different commits
   3. Check whether a clean avro metadata file is created in the timeline
   4. Compare the time spent listing the files to delete across the 
different commits
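
For step 1, the cleaner settings could be collected as below. The key names are the Hudi cleaner configs as I understand them for 0.12 and are worth double-checking against the configuration reference; the class and method names are hypothetical, and the other options a write needs (table name, record key, partition path, and so on) are omitted.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Cleaner-related writer options for the reproduction (illustrative helper). */
final class CleanerReproOptions {

    static Map<String, String> cleanerOptions() {
        Map<String, String> opts = new LinkedHashMap<>();
        opts.put("hoodie.clean.automatic", "true");               // enable the clean after commits
        opts.put("hoodie.cleaner.policy", "KEEP_LATEST_COMMITS"); // cleaner policy from step 1
        opts.put("hoodie.cleaner.commits.retained", "1");         // 1 retained commit
        opts.put("hoodie.cleaner.incremental.mode", "true");      // incremental cleaner mode
        return opts;
    }

    public static void main(String[] args) {
        // Print the options in key=value form.
        cleanerOptions().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

In a Spark job this map would typically be merged into the options passed to the Hudi writer.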
   
   **Expected behavior**
   
   Even when there is nothing to delete, a `.clean` avro metadata file should be 
created and added to the timeline with the clean's start time, so that the next 
clean can use it and avoid re-checking all the partitions.
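
The difference between the current and the expected behavior can be illustrated with a toy simulation of the reproduction steps. It is only a model under stated assumptions: each commit writes one file to its own new partition, every clean finds nothing to delete, and the `recordEmptyCleans` flag stands in for a hypothetical fix. None of the names below exist in Hudi.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the reproduction steps; not Hudi's actual planner. */
final class EmptyCleanSimulation {

    /**
     * Simulates `commits` commits, each writing to a new partition and each
     * followed by a clean that finds nothing to delete.
     * Returns how many partitions each clean has to list.
     *
     * @param recordEmptyCleans false: no clean metadata is written when nothing
     *                          is deleted (behavior described in this issue);
     *                          true: an empty clean is still recorded (expected).
     */
    static List<Integer> partitionsListedPerClean(int commits, boolean recordEmptyCleans) {
        List<Integer> listed = new ArrayList<>();
        int lastRecordedClean = 0; // last commit covered by recorded clean metadata; 0 = none
        for (int commit = 1; commit <= commits; commit++) {
            // Commit number `commit` created partition number `commit`.
            if (lastRecordedClean == 0) {
                listed.add(commit);                     // no metadata: list every partition
            } else {
                listed.add(commit - lastRecordedClean); // only partitions written since then
            }
            // Nothing is deletable, so metadata is written only in the expected behavior.
            if (recordEmptyCleans) {
                lastRecordedClean = commit;
            }
        }
        return listed;
    }

    public static void main(String[] args) {
        System.out.println("current:  " + partitionsListedPerClean(10, false));
        System.out.println("expected: " + partitionsListedPerClean(10, true));
    }
}
```

For 10 commits the model lists 1, 2, ..., 10 partitions per clean (55 partition listings in total) when empty cleans are not recorded, and 1 partition per clean (10 in total) when they are.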
   
   **Environment Description**
   
   * Hudi version : 0.12.0
   
   * Spark version : 3.2.1
   
   * Storage (HDFS/S3/GCS..) : S3
   
   **Additional context**
   
   I have already checked the recent PRs that improve cleaning 
([PR1](https://github.com/apache/hudi/pull/6890) and 
[PR2](https://github.com/apache/hudi/pull/6548)), but they don't solve this 
problem.
   I'm willing to submit a PR to fix the problem once it is confirmed.


