sivabalan narayanan created HUDI-6155:
-----------------------------------------

             Summary: Fix cleaner based on hours for earliest commit to retain
                 Key: HUDI-6155
                 URL: https://issues.apache.org/jira/browse/HUDI-6155
             Project: Apache Hudi
          Issue Type: Bug
          Components: cleaning
            Reporter: sivabalan narayanan


When the cleaner uses the hours-based policy (KEEP_LATEST_BY_HOURS), the 
earliest commit to retain is computed from the current time in the system 
default time zone, not in UTC or whichever time zone was used to generate the 
commit times. When the two differ, the cutoff can be miscalculated and the 
cleaner may delete more file slices than intended.
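
To illustrate the failure mode, here is a minimal, self-contained sketch. It is not Hudi code: the class name, the cutoff() and isRetained() helpers, and the yyyyMMddHHmmssSSS layout are assumptions made for the example. It assumes commit times were written in UTC and compares them as plain strings against a cutoff formatted in two different zones.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class CleanerCutoffZoneDemo {
    // Assumed commit-time layout for this sketch (hypothetical stand-in, not Hudi's formatter).
    static final DateTimeFormatter COMMIT_FORMAT =
        DateTimeFormatter.ofPattern("yyyyMMddHHmmssSSS");

    // Formats "now minus hoursRetained" as a commit-time string in the given zone.
    static String cutoff(Instant now, int hoursRetained, ZoneId zone) {
        return now.atZone(zone).minusHours(hoursRetained).format(COMMIT_FORMAT);
    }

    // A commit is retained when its timestamp string is >= the cutoff string.
    static boolean isRetained(String commitTime, String cutoff) {
        return commitTime.compareTo(cutoff) >= 0;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2023-05-01T12:00:00Z");
        // Commit written 4 hours ago, with its timestamp generated in UTC.
        String commit = "20230501080000000";

        String utcCutoff = cutoff(now, 6, ZoneOffset.UTC);
        String tokyoCutoff = cutoff(now, 6, ZoneId.of("Asia/Tokyo"));

        // Same instant, same retention window, different verdicts.
        System.out.println("UTC cutoff   " + utcCutoff + " retained=" + isRetained(commit, utcCutoff));
        System.out.println("Tokyo cutoff " + tokyoCutoff + " retained=" + isRetained(commit, tokyoCutoff));
    }
}
```

With a 6-hour window, the 4-hour-old commit is kept under the UTC cutoff but falls before the cutoff formatted in a zone ahead of UTC, so it would be treated as eligible for cleaning.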

 

Ref: 
[https://github.com/apache/hudi/blob/c6760772f8dc62eb44c45b022ed07858d895d804/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java#L511]

{code:java}
else if (config.getCleanerPolicy() == HoodieCleaningPolicy.KEEP_LATEST_BY_HOURS) {
  Instant instant = Instant.now();
  ZonedDateTime currentDateTime = ZonedDateTime.ofInstant(instant, ZoneId.systemDefault());
  String earliestTimeToRetain = HoodieActiveTimeline.formatDate(
      Date.from(currentDateTime.minusHours(hoursRetained).toInstant()));
  earliestCommitToRetain = Option.fromJavaOptional(commitTimeline.getInstantsAsStream()
      .filter(i -> HoodieTimeline.compareTimestamps(i.getTimestamp(),
          HoodieTimeline.GREATER_THAN_OR_EQUALS, earliestTimeToRetain))
      .findFirst());
}
{code}

Potential fixes:

- Compute the cutoff in the time zone set in the table config. 

- Fetch the latest completed commit and derive the earliest commit to retain 
from its timestamp.
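
A rough sketch of the second option, under stated assumptions: the class and method names are hypothetical, the commit list is assumed to be sorted ascending and to hold only completed commits, and the timestamp layout is assumed to be yyyyMMddHHmmssSSS. Because the cutoff is derived from the latest commit's own timestamp, no time zone enters the calculation. This is an illustration, not the actual patch.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.ChronoField;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class EarliestCommitFromLatest {
    // Assumed commit-time layout; the builder form parses fixed-width millis on all JDKs.
    static final DateTimeFormatter COMMIT_FORMAT = new DateTimeFormatterBuilder()
        .appendPattern("yyyyMMddHHmmss")
        .appendValue(ChronoField.MILLI_OF_SECOND, 3)
        .toFormatter();

    // completedCommits: ascending commit-time strings (hypothetical input for this sketch).
    static Optional<String> earliestCommitToRetain(List<String> completedCommits, int hoursRetained) {
        if (completedCommits.isEmpty()) {
            return Optional.empty();
        }
        String latest = completedCommits.get(completedCommits.size() - 1);
        // Subtract the retention window from the latest commit time, staying in
        // the same (zone-less) representation the commit times already use.
        String cutoff = LocalDateTime.parse(latest, COMMIT_FORMAT)
            .minusHours(hoursRetained)
            .format(COMMIT_FORMAT);
        // First commit at or after the cutoff is the earliest one to retain.
        return completedCommits.stream()
            .filter(c -> c.compareTo(cutoff) >= 0)
            .findFirst();
    }

    public static void main(String[] args) {
        List<String> commits = Arrays.asList(
            "20230501010000000", "20230501080000000", "20230501115900000");
        System.out.println(earliestCommitToRetain(commits, 6));
    }
}
```

One trade-off of this approach: the window is measured back from the last write, not from wall-clock time, so a table that has been idle for a while retains commits relative to its latest commit.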



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
