hudi-bot opened a new issue, #15925: URL: https://github.com/apache/hudi/issues/15925
When the cleaner policy is based on hours, we estimate the earliest commit to retain using the current system time zone, not UTC or the time zone that was used to generate the commit times. If the two differ, the retention cutoff can be miscalculated, which can lead to deleting more file slices than intended.

Ref: [https://github.com/apache/hudi/blob/c6760772f8dc62eb44c45b022ed07858d895d804/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java#L511]

{code:java}
else if (config.getCleanerPolicy() == HoodieCleaningPolicy.KEEP_LATEST_BY_HOURS) {
  Instant instant = Instant.now();
  ZonedDateTime currentDateTime = ZonedDateTime.ofInstant(instant, ZoneId.systemDefault());
  String earliestTimeToRetain = HoodieActiveTimeline.formatDate(Date.from(currentDateTime.minusHours(hoursRetained).toInstant()));
  earliestCommitToRetain = Option.fromJavaOptional(commitTimeline.getInstantsAsStream().filter(i -> HoodieTimeline.compareTimestamps(i.getTimestamp(),
      HoodieTimeline.GREATER_THAN_OR_EQUALS, earliestTimeToRetain)).findFirst());
}
{code}

Potential fixes:
- Compute the cutoff using the time zone set in the table config.
- Fetch the latest completed commit and derive the earliest commit to retain relative to that commit's timestamp.

## JIRA info

- Link: https://issues.apache.org/jira/browse/HUDI-6155
- Type: Bug
- Fix version(s):
  - 1.1.0
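To illustrate the failure mode described in this issue, here is a minimal, self-contained sketch using only `java.time`. It does not call any Hudi code. The class name, the helper methods `cutoff` and `isRetained`, the `yyyyMMddHHmmssSSS` timestamp layout, the sample instants, and the choice of `Asia/Kolkata` as the cleaner's zone are all assumptions made for illustration. The sketch assumes commit timestamps are strings compared lexicographically, in the spirit of the `GREATER_THAN_OR_EQUALS` comparison in the snippet above.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

class RetentionCutoffSketch {
    // Assumed timestamp layout, for illustration only.
    static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("yyyyMMddHHmmssSSS");

    // Hypothetical helper: renders (now - hoursRetained) as a timestamp string in the given zone.
    static String cutoff(Instant now, int hoursRetained, ZoneId zone) {
        return FMT.withZone(zone).format(now.minus(Duration.ofHours(hoursRetained)));
    }

    // Hypothetical helper: a commit is inside the retention window if its
    // timestamp string sorts at or after the cutoff string.
    static boolean isRetained(String commitTimestamp, String cutoffTimestamp) {
        return commitTimestamp.compareTo(cutoffTimestamp) >= 0;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2023-05-01T12:00:00Z");
        int hoursRetained = 24;

        // A commit written 22 hours before "now", with its timestamp rendered in UTC.
        String commit = FMT.withZone(ZoneOffset.UTC).format(Instant.parse("2023-04-30T14:00:00Z"));

        // Cutoff rendered in the same zone as the commit timestamps (UTC).
        String utcCutoff = cutoff(now, hoursRetained, ZoneOffset.UTC);
        // Cutoff rendered in a different zone (UTC+05:30), standing in for the system default.
        String localCutoff = cutoff(now, hoursRetained, ZoneId.of("Asia/Kolkata"));

        System.out.println("commit       = " + commit);       // 20230430140000000
        System.out.println("UTC cutoff   = " + utcCutoff);    // 20230430120000000
        System.out.println("local cutoff = " + localCutoff);  // 20230430173000000

        // The commit is only 22 hours old, so it should fall inside a 24-hour window.
        System.out.println("retained with UTC cutoff:   " + isRetained(commit, utcCutoff));   // true
        System.out.println("retained with local cutoff: " + isRetained(commit, localCutoff)); // false
    }
}
```

In this sketch the same instant produces two different cutoff strings, and the one rendered in the zone ahead of UTC pushes a 22-hour-old commit outside the 24-hour window. Rendering the cutoff in the same zone as the commit timestamps, which is the idea behind the first proposed fix, avoids the mismatch.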
