[
https://issues.apache.org/jira/browse/HUDI-6155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HUDI-6155:
---------------------------------
Labels: pull-request-available (was: )
> Fix cleaner based on hours for earliest commit to retain
> --------------------------------------------------------
>
> Key: HUDI-6155
> URL: https://issues.apache.org/jira/browse/HUDI-6155
> Project: Apache Hudi
> Issue Type: Bug
> Components: cleaning
> Reporter: sivabalan narayanan
> Priority: Critical
> Labels: pull-request-available
> Fix For: 0.14.0
>
>
> When the cleaner policy is based on hours, we compute the earliest commit to
> retain using the system's current time zone rather than UTC or the time zone
> used to generate the commit times. This can lead to miscalculations and to
> deleting additional file slices.
>
> Ref:
> [https://github.com/apache/hudi/blob/c6760772f8dc62eb44c45b022ed07858d895d804/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java#L511]
>
> {code:java}
> else if (config.getCleanerPolicy() == HoodieCleaningPolicy.KEEP_LATEST_BY_HOURS) {
>   Instant instant = Instant.now();
>   ZonedDateTime currentDateTime = ZonedDateTime.ofInstant(instant, ZoneId.systemDefault());
>   String earliestTimeToRetain =
>       HoodieActiveTimeline.formatDate(Date.from(currentDateTime.minusHours(hoursRetained).toInstant()));
>   earliestCommitToRetain = Option.fromJavaOptional(
>       commitTimeline.getInstantsAsStream()
>           .filter(i -> HoodieTimeline.compareTimestamps(
>               i.getTimestamp(), HoodieTimeline.GREATER_THAN_OR_EQUALS, earliestTimeToRetain))
>           .findFirst());
> }
> {code}
>
>
> Potential fixes:
> - Compute the cutoff time using the time zone set in the table config.
> - Fetch the latest completed commit and derive the earliest commit to retain
> from that.
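
The two potential fixes can be sketched in plain Java as follows. This is only an illustration under stated assumptions, not the actual Hudi patch: the class `CleanerCutoffSketch` and its methods are hypothetical names, commit times are assumed to be `yyyyMMddHHmmss` strings followed by three millisecond digits, and the timeline is modeled as a sorted list of such strings rather than Hudi's own timeline classes.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.ChronoField;
import java.time.temporal.ChronoUnit;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not part of Hudi.
final class CleanerCutoffSketch {
    // Assumed commit-time layout: yyyyMMddHHmmss plus three millisecond digits.
    static final DateTimeFormatter INSTANT_FORMAT = new DateTimeFormatterBuilder()
        .appendPattern("yyyyMMddHHmmss")
        .appendValue(ChronoField.MILLI_OF_SECOND, 3)
        .toFormatter();

    // Fix 1: format "now - hoursRetained" in the zone the commit times were generated in.
    static String cutoffFromClock(Instant now, int hoursRetained, ZoneId timelineZone) {
        return INSTANT_FORMAT.format(now.minus(hoursRetained, ChronoUnit.HOURS).atZone(timelineZone));
    }

    // Fix 2: derive the cutoff from the latest completed commit, so no wall clock
    // or zone conversion is involved.
    static String cutoffFromLatestCommit(String latestCompleted, int hoursRetained) {
        LocalDateTime latest = LocalDateTime.parse(latestCompleted, INSTANT_FORMAT);
        return INSTANT_FORMAT.format(latest.minusHours(hoursRetained));
    }

    // First commit whose time is >= cutoff; fixed-width digit strings compare chronologically.
    static Optional<String> earliestCommitToRetain(List<String> sortedCommitTimes, String cutoff) {
        return sortedCommitTimes.stream().filter(t -> t.compareTo(cutoff) >= 0).findFirst();
    }

    public static void main(String[] args) {
        // Commit times assumed to have been generated in UTC.
        List<String> commits =
            Arrays.asList("20230430100000000", "20230430150000000", "20230501110000000");
        Instant now = Instant.parse("2023-05-01T12:00:00Z");

        String utcCutoff = cutoffFromClock(now, 24, ZoneId.of("UTC"));
        String mismatchedCutoff = cutoffFromClock(now, 24, ZoneId.of("Asia/Kolkata"));
        String commitCutoff = cutoffFromLatestCommit(commits.get(commits.size() - 1), 24);

        System.out.println(utcCutoff + " -> " + earliestCommitToRetain(commits, utcCutoff));
        System.out.println(mismatchedCutoff + " -> " + earliestCommitToRetain(commits, mismatchedCutoff));
        System.out.println(commitCutoff + " -> " + earliestCommitToRetain(commits, commitCutoff));
    }
}
```

With a cutoff formatted in a zone ahead of the one used for commit times, the cutoff string lands later than intended and the middle commit in this example falls outside the retention window, which is the kind of over-cleaning the issue describes. Deriving the cutoff from the latest completed commit keeps the whole comparison in the timeline's own time domain, at the cost of measuring retention from the last commit rather than from the current time.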