sivabalan narayanan created HUDI-6155:
-----------------------------------------
Summary: Fix cleaner based on hours for earliest commit to retain
Key: HUDI-6155
URL: https://issues.apache.org/jira/browse/HUDI-6155
Project: Apache Hudi
Issue Type: Bug
Components: cleaning
Reporter: sivabalan narayanan
When the cleaner policy is based on hours (KEEP_LATEST_BY_HOURS), we compute
the earliest commit to retain using the system default time zone, not UTC or
the time zone used to generate the commit times. If the two differ, the cutoff
can be miscalculated, and the cleaner may delete more slices than intended.
Ref:
[https://github.com/apache/hudi/blob/c6760772f8dc62eb44c45b022ed07858d895d804/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java#L511]
{code:java}
else if (config.getCleanerPolicy() == HoodieCleaningPolicy.KEEP_LATEST_BY_HOURS) {
  Instant instant = Instant.now();
  ZonedDateTime currentDateTime = ZonedDateTime.ofInstant(instant, ZoneId.systemDefault());
  String earliestTimeToRetain =
      HoodieActiveTimeline.formatDate(Date.from(currentDateTime.minusHours(hoursRetained).toInstant()));
  earliestCommitToRetain = Option.fromJavaOptional(commitTimeline.getInstantsAsStream()
      .filter(i -> HoodieTimeline.compareTimestamps(i.getTimestamp(),
          HoodieTimeline.GREATER_THAN_OR_EQUALS, earliestTimeToRetain))
      .findFirst());
}
{code}
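To illustrate the kind of error described here, the following standalone sketch uses only java.time and no Hudi classes. It is not a trace of the actual Hudi code path. The class name, the method name and the yyyyMMddHHmmssSSS commit time layout are assumptions made for the example. It shows that rendering the same cutoff instant in a zone other than the one used for commit times shifts the string comparison, so a commit that is inside the retention window can fall before the cutoff.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical demo class; not part of Hudi.
class CutoffZoneSkewDemo {
    // Assumed commit time layout with millisecond granularity.
    static final DateTimeFormatter COMMIT_TIME_FORMAT =
        DateTimeFormatter.ofPattern("yyyyMMddHHmmssSSS");

    // Renders (now - hoursRetained) as a commit-time-style string in the given zone.
    static String cutoff(Instant now, long hoursRetained, ZoneId zone) {
        return now.minusSeconds(hoursRetained * 3600L).atZone(zone).format(COMMIT_TIME_FORMAT);
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2023-05-01T12:00:00Z");
        long hoursRetained = 5;

        // Commit written 3 hours before "now", with its timestamp generated in UTC.
        String commitTime = "20230501090000000";

        String utcCutoff = cutoff(now, hoursRetained, ZoneOffset.UTC);
        String localCutoff = cutoff(now, hoursRetained, ZoneId.of("Asia/Kolkata"));

        System.out.println(utcCutoff);   // 20230501070000000
        System.out.println(localCutoff); // 20230501123000000

        // Same zone as the commit times: the commit is at or after the cutoff, so it is retained.
        System.out.println(commitTime.compareTo(utcCutoff) >= 0);   // true
        // Mismatched zone: the commit sorts before the cutoff and becomes eligible for cleaning.
        System.out.println(commitTime.compareTo(localCutoff) >= 0); // false
    }
}
```

With a 5 hour retention, the 3 hour old commit should be kept, but the cutoff rendered in the mismatched zone places it outside the window.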
Potential fixes:
- Compute the cutoff time in the time zone set in the table config.
- Fetch the latest completed commit and derive the earliest commit to retain
relative to its commit time.
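Both options could look roughly like the sketch below. It is a minimal illustration using only java.time, not a patch against CleanPlanner. The class, the method names and the tableZone parameter (standing in for whatever the table config provides) are hypothetical, and the commit time layout (yyyyMMddHHmmss plus three millisecond digits) is an assumption.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.ChronoField;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch class; not part of Hudi.
class HoursBasedCutoffSketch {
    // Assumed commit time layout. Built with appendValue for the millis so that
    // parsing also works on Java 8, where the plain "SSS" pattern fails to parse.
    static final DateTimeFormatter COMMIT_TIME_FORMAT = new DateTimeFormatterBuilder()
        .appendPattern("yyyyMMddHHmmss")
        .appendValue(ChronoField.MILLI_OF_SECOND, 3)
        .toFormatter();

    // Option 1: render the cutoff in the zone the table uses for commit times.
    static String cutoffInTableZone(Instant now, long hoursRetained, ZoneId tableZone) {
        return now.atZone(tableZone).minusHours(hoursRetained).format(COMMIT_TIME_FORMAT);
    }

    // Option 2: derive the cutoff from the latest completed commit time,
    // so no zone lookup is needed at all.
    static String cutoffFromLatestCommit(String latestCompletedCommitTime, long hoursRetained) {
        LocalDateTime latest = LocalDateTime.parse(latestCompletedCommitTime, COMMIT_TIME_FORMAT);
        return latest.minusHours(hoursRetained).format(COMMIT_TIME_FORMAT);
    }

    // First commit at or after the cutoff; assumes commitTimes is sorted ascending.
    static Optional<String> earliestToRetain(List<String> commitTimes, String cutoff) {
        return commitTimes.stream().filter(t -> t.compareTo(cutoff) >= 0).findFirst();
    }

    public static void main(String[] args) {
        List<String> commits = Arrays.asList(
            "20230501020000000", "20230501060000000",
            "20230501090000000", "20230501110000000");

        String byZone = cutoffInTableZone(Instant.parse("2023-05-01T12:00:00Z"), 5, ZoneOffset.UTC);
        String byLatest = cutoffFromLatestCommit(commits.get(commits.size() - 1), 5);

        System.out.println(earliestToRetain(commits, byZone));
        System.out.println(earliestToRetain(commits, byLatest));
    }
}
```

The second option measures the retention window from the last completed commit rather than from wall-clock time, so the two options can pick different commits when the table has been idle for a while.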