[ 
https://issues.apache.org/jira/browse/HUDI-6155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-6155:
----------------------------
    Fix Version/s: 0.14.0

> Fix cleaner based on hours for earliest commit to retain
> --------------------------------------------------------
>
>                 Key: HUDI-6155
>                 URL: https://issues.apache.org/jira/browse/HUDI-6155
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: cleaning
>            Reporter: sivabalan narayanan
>            Priority: Major
>             Fix For: 0.14.0
>
>
> When the cleaner policy is based on hours, we compute the earliest commit to
> retain using the system default time zone, not UTC or the time zone used to
> generate the commit times. The cutoff can therefore be miscalculated, which
> may lead to deleting more file slices than intended.
>  
> Ref: 
> [https://github.com/apache/hudi/blob/c6760772f8dc62eb44c45b022ed07858d895d804/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java#L511]
>  
> {code:java}
> else if (config.getCleanerPolicy() == HoodieCleaningPolicy.KEEP_LATEST_BY_HOURS) {
>   Instant instant = Instant.now();
>   ZonedDateTime currentDateTime = ZonedDateTime.ofInstant(instant, ZoneId.systemDefault());
>   String earliestTimeToRetain = HoodieActiveTimeline.formatDate(
>       Date.from(currentDateTime.minusHours(hoursRetained).toInstant()));
>   earliestCommitToRetain = Option.fromJavaOptional(commitTimeline.getInstantsAsStream()
>       .filter(i -> HoodieTimeline.compareTimestamps(i.getTimestamp(),
>           HoodieTimeline.GREATER_THAN_OR_EQUALS, earliestTimeToRetain))
>       .findFirst());
> }
> {code}
>  
>  
> Potential fixes:
> - Compute the cutoff in the time zone set in the table config.
> - Fetch the latest completed commit and derive the earliest commit to retain
> from its timestamp.
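
The two potential fixes could look roughly like the sketch below. It is a
standalone illustration only and uses plain java.time, not any Hudi API. The
class and method names (RetentionCutoffSketch, cutoff, cutoffFromLatestCommit,
earliestCommitToRetain) are hypothetical, and the yyyyMMddHHmmssSSS commit time
layout is an assumption made for the example.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.ChronoField;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

class RetentionCutoffSketch {
  // Assumed commit time layout: yyyyMMddHHmmss followed by 3 millisecond digits.
  // Built with a builder so that parsing also works on Java 8.
  static final DateTimeFormatter COMMIT_TIME = new DateTimeFormatterBuilder()
      .appendPattern("yyyyMMddHHmmss")
      .appendValue(ChronoField.MILLI_OF_SECOND, 3)
      .toFormatter();

  // Fix option 1: compute the cutoff in the zone that commit times are written in
  // (hypothetically read from the table config), not ZoneId.systemDefault().
  static String cutoff(Instant now, int hoursRetained, ZoneId timelineZone) {
    return ZonedDateTime.ofInstant(now, timelineZone)
        .minusHours(hoursRetained)
        .format(COMMIT_TIME);
  }

  // Fix option 2: derive the cutoff from the latest completed commit time,
  // which needs no time zone at all.
  static String cutoffFromLatestCommit(String latestCommitTime, int hoursRetained) {
    return LocalDateTime.parse(latestCommitTime, COMMIT_TIME)
        .minusHours(hoursRetained)
        .format(COMMIT_TIME);
  }

  // First commit at or after the cutoff; commit times are assumed sorted ascending.
  // Fixed-width digit strings compare correctly as plain strings.
  static Optional<String> earliestCommitToRetain(List<String> commitTimes, String cutoff) {
    return commitTimes.stream().filter(t -> t.compareTo(cutoff) >= 0).findFirst();
  }

  public static void main(String[] args) {
    Instant now = Instant.parse("2023-05-01T12:00:00Z");
    // Commit times written in UTC.
    List<String> commits = Arrays.asList(
        "20230501090000000", "20230501103000000", "20230501113000000");

    // Cutoff computed in the timeline zone (UTC): 10:00, so the 10:30 commit is retained.
    String utcCutoff = cutoff(now, 2, ZoneOffset.UTC);
    System.out.println(earliestCommitToRetain(commits, utcCutoff));

    // Cutoff computed in a zone one hour ahead of UTC: 11:00, so the 10:30 commit
    // falls outside the retained range even though it is under 2 hours old.
    String skewedCutoff = cutoff(now, 2, ZoneOffset.ofHours(1));
    System.out.println(earliestCommitToRetain(commits, skewedCutoff));

    // Cutoff derived from the latest commit (11:30 minus 2 hours = 09:30).
    String fromLatest = cutoffFromLatestCommit("20230501113000000", 2);
    System.out.println(earliestCommitToRetain(commits, fromLatest));
  }
}
```

In this sketch the second option measures the retention window back from the
last write instead of from wall-clock time, so a table with no recent commits
would retain older commits than it would under the first option.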



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
