[ 
https://issues.apache.org/jira/browse/HUDI-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17309273#comment-17309273
 ] 

Raymond Xu commented on HUDI-1723:
----------------------------------

[~vinoth], as we discussed, a quick fix is to ignore the source limit and keep 
reading until the modification date changes. This could still be an issue when 
many data files are written within a short timeframe. We saw that S3's last 
modified date has only second-level precision, so many files are likely to 
share the same modification date. Exceeding the source limit in that case could 
lead to OOM. I agree the quick fix would work in most cases. Do you think it's 
worth working on a more reliable checkpointing logic, such as modification 
date + filename? 
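
To make the "mod date + filename" idea concrete, here is a minimal sketch. It 
is not the Hudi implementation: FileEntry, Checkpoint, selectBatch and 
checkpointOf are hypothetical names, and the listing is a plain in-memory list 
rather than a real file system listing. Files are ordered by (modification 
time, path), and the checkpoint records both values of the last file read, so a 
batch can stop at the source limit in the middle of a group of files that share 
a timestamp and resume from the right place.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class CompositeCheckpointSketch {

    /** Hypothetical stand-in for a listed file: path, modification time, size in bytes. */
    static final class FileEntry {
        final String path;
        final long modTime;
        final long size;

        FileEntry(String path, long modTime, long size) {
            this.path = path;
            this.modTime = modTime;
            this.size = size;
        }
    }

    /** Hypothetical composite checkpoint: modification time and path of the last file read. */
    static final class Checkpoint {
        final long modTime;
        final String path;

        Checkpoint(long modTime, String path) {
            this.modTime = modTime;
            this.path = path;
        }
    }

    /** Orders files by modification time, breaking ties by path. */
    static final Comparator<FileEntry> ORDER = (x, y) -> {
        int byTime = Long.compare(x.modTime, y.modTime);
        return byTime != 0 ? byTime : x.path.compareTo(y.path);
    };

    /** True if the file sorts strictly after the checkpoint in (modTime, path) order. */
    static boolean isAfter(FileEntry f, Checkpoint cp) {
        if (cp == null) {
            return true; // no checkpoint yet: every file is eligible
        }
        if (f.modTime != cp.modTime) {
            return f.modTime > cp.modTime;
        }
        return f.path.compareTo(cp.path) > 0;
    }

    /** Selects files after the checkpoint, in order, until the next file would exceed sourceLimit. */
    static List<FileEntry> selectBatch(List<FileEntry> listing, Checkpoint cp, long sourceLimit) {
        List<FileEntry> eligible = new ArrayList<>();
        for (FileEntry f : listing) {
            if (isAfter(f, cp)) {
                eligible.add(f);
            }
        }
        eligible.sort(ORDER);
        List<FileEntry> batch = new ArrayList<>();
        long total = 0;
        for (FileEntry f : eligible) {
            if (total + f.size > sourceLimit) {
                break;
            }
            batch.add(f);
            total += f.size;
        }
        return batch;
    }

    /** The new checkpoint is the last file of the batch, or the previous one if nothing was read. */
    static Checkpoint checkpointOf(List<FileEntry> batch, Checkpoint previous) {
        if (batch.isEmpty()) {
            return previous;
        }
        FileEntry last = batch.get(batch.size() - 1);
        return new Checkpoint(last.modTime, last.path);
    }

    static List<String> paths(List<FileEntry> batch) {
        List<String> out = new ArrayList<>();
        for (FileEntry f : batch) {
            out.add(f.path);
        }
        return out;
    }

    public static void main(String[] args) {
        // Three files share modification time 1000; the source limit fits two files per batch.
        List<FileEntry> listing = Arrays.asList(
            new FileEntry("a", 1000L, 10L),
            new FileEntry("b", 1000L, 10L),
            new FileEntry("c", 1000L, 10L),
            new FileEntry("d", 2000L, 10L));

        List<FileEntry> first = selectBatch(listing, null, 20L);
        Checkpoint cp = checkpointOf(first, null);
        List<FileEntry> second = selectBatch(listing, cp, 20L);

        System.out.println(paths(first));  // [a, b]
        System.out.println(paths(second)); // [c, d]
    }
}
```

One caveat with this sketch: it assumes no file shows up later with the same 
modification time as the checkpoint but a path that sorts before the 
checkpointed path. Such a file would still be skipped.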

 

> DFSPathSelector skips files with the same modify date when read up to source 
> limit
> ----------------------------------------------------------------------------------
>
>                 Key: HUDI-1723
>                 URL: https://issues.apache.org/jira/browse/HUDI-1723
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: DeltaStreamer
>            Reporter: Raymond Xu
>            Priority: Critical
>             Fix For: 0.9.0
>
>         Attachments: Screen Shot 2021-03-26 at 1.42.42 AM.png
>
>
> org.apache.hudi.utilities.sources.helpers.DFSPathSelector#listEligibleFiles 
> filters the input files based on the last saved checkpoint, which is the 
> modification date of the last file read. However, multiple files can share 
> that modification date, so some of them are skipped when reading up to the 
> source limit. An illustration is shown in the attached picture.
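
The skipping described in the issue can be reproduced with a simplified model. 
This is not the DFSPathSelector code: FileEntry, selectBatch and checkpointOf 
are hypothetical names, and the sketch assumes the filter keeps only files 
whose modification time is strictly greater than the checkpoint. With three 
files sharing one modification time and a source limit that fits only two of 
them, the third file is never selected.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class TimestampCheckpointSkipSketch {

    /** Hypothetical stand-in for a listed file: path, modification time, size in bytes. */
    static final class FileEntry {
        final String path;
        final long modTime;
        final long size;

        FileEntry(String path, long modTime, long size) {
            this.path = path;
            this.modTime = modTime;
            this.size = size;
        }
    }

    /** Keeps files strictly newer than the checkpoint, oldest first, up to sourceLimit bytes. */
    static List<FileEntry> selectBatch(List<FileEntry> listing, long checkpoint, long sourceLimit) {
        List<FileEntry> eligible = new ArrayList<>();
        for (FileEntry f : listing) {
            if (f.modTime > checkpoint) { // assumption: strict comparison against the checkpoint
                eligible.add(f);
            }
        }
        eligible.sort(Comparator.comparingLong((FileEntry f) -> f.modTime));
        List<FileEntry> batch = new ArrayList<>();
        long total = 0;
        for (FileEntry f : eligible) {
            if (total + f.size > sourceLimit) {
                break;
            }
            batch.add(f);
            total += f.size;
        }
        return batch;
    }

    /** The checkpoint is only the modification time of the last file read. */
    static long checkpointOf(List<FileEntry> batch, long previous) {
        return batch.isEmpty() ? previous : batch.get(batch.size() - 1).modTime;
    }

    static List<String> paths(List<FileEntry> batch) {
        List<String> out = new ArrayList<>();
        for (FileEntry f : batch) {
            out.add(f.path);
        }
        return out;
    }

    public static void main(String[] args) {
        List<FileEntry> listing = Arrays.asList(
            new FileEntry("a", 1000L, 10L),
            new FileEntry("b", 1000L, 10L),
            new FileEntry("c", 1000L, 10L),
            new FileEntry("d", 2000L, 10L));

        List<FileEntry> first = selectBatch(listing, 0L, 20L);
        long checkpoint = checkpointOf(first, 0L); // 1000
        List<FileEntry> second = selectBatch(listing, checkpoint, 20L);

        System.out.println(paths(first));  // [a, b]
        System.out.println(paths(second)); // [d], so "c" is never read
    }
}
```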



