[
https://issues.apache.org/jira/browse/HUDI-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17314571#comment-17314571
]
Raymond Xu commented on HUDI-1723:
----------------------------------
[~shivnarayan] sure. I've actually encountered this issue in production. In S3,
modification times have only second-level precision, which makes it very likely
that multiple objects share the same modification time (e.g., when reprocessing
a large amount of historical data). So I think the severity is fairly high.
> DFSPathSelector skips files with the same modify date when read up to source
> limit
> ----------------------------------------------------------------------------------
>
> Key: HUDI-1723
> URL: https://issues.apache.org/jira/browse/HUDI-1723
> Project: Apache Hudi
> Issue Type: Bug
> Components: DeltaStreamer
> Reporter: Raymond Xu
> Priority: Critical
> Labels: sev:critical, user-support-issues
> Fix For: 0.9.0
>
> Attachments: Screen Shot 2021-03-26 at 1.42.42 AM.png
>
>
> org.apache.hudi.utilities.sources.helpers.DFSPathSelector#listEligibleFiles
> filters the input files based on the last saved checkpoint, which is the
> modification time of the last file read. However, several files can share
> that modification time, so when a batch stops at the source limit, the
> remaining files with the same modification time are skipped on the next read.
> An illustration is shown in the attached picture.
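
The skipping behaviour described above can be reproduced with a small, simplified model. This is only a sketch in Python, not Hudi's Java code: `FileStatus`, `select_buggy`, `select_with_ties`, `read_all`, and the sample files are all hypothetical names and data, and the exact limit check is an assumption. The second selector shows one possible mitigation (keep reading files that tie with the last selected modification time); it is not claimed to be the fix that shipped.

```python
from typing import Callable, List, NamedTuple, Tuple


class FileStatus(NamedTuple):
    path: str
    mod_time: int  # modification time, in whole seconds
    size: int      # length in bytes


def select_buggy(files: List[FileStatus], checkpoint: int,
                 source_limit: int) -> Tuple[List[FileStatus], int]:
    """Pick files newer than the checkpoint until the byte limit is reached."""
    eligible = sorted((f for f in files if f.mod_time > checkpoint),
                      key=lambda f: f.mod_time)
    batch, total = [], 0
    for f in eligible:
        if total + f.size >= source_limit:
            break  # may stop in the middle of a group of equal mod_time
        batch.append(f)
        total += f.size
    # The new checkpoint is the modification time of the last file read.
    return batch, (batch[-1].mod_time if batch else checkpoint)


def select_with_ties(files: List[FileStatus], checkpoint: int,
                     source_limit: int) -> Tuple[List[FileStatus], int]:
    """Same as above, but never split a group of files sharing a mod_time."""
    eligible = sorted((f for f in files if f.mod_time > checkpoint),
                      key=lambda f: f.mod_time)
    batch, total = [], 0
    for f in eligible:
        over_limit = total + f.size >= source_limit
        # Only stop once the next file is strictly newer than the last one read.
        if over_limit and (not batch or f.mod_time > batch[-1].mod_time):
            break
        batch.append(f)
        total += f.size
    return batch, (batch[-1].mod_time if batch else checkpoint)


def read_all(select: Callable, files: List[FileStatus],
             source_limit: int) -> List[str]:
    """Call the selector repeatedly until it returns an empty batch."""
    checkpoint, seen = 0, []
    while True:
        batch, checkpoint = select(files, checkpoint, source_limit)
        if not batch:
            return seen
        seen.extend(f.path for f in batch)


# Hypothetical input: "b" and "c" share the same modification time.
FILES = [
    FileStatus("a", 100, 10),
    FileStatus("b", 200, 10),
    FileStatus("c", 200, 10),
    FileStatus("d", 300, 10),
]

print(read_all(select_buggy, FILES, 25))      # "c" is never read
print(read_all(select_with_ties, FILES, 25))  # all four files are read
```

With a source limit of 25 bytes, the first selector stops after "b", saves 200 as the checkpoint, and the strict `>` filter then excludes "c" on the next round. The tie-aware variant may exceed the source limit slightly in exchange for never splitting a group of files with equal modification times.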