[ https://issues.apache.org/jira/browse/HUDI-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339865#comment-17339865 ]
Raymond Xu commented on HUDI-1723:
----------------------------------
As suggested by [~vinoth] last time, a new implementation of DFSSource
specifically for S3 would be better (say, ParquetS3Source). Its checkpointing
can leverage S3 access logs to accurately record the last file read under a
given S3 prefix, instead of relying on modification time, which causes race
conditions and file skipping. This is a more severe issue on S3, given that its
timestamps have only second-level precision.
> DFSPathSelector skips files with the same modify date when read up to source
> limit
> ----------------------------------------------------------------------------------
>
> Key: HUDI-1723
> URL: https://issues.apache.org/jira/browse/HUDI-1723
> Project: Apache Hudi
> Issue Type: Bug
> Components: DeltaStreamer
> Reporter: Raymond Xu
> Assignee: Raymond Xu
> Priority: Blocker
> Labels: pull-request-available, sev:critical, user-support-issues
> Fix For: 0.9.0
>
> Attachments: Screen Shot 2021-03-26 at 1.42.42 AM.png
>
>
> org.apache.hudi.utilities.sources.helpers.DFSPathSelector#listEligibleFiles
> filters the input files based on the last saved checkpoint, which is the
> modification time of the last file read. However, multiple files can share
> that modification time, so some of them are skipped when reading up to the
> source limit. An illustration is shown in the attached picture.
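
The skipping described above can be reproduced with a small model. This is
only a sketch, not the Hudi code: FileStatus, select_batch, read_all, the file
names, and the timestamps are all hypothetical, and the selector is assumed to
keep files strictly newer than the checkpoint, read them in modification-time
order until the source limit is reached, and save the last read file's
modification time as the new checkpoint.

```python
from typing import List, NamedTuple, Tuple


class FileStatus(NamedTuple):
    path: str
    mtime: int  # modification time in whole seconds (made-up values)
    size: int   # file size in bytes


def select_batch(
    files: List[FileStatus], checkpoint: int, source_limit: int
) -> Tuple[List[FileStatus], int]:
    """Simplified model of a checkpoint-by-modification-time selector."""
    # Assumption: only files strictly newer than the checkpoint are eligible.
    eligible = sorted(
        (f for f in files if f.mtime > checkpoint),
        key=lambda f: (f.mtime, f.path),
    )
    batch: List[FileStatus] = []
    total = 0
    for f in eligible:
        # Stop once adding the next file would exceed the source limit.
        if batch and total + f.size > source_limit:
            break
        batch.append(f)
        total += f.size
    # The new checkpoint is the modification time of the last file read.
    new_checkpoint = batch[-1].mtime if batch else checkpoint
    return batch, new_checkpoint


def read_all(files: List[FileStatus], source_limit: int) -> List[str]:
    """Run the selector repeatedly until it returns no more files."""
    read: List[str] = []
    checkpoint = 0
    while True:
        batch, checkpoint = select_batch(files, checkpoint, source_limit)
        if not batch:
            return read
        read.extend(f.path for f in batch)


files = [
    FileStatus("a.parquet", 100, 10),
    FileStatus("b.parquet", 200, 10),
    FileStatus("c.parquet", 200, 10),  # same modification time as b.parquet
    FileStatus("d.parquet", 300, 10),
]

read = read_all(files, source_limit=20)
skipped = sorted({f.path for f in files} - set(read))
print("read:", read)        # read: ['a.parquet', 'b.parquet', 'd.parquet']
print("skipped:", skipped)  # skipped: ['c.parquet']
```

In this model the first batch stops after b.parquet because of the source
limit and the checkpoint becomes 200. The next run keeps only files with a
modification time greater than 200, so c.parquet is never read.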
--
This message was sent by Atlassian Jira
(v8.3.4#803005)