[ https://issues.apache.org/jira/browse/HUDI-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17309282#comment-17309282 ]
Raymond Xu commented on HUDI-1723:
----------------------------------
[~vinoth] Related to this, I think it may be time to create another path
selector, say `DFSRangePathSelector`, as the default one for DeltaStreamer. It
would combine the merits of DatePathSelector with reliable checkpointing logic.
The current DFSPathSelector scans the whole source base path on every run, which
could incur high cloud billing costs without any obvious warning to users. The
name `DFSRangePathSelector` highlights the fact that the input path is scanned
in a moving window, and it should be generic enough to handle date ranges and
other kinds of ranges.
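To make the moving-window idea concrete, here is a rough sketch, assuming the
input is laid out in date-named directories (yyyy-MM-dd) under the base path.
The class `RangePathWindowSketch`, the method `windowPaths`, and the
`lookbackDays` parameter are hypothetical names for illustration only; they are
not part of Hudi and no `DFSRangePathSelector` exists yet.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of a moving-window path listing; not Hudi code.
class RangePathWindowSketch {

    // Returns the directories covering [endDate - lookbackDays, endDate],
    // oldest first, so only those paths need to be listed on each run.
    static List<String> windowPaths(String basePath, LocalDate endDate, int lookbackDays) {
        if (lookbackDays < 0) {
            throw new IllegalArgumentException("lookbackDays must be >= 0");
        }
        List<String> paths = new ArrayList<>();
        for (int i = lookbackDays; i >= 0; i--) {
            // LocalDate.toString() yields the ISO form, e.g. 2021-03-26.
            paths.add(basePath + "/" + endDate.minusDays(i));
        }
        return paths;
    }

    public static void main(String[] args) {
        // Assumed example bucket and date; prints three directory paths.
        System.out.println(windowPaths("s3://bucket/input", LocalDate.of(2021, 3, 26), 2));
    }
}
```

With a window like this, each run lists a bounded number of directories instead
of the whole base path; a real selector would still need checkpointing inside
the window.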
> DFSPathSelector skips files with the same modify date when read up to source
> limit
> ----------------------------------------------------------------------------------
>
> Key: HUDI-1723
> URL: https://issues.apache.org/jira/browse/HUDI-1723
> Project: Apache Hudi
> Issue Type: Bug
> Components: DeltaStreamer
> Reporter: Raymond Xu
> Priority: Critical
> Fix For: 0.9.0
>
> Attachments: Screen Shot 2021-03-26 at 1.42.42 AM.png
>
>
> org.apache.hudi.utilities.sources.helpers.DFSPathSelector#listEligibleFiles
> filters the input files based on the last saved checkpoint, which is the
> modification date of the last read file. However, multiple files can share
> that modification date, so when reading stops at the source limit, the
> remaining files with the same date are skipped on the next run. An
> illustration is shown in the attached picture.
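
The skip described above can be reproduced with a simplified model. This is a
sketch under assumptions, not Hudi's actual code: `CheckpointSkipDemo`,
`FileInfo`, `Batch`, `select`, and the `keepTies` flag are hypothetical names,
and the file sizes and timestamps are made up. With `keepTies` set to false the
model stops at the source limit and loses file `c`; with `keepTies` set to true
it shows one possible mitigation, which reads past the limit while files share
the last picked modification time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Simplified, hypothetical model of checkpointing by modification time.
class CheckpointSkipDemo {

    static final class FileInfo {
        final String name;
        final long modTime;
        final long size;
        FileInfo(String name, long modTime, long size) {
            this.name = name;
            this.modTime = modTime;
            this.size = size;
        }
    }

    static final class Batch {
        final List<String> names;
        final long checkpoint;
        Batch(List<String> names, long checkpoint) {
            this.names = names;
            this.checkpoint = checkpoint;
        }
    }

    // Picks files strictly newer than the checkpoint, oldest first, until the
    // next file would push the batch over sourceLimit bytes. When keepTies is
    // true, files sharing the last picked modTime are still included.
    static Batch select(List<FileInfo> files, long checkpoint, long sourceLimit, boolean keepTies) {
        List<FileInfo> eligible = new ArrayList<>();
        for (FileInfo f : files) {
            if (f.modTime > checkpoint) {
                eligible.add(f);
            }
        }
        eligible.sort(Comparator.comparingLong((FileInfo f) -> f.modTime));

        List<String> picked = new ArrayList<>();
        long bytes = 0;
        long newCheckpoint = checkpoint;
        for (FileInfo f : eligible) {
            boolean overLimit = bytes + f.size > sourceLimit;
            boolean tie = keepTies && f.modTime == newCheckpoint;
            if (overLimit && !tie) {
                break;
            }
            picked.add(f.name);
            bytes += f.size;
            newCheckpoint = f.modTime;
        }
        return new Batch(picked, newCheckpoint);
    }

    // Made-up input: b and c share modification time 200.
    static List<FileInfo> sampleFiles() {
        return Arrays.asList(
            new FileInfo("a", 100, 10),
            new FileInfo("b", 200, 10),
            new FileInfo("c", 200, 10),
            new FileInfo("d", 300, 10));
    }

    public static void main(String[] args) {
        List<FileInfo> files = sampleFiles();
        Batch first = select(files, 0, 20, false);
        Batch second = select(files, first.checkpoint, 20, false);
        System.out.println(first.names + " then " + second.names);  // [a, b] then [d]
        Batch fixed = select(files, 0, 20, true);
        System.out.println(fixed.names);                            // [a, b, c]
    }
}
```

In the first case the checkpoint becomes 200 after reading `a` and `b`, so the
next run only considers files newer than 200 and `c` is never read.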