hotienvu commented on a change in pull request #2157:
URL: https://github.com/apache/hudi/pull/2157#discussion_r503778038
##########
File path:
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/DFSPathSelector.java
##########
@@ -119,4 +103,25 @@ public DFSPathSelector(TypedProperties props, Configuration hadoopConf) {
       throw new HoodieIOException("Unable to read from source from checkpoint: " + lastCheckpointStr, ioe);
     }
   }
+
+  /**
+   * List files recursively, filtering out ineligible files/directories along the way.
+   */
+  private List<FileStatus> listEligibleFiles(FileSystem fs, Path path, long lastCheckpointTime) throws IOException {
+    // skip files/dirs whose names start with (_, ., etc)
+    FileStatus[] statuses = fs.listStatus(path, file ->
Review comment:
   Hi Vinoth, thanks for the feedback. Unfortunately the old `listFiles(path, true)` returns everything, so we would have to check all files under the path. The logic would also be more complicated, because we would have to check every sub-directory along the path as well, e.g. `foo/_bar/file`. On the other hand, walking recursively allows us to exit early as soon as a file/directory name doesn't match. It will also let us do more directory skipping (based on modification time) in the future.
   Under the hood, `listFiles` also calls `listStatus` recursively, so performance-wise there should be no impact.
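   The early-exit walk described above can be sketched as a self-contained `java.nio` version (the class and helper names here are hypothetical; Hudi's actual implementation uses Hadoop's `FileSystem`/`FileStatus` API, and the `_`/`.` prefix filter is an assumption based on the filter mentioned in the diff):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class EligibleFileLister {

  // Names starting with '_' or '.' are treated as hidden/ineligible,
  // mirroring the filter described in the review thread.
  private static boolean isEligible(Path p) {
    String name = p.getFileName().toString();
    return !name.startsWith("_") && !name.startsWith(".");
  }

  /**
   * Walk the tree recursively, pruning ineligible directories early so their
   * contents are never visited, and keep only files modified after the
   * given checkpoint time.
   */
  public static List<Path> listEligibleFiles(Path root, long lastCheckpointTime) throws IOException {
    List<Path> result = new ArrayList<>();
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(root)) {
      for (Path entry : stream) {
        if (!isEligible(entry)) {
          continue; // early exit: skip hidden files and entire hidden subtrees
        }
        if (Files.isDirectory(entry)) {
          result.addAll(listEligibleFiles(entry, lastCheckpointTime));
        } else if (Files.getLastModifiedTime(entry).toMillis() > lastCheckpointTime) {
          result.add(entry);
        }
      }
    }
    return result;
  }

  public static void main(String[] args) throws IOException {
    // Tiny tree: root/a.txt, root/_tmp/b.txt (pruned), root/sub/.hidden (pruned), root/sub/c.txt
    Path root = Files.createTempDirectory("lister");
    Files.writeString(root.resolve("a.txt"), "a");
    Path tmp = Files.createDirectories(root.resolve("_tmp"));
    Files.writeString(tmp.resolve("b.txt"), "b");
    Path sub = Files.createDirectories(root.resolve("sub"));
    Files.writeString(sub.resolve(".hidden"), "h");
    Files.writeString(sub.resolve("c.txt"), "c");

    List<Path> files = listEligibleFiles(root, 0L);
    System.out.println(files.size()); // prints 2: a.txt and sub/c.txt
  }
}
```

   Because `_tmp` is rejected before recursion, `b.txt` is never even stat-ed, which is the early-exit benefit the comment describes over a flat `listFiles(path, true)` scan.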
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]