Github user steveloughran commented on a diff in the pull request:
https://github.com/apache/spark/pull/17745#discussion_r212391371
--- Diff: streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala ---
@@ -196,29 +191,29 @@ class FileInputDStream[K, V, F <: NewInputFormat[K, V]](
     logDebug(s"Getting new files for time $currentTime, " +
       s"ignoring files older than $modTimeIgnoreThreshold")
-    val newFileFilter = new PathFilter {
-      def accept(path: Path): Boolean = isNewFile(path, currentTime, modTimeIgnoreThreshold)
-    }
-    val directoryFilter = new PathFilter {
-      override def accept(path: Path): Boolean = fs.getFileStatus(path).isDirectory
-    }
-    val directories = fs.globStatus(directoryPath, directoryFilter).map(_.getPath)
+    val directories = Option(fs.globStatus(directoryPath)).getOrElse(Array.empty[FileStatus])
--- End diff --
globStatus is flawed; its key limitation is that it does a tree walk. It needs to be replaced with an object-store-specific flat listing. See [HADOOP-13371](https://issues.apache.org/jira/browse/HADOOP-13371).

The issue with implementing an s3a flat-list-and-filter is that when the wildcard sits a few levels above the child paths and there are lots of children, e.g.
```
s3a://bucket/data/year=201?/month=*/day=*/
```
then, because there are many files under the year/month/day entries, every one of them gets listed during the filter.
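To make that concrete, here is a minimal sketch of a client-side flat-list-and-filter. `FileSystem.listFiles(path, recursive = true)` and `GlobPattern` are real hadoop-common APIs; the helper name and the way the pattern is applied here are my assumptions, not the HADOOP-13371 design:

```scala
import org.apache.hadoop.fs.{FileSystem, GlobPattern, LocatedFileStatus, Path, RemoteIterator}
import scala.collection.mutable.ArrayBuffer

// Sketch only. listFiles(path, recursive = true) maps to a paged flat LIST on
// s3a (one HTTP round trip per few thousand keys, rather than one LIST per
// directory in the tree walk) -- but it still enumerates *every* object under
// the prefix, which is exactly the blow-up described above.
// Note: GlobPattern's "*" compiles to ".*", so unlike globStatus it can match
// across "/" boundaries; moving off globStatus changes the matching semantics.
def flatListAndFilter(fs: FileSystem, prefix: Path, glob: String): Seq[Path] = {
  val pattern = new GlobPattern(glob)
  val matches = ArrayBuffer[Path]()
  val it: RemoteIterator[LocatedFileStatus] = fs.listFiles(prefix, true)
  while (it.hasNext) {
    val p = it.next().getPath
    if (pattern.matches(p.toUri.getPath)) matches += p
  }
  matches.toSeq
}
```

The cross-`/` matching noted in the comment is the "will change matching" point in the options below.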
What I think would need to be done is to make it possible to configure the FS to limit the depth at which it switches to bulk listing; here I could say "depth=2", so the year=? level would be matched via globbing, while the month= and day= levels would be handled by the flat listing.
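As a sketch of that hybrid, assuming a hypothetical `depth` knob (no such FS config exists today), you could split the pattern at `depth` components, glob the shallow part, and flat-list below each match:

```scala
import org.apache.hadoop.fs.{FileStatus, FileSystem, GlobPattern, Path}
import scala.collection.mutable.ArrayBuffer

// Hypothetical hybrid, not an existing Hadoop API. With depth = 2 and the
// pattern /data/year=201?/month=*/day=*, "/data/year=201?" is resolved by
// globStatus (cheap: a handful of directories), and the month=/day= levels
// under each matched year come from one flat listFiles() call per year,
// filtered client-side.
def depthLimitedGlob(fs: FileSystem, globPath: Path, depth: Int): Seq[Path] = {
  val parts = globPath.toUri.getPath.split("/").filter(_.nonEmpty)
  val (globbed, rest) = parts.splitAt(depth)
  val roots: Array[FileStatus] =
    Option(fs.globStatus(new Path("/" + globbed.mkString("/"))))
      .getOrElse(Array.empty[FileStatus])
  val remainder = new GlobPattern("/" + rest.mkString("/"))
  roots.toSeq.flatMap { root =>
    val found = ArrayBuffer[Path]()
    val it = fs.listFiles(root.getPath, true)
    while (it.hasNext) {
      val p = it.next().getPath
      // match the suffix below the globbed root against the rest of the pattern
      val rel = p.toUri.getPath.stripPrefix(root.getPath.toUri.getPath)
      if (remainder.matches(rel)) found += p
    }
    found.toSeq
  }
}
```

With depth=2, the glob phase only touches the year= directories, and each flat list is one paged LIST per matched year instead of one per day.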
Or maybe just start by making the whole thing optional, and let the caller deal with it.
Anyway, options here:
* Fix the Hadoop-side call. Nice and broadly useful.
* See if Spark can be moved off the globStatus call. That will change the matching semantics, but if you provide a new "cloudstore" connector, that could be done, couldn't it?
---