Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17745#discussion_r212391371
  
    --- Diff: 
streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala
 ---
    @@ -196,29 +191,29 @@ class FileInputDStream[K, V, F <: NewInputFormat[K, 
V]](
           logDebug(s"Getting new files for time $currentTime, " +
             s"ignoring files older than $modTimeIgnoreThreshold")
     
    -      val newFileFilter = new PathFilter {
    -        def accept(path: Path): Boolean = isNewFile(path, currentTime, 
modTimeIgnoreThreshold)
    -      }
    -      val directoryFilter = new PathFilter {
    -        override def accept(path: Path): Boolean = 
fs.getFileStatus(path).isDirectory
    -      }
    -      val directories = fs.globStatus(directoryPath, 
directoryFilter).map(_.getPath)
    +      val directories = 
Option(fs.globStatus(directoryPath)).getOrElse(Array.empty[FileStatus])
    --- End diff --
    
    globStatus is flawed; its key limitation is that it does a tree walk. It needs to be replaced with an object-store-specific flat listing. See [HADOOP-13371](https://issues.apache.org/jira/browse/HADOOP-13371).
    
    The issue with implementing an s3a flat-list and filter is that the wildcard may sit a few levels above the child paths, with lots of children underneath, e.g.
    
    ```
    s3a://bucket/data/year=201?/month=*/day=*/
    ```
    
    then if there are many files under the year/month/day entries, every one of them gets listed before the filter is applied.
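    The flat-list-and-filter idea can be sketched in plain Scala, simulating the object store's key listing with a `Seq[String]`; `flatGlob` and the simple glob-to-regex translation below are illustrative, not Hadoop APIs:

    ```scala
    object FlatListFilterSketch {

      // Translate a simple path glob (only '*' and '?' supported here) into a
      // regex; '*' and '?' match within a single path component, as in
      // Hadoop's glob syntax. Other glob features (brackets, braces) omitted.
      def globToRegex(glob: String): scala.util.matching.Regex =
        glob.flatMap {
          case '*' => "[^/]*"
          case '?' => "[^/]"
          case c if "\\.[]{}()+-^$|".contains(c) => "\\" + c
          case c => c.toString
        }.r

      // A flat LIST returns every key under the prefix; the glob is only
      // applied client-side afterwards, so the listing cost is proportional
      // to the total number of keys, not to the number of matches.
      def flatGlob(keys: Seq[String], glob: String): Seq[String] = {
        val re = globToRegex(glob)
        keys.filter(k => re.pattern.matcher(k).matches())
      }
    }
    ```

    With keys under both `year=1999` and `year=2017`, `flatGlob(keys, "data/year=201?/month=*/day=*/part-*")` still iterates every key, then discards the 1999 ones in the filter — which is exactly the cost problem above.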
    
    What I think would need to be done is make the FS configurable with a depth at which it switches to bulk listing; here I could say "depth=2", so the year=? component would still be matched by globbing, while the month= and day= levels would go through a flat list-and-filter.
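    The depth idea, as a sketch: split the glob at a configured depth, keep the tree walk for the head, and flat-list-and-filter under each directory the head matches. `splitGlob` and the `depth` parameter are hypothetical, not an existing Hadoop option:

    ```scala
    object DepthSplitSketch {

      // Split a glob into (head, tail) at `depth` path components. The head
      // ("data/year=201?" for depth=2) would be resolved by the existing
      // per-directory glob walk; the tail ("month=*/day=*") by one flat
      // list-and-filter per matched head directory.
      def splitGlob(glob: String, depth: Int): (String, String) = {
        val parts = glob.split('/')
        (parts.take(depth).mkString("/"), parts.drop(depth).mkString("/"))
      }
    }
    ```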
    
    Or maybe just start with making the whole thing optional, and let the 
caller deal with it.
    
    Anyway, the options here:
    
    * fix the Hadoop-side call. Nice and broadly useful.
    * see if Spark can be moved off the globStatus call. That would change matching behaviour, but if you provide a new "cloudstore" connector, that could be done, couldn't it?


---
