Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14731#discussion_r76514866
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala 
---
    @@ -244,6 +244,31 @@ class SparkHadoopUtil extends Logging {
       }
     
       /**
    +   * List directories/files matching the path and return the `FileStatus` 
results.
    +   * If the pattern is not a regexp then a simple `getFileStatus(pattern)`
    +   * is called to get the status of that path.
    +   * If the path/pattern does not match anything in the filesystem,
    +   * an empty sequence is returned.
    +   * @param pattern pattern
    +   * @return a possibly empty array of FileStatus entries
    +   */
    +  def globToFileStatus(pattern: Path): Array[FileStatus] = {
    --- End diff --
    
    It goes with the `globPathIfNecessary` call, so you want to rename it to be consistent.
    
    
    Regarding the FS APIs, there are way too many list operations, each with different flaws.
    
    1. The simple `list(path, filter): Array[FileStatus]` operations don't scale to a directory with hundreds of thousands of files, hence the remote-iterator versions.
    1. None of them provide any consistency guarantees, which is worth knowing. Inconsistency is more common with the remote iterators, as the iteration window is bigger, but even with those that return arrays, in a large enough directory things may change during the enumeration.
    1. Anything that treewalks is very suboptimal on blobstores and somewhat inefficient for deep trees.
    1. `listFiles(path, recursive=true)` is the sole one which object stores 
can currently optimise by avoiding the treewalk and just doing a bulk list. 
[HADOOP-13208](https://issues.apache.org/jira/browse/HADOOP-13208) has added 
that for S3A.
    1. ...but that method filters out all directories, which means that apps which also want directories are out of luck.
    1. `globStatus()` is even less efficient than the others; have a look at the source to see why.
    1. In [HADOOP-13371](https://issues.apache.org/jira/browse/HADOOP-13371) I'm exploring an optimised globber, but I don't want to write one which collapses at scale (i.e. in production).
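
    The remote-iterator pattern from points 1 and 4 can be sketched as below. To keep the snippet self-contained, the trait is a minimal stand-in for Hadoop's `org.apache.hadoop.fs.RemoteIterator` (the real one may throw `IOException`), and the drain helper is hypothetical, not a Hadoop or Spark API.

```scala
// Minimal stand-in for Hadoop's RemoteIterator[T], so this sketch compiles
// without Hadoop on the classpath. The real interface has the same shape.
trait RemoteIterator[T] {
  def hasNext: Boolean
  def next(): T
}

// Drain a remote iterator into an in-memory Seq. This is the usual pattern
// for consuming listFiles(path, recursive = true): the filesystem pages
// results incrementally rather than materialising the whole listing up front,
// so only collect into memory when you actually need all entries at once.
def drain[T](it: RemoteIterator[T]): Seq[T] = {
  val buf = scala.collection.mutable.ArrayBuffer.empty[T]
  while (it.hasNext) {
    buf += it.next()
  }
  buf.toSeq
}
```

    Note that collecting everything into an array defeats the scalability benefit; it is shown only to illustrate the iteration contract.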
    
    I've added some comments in HADOOP-13371 about what to do there; I will probably adopt the "no regexp -> simple return" strategy implemented in this patch. But it will only benefit S3A on Hadoop 2.8+, whereas patching Spark benefits everything.
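
    A sketch of that "no regexp -> simple return" check: if the path string contains none of the glob metacharacters, skip `globStatus()` entirely and issue a single `getFileStatus()` call. The helper name below is mine, and the metacharacter set is an assumption mirroring Hadoop's glob syntax (`{alternation}`, `[ranges]`, `*` and `?` wildcards, `\` escapes).

```scala
// Hypothetical helper: does this path string contain any glob metacharacters?
// If false, the path names at most one filesystem entry, so a plain
// getFileStatus() lookup suffices and the expensive glob machinery is skipped.
def containsGlobChars(pattern: String): Boolean =
  pattern.exists("{}[]*?\\".contains(_))
```

    A caller would branch on this: no metacharacters means one `getFileStatus()` (returning an empty result if the path is absent), otherwise fall back to the full glob.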


