[GitHub] spark issue #14731: [SPARK-17159] [streaming]: optimise check for new files ...

steveloughran Sat, 20 Aug 2016 05:54:03 -0700

Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/14731
  
    # I'm going to scan through and tune them elsewhere; really I'm going by 
uses of the listFiles calls
    
    There's actually no significant use elsewhere that I can see; just a couple 
of uses which filter on filename, so there is no cost penalty.
    
    * `SparkHadoopUtil.listLeafStatuses()` does implement its own directory 
recursion to find files; `FileSystem.listFiles(path, true)` does that, and on S3A 
will do a flat scan that is O(files/5000), with no directory overhead at all.
    * Otherwise, `globStatus()` can be pretty slow against object stores, but the 
fix there isn't in the client code; it means someone needs to implement 
[HADOOP-13371](https://issues.apache.org/jira/browse/HADOOP-13371), *S3A 
globber to use bulk listObject call over recursive directory scan*, or more 
specifically, an implementation scalable to production datasets.
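
As a rough sketch of the first point, a helper that delegates to `FileSystem.listFiles(path, true)` instead of recursing by hand might look like the following. The object and method names (`ListLeafFiles`, `listLeafFiles`) are hypothetical and not taken from the patch or from `SparkHadoopUtil`; only the Hadoop `FileSystem`, `Path`, `FileStatus` and `RemoteIterator` calls are real API.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper, for illustration only.
object ListLeafFiles {

  /** Collect every file under `root`, letting the filesystem do the recursion.
    * Object stores can answer listFiles(path, true) with a flat, paged listing
    * instead of walking each directory in turn. */
  def listLeafFiles(fs: FileSystem, root: Path): Seq[FileStatus] = {
    val files = ArrayBuffer.empty[FileStatus]
    // RemoteIterator is not a Scala Iterator, so drain it explicitly.
    val it = fs.listFiles(root, true)
    while (it.hasNext) {
      files += it.next()
    }
    files.toSeq
  }

  def main(args: Array[String]): Unit = {
    val root = new Path(args(0))
    val fs = root.getFileSystem(new Configuration())
    listLeafFiles(fs, root).foreach(s => println(s"${s.getPath} ${s.getLen}"))
  }
}
```

The results come back as `LocatedFileStatus` entries for files only, so callers that need the directories themselves would still have to list them separately.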
    
    Returning to this patch, should I cut out the caching? I think it is 
superfluous.




