Github user steveloughran commented on the issue:
https://github.com/apache/spark/pull/14731
I'm going to scan through and tune them elsewhere; really I'm going by
uses of the listFiles calls
There's actually no significant use elsewhere that I can see; just a couple
of uses which filter on filename, so there is no cost penalty.
* `SparkHadoopUtil.listLeafStatuses()` implements its own directory
recursion to find files; `FileSystem.listFiles(path, true)` already does that, and
on S3A it does a flat scan that is O(files/5000), with no directory overhead at all.
* Otherwise, globStatus() can be pretty slow against object stores, but the
fix there isn't in the client code; it means someone needs to implement
[HADOOP-13371](https://issues.apache.org/jira/browse/HADOOP-13371), *S3A
globber to use bulk listObject call over recursive directory scan*; more
specifically, an implementation scalable to production datasets.
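
To make the O(files/5000) point concrete, here is a rough back-of-the-envelope model in Python. It is only a sketch, not the S3A code: it assumes a recursive walk costs at least one list request per directory, that a flat scan returns up to 5000 entries per request, and the names (`tree_walk_calls`, `flat_scan_calls`, the `table` layout) are made up for illustration.

```python
import math

# Assumed number of entries returned per bulk listing request,
# taken from the files/5000 figure above.
PAGE_SIZE = 5000


def count_files(tree):
    """Total number of files under a directory tree."""
    return tree["files"] + sum(count_files(d) for d in tree["dirs"])


def tree_walk_calls(tree):
    """List requests for a recursive walk: one per directory visited."""
    return 1 + sum(tree_walk_calls(d) for d in tree["dirs"])


def flat_scan_calls(tree, page_size=PAGE_SIZE):
    """List requests for a flat scan: one per page of files, at least one."""
    return max(1, math.ceil(count_files(tree) / page_size))


# Hypothetical partitioned table: 100 partition directories, 10 files each.
table = {
    "files": 0,
    "dirs": [{"files": 10, "dirs": []} for _ in range(100)],
}

print(tree_walk_calls(table), flat_scan_calls(table))  # prints "101 1"
```

In this model the recursive walk scales with the number of directories while the flat scan scales only with the number of files, which is why deep partition trees are where the difference shows up.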
Returning to this patch, should I cut out the caching? I think it is
superfluous.