GitHub user petermaxlee opened a pull request:
https://github.com/apache/spark/pull/15235
[SPARK-17661][SQL] Consolidate various listLeafFiles implementations
## What changes were proposed in this pull request?
There are 4 listLeafFiles-related functions in Spark:
- ListingFileCatalog.listLeafFiles (which calls
HadoopFsRelation.listLeafFilesInParallel if the number of paths passed in is
greater than a threshold; otherwise it uses its own serial implementation)
- HadoopFsRelation.listLeafFiles (called only by
HadoopFsRelation.listLeafFilesInParallel)
- HadoopFsRelation.listLeafFilesInParallel (called only by
ListingFileCatalog.listLeafFiles)
This is confusing and error-prone because there are effectively two distinct
implementations of the serial version of listing leaf files. This patch
improves the code by:
- Moving all file listing code into ListingFileCatalog, since it is the only
class that needs it.
- Keeping only one function for listing files serially.
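
The intended shape can be illustrated with a small sketch. This is not the Spark code touched by this patch: it is a Python stand-in that walks the local filesystem, and every name in it (`list_leaf_files`, `list_leaf_files_serial`, `should_filter_out`, `PARALLEL_THRESHOLD`) as well as the threshold value and the filter rule are assumptions made for illustration. The point it tries to show is that the serial and parallel paths share one serial listing function, so there is only one implementation to keep correct.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Hypothetical threshold; the value Spark uses is not stated in this PR.
PARALLEL_THRESHOLD = 32


def should_filter_out(name):
    """Simplified stand-in filter (an assumption, not Spark's actual rule):
    skip hidden and underscore-prefixed entries."""
    return name.startswith("_") or name.startswith(".")


def list_leaf_files_serial(path):
    """The single serial implementation: recursively collect the leaf files
    under one directory, visiting entries in name order."""
    leaves = []
    for entry in sorted(os.scandir(path), key=lambda e: e.name):
        if should_filter_out(entry.name):
            continue
        if entry.is_dir():
            leaves.extend(list_leaf_files_serial(entry.path))
        else:
            leaves.append(entry.path)
    return leaves


def list_leaf_files(paths, threshold=PARALLEL_THRESHOLD):
    """Single entry point: list in parallel when the number of input paths
    exceeds the threshold, serially otherwise. Both branches call the same
    serial function, so results are identical either way."""
    paths = list(paths)
    if len(paths) > threshold:
        with ThreadPoolExecutor() as pool:
            per_path = list(pool.map(list_leaf_files_serial, paths))
    else:
        per_path = [list_leaf_files_serial(p) for p in paths]
    # Flatten while preserving the order of the input paths.
    return [f for files in per_path for f in files]
```

In this sketch the parallel branch only decides how the work is distributed; the listing logic itself lives in one place.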
## How was this patch tested?
This change should be covered by existing unit and integration tests. I
also moved a test case for HadoopFsRelation.shouldFilterOut from
HadoopFsRelationSuite to ListingFileCatalogSuite.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/petermaxlee/spark SPARK-17661
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/15235.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #15235
----
commit 2a76ec1da54fd19f5c8eb621ee5c823f69efe855
Author: petermaxlee <[email protected]>
Date: 2016-09-25T06:29:07Z
[SPARK-17661][SQL] Consolidate various listLeafFiles implementations
----