GitHub user ericl opened a pull request:

    [SPARK-18769] [SQL] Fix regression in file listing performance for non-catalog tables

    ## What changes were proposed in this pull request?
    In Spark 2.1, ListingFileCatalog was significantly refactored (and renamed to InMemoryFileIndex). This introduced a regression where parallelism could only be introduced at the very top of the tree. However, in many cases (e.g. non-catalog tables), the top of the tree is only a single directory.
    This PR simplifies and fixes the parallel recursive listing code to allow parallelism to be introduced at any level during recursive descent (though note that once we decide to list a sub-tree in parallel, that sub-tree is listed serially on executors).
    cc @mallman  @cloud-fan 
    ## How was this patch tested?
    Checked metrics in unit tests.
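
To illustrate the listing strategy described in the PR text, here is a minimal Python sketch. It is not the Spark implementation: the in-memory `TREE`, the `threshold` parameter, the `stats` counter, and the function names `list_serial` and `list_files` are all hypothetical, and a thread pool stands in for a distributed job. It only shows the idea that the walk can switch to parallel listing at whatever level has enough sub-directories, and that each sub-tree handed off this way is then walked serially.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "filesystem": directory path -> child names.
# Names ending in "/" are sub-directories; everything else is a file.
# The root holds a single directory, as in the non-catalog case.
TREE = {
    "/t/": ["year=2016/"],
    "/t/year=2016/": ["month=10/", "month=11/", "month=12/"],
    "/t/year=2016/month=10/": ["part-0", "part-1"],
    "/t/year=2016/month=11/": ["part-0"],
    "/t/year=2016/month=12/": ["part-0"],
}


def list_serial(path):
    """List every file under `path` with a plain serial recursive walk."""
    files = []
    for name in TREE[path]:
        child = path + name
        if name.endswith("/"):
            files.extend(list_serial(child))
        else:
            files.append(child)
    return files


def list_files(path, threshold, pool, stats):
    """List files under `path`, going parallel at the first level where
    the number of sub-directories reaches `threshold` (an assumed knob)."""
    children = [path + name for name in TREE[path]]
    files = [c for c in children if not c.endswith("/")]
    dirs = [c for c in children if c.endswith("/")]
    if len(dirs) >= threshold:
        # Fan out here; each sub-tree is then listed serially by a worker.
        stats["parallel_jobs"] += 1
        for sub_files in pool.map(list_serial, dirs):
            files.extend(sub_files)
    else:
        # Too few sub-directories to justify a job: keep descending on the
        # caller, so a deeper level can still trigger parallel listing.
        for d in dirs:
            files.extend(list_files(d, threshold, pool, stats))
    return files


stats = {"parallel_jobs": 0}
with ThreadPoolExecutor(max_workers=4) as pool:
    found = list_files("/t/", threshold=2, pool=pool, stats=stats)
```

In this toy tree the root has only one sub-directory, so a scheme that can parallelize only at the top would list everything serially. The sketch instead descends one level and fans out over the three `month=` directories.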

You can merge this pull request into a Git repository by running:

    $ git pull spark-18679

Alternatively you can review and apply these changes as the patch at:

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16112

commit 3102aa3f3e05871ab11658d2f0d9f2b2451f3a40
Author: Eric Liang <>
Date:   2016-12-02T02:35:11Z

    Thu Dec  1 18:35:11 PST 2016

commit db664396b3892de45507d9c82eed7d070bdd82dc
Author: Eric Liang <>
Date:   2016-12-02T02:37:23Z

    Thu Dec  1 18:37:23 PST 2016

