[GitHub] spark pull request #16112: [SPARK-18769] [SQL] Fix regression in file listin...

ericl Thu, 01 Dec 2016 18:44:39 -0800

GitHub user ericl opened a pull request:

    https://github.com/apache/spark/pull/16112


    [SPARK-18769] [SQL] Fix regression in file listing performance for 
non-catalog tables

    ## What changes were proposed in this pull request?
    
    In Spark 2.1 ListingFileCatalog was significantly refactored (and renamed 
to InMemoryFileIndex). This introduced a regression where parallelism could 
only be introduced at the very top of the tree. However, in many cases (e.g. 
non-catalog tables), the top of the tree is only a single directory.
    
    This PR simplifies and fixes the parallel recursive listing code to allow 
parallelism to be introduced at any level during recursive descent (though note 
that once we decide to list a sub-tree in parallel, that sub-tree is listed 
serially on the executors).
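
    The listing strategy described above can be illustrated with a rough 
sketch. This is not the Spark code: it is plain Python over a hypothetical 
in-memory directory tree, with a thread pool standing in for executors. The 
names `TREE`, `list_serial`, `list_leaf_files`, `list_table`, and the 
`threshold` parameter are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "file system": a dict is a directory, anything else is a file.
# The top of the tree is a single directory, as with a non-catalog table.
TREE = {
    "table": {
        "part=1": {"a.parquet": 1, "b.parquet": 1},
        "part=2": {"c.parquet": 1},
        "part=3": {"nested": {"d.parquet": 1}},
    }
}


def list_serial(path, node):
    """List every leaf file under `node` without introducing any parallelism."""
    files = []
    for name, child in node.items():
        child_path = f"{path}/{name}"
        if isinstance(child, dict):
            files.extend(list_serial(child_path, child))
        else:
            files.append(child_path)
    return files


def list_leaf_files(dirs, threshold, pool, stats):
    """List leaf files under `dirs`, a list of (path, node) pairs at one level.

    If this level has at least `threshold` directories, each sub-tree is handed
    to the pool and listed serially there. Otherwise the level is listed in the
    calling thread and the check is repeated one level further down.
    """
    if len(dirs) >= threshold:
        stats["parallel_levels"] += 1
        results = pool.map(lambda d: list_serial(*d), dirs)
        return [f for sub in results for f in sub]

    files = []
    for path, node in dirs:
        subdirs = []
        for name, child in node.items():
            child_path = f"{path}/{name}"
            if isinstance(child, dict):
                subdirs.append((child_path, child))
            else:
                files.append(child_path)
        if subdirs:
            # Parallelism may still be introduced deeper in the tree.
            files.extend(list_leaf_files(subdirs, threshold, pool, stats))
    return files


def list_table(root, threshold=2):
    """Return (sorted leaf files, stats) for a root mapping of name -> directory."""
    stats = {"parallel_levels": 0}
    with ThreadPoolExecutor(max_workers=4) as pool:
        files = list_leaf_files(list(root.items()), threshold, pool, stats)
    return sorted(files), stats
```

    With `threshold=2`, the single top-level directory is listed in the calling 
thread, and the three partition directories beneath it are then listed in 
parallel. A check that only looked at the top of the tree would never go 
parallel for this layout.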
    
    cc @mallman  @cloud-fan 
    
    ## How was this patch tested?
    
    Checked metrics in unit tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ericl/spark spark-18679

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16112.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16112
    
----
commit 3102aa3f3e05871ab11658d2f0d9f2b2451f3a40
Author: Eric Liang <[email protected]>
Date:   2016-12-02T02:35:11Z

    Thu Dec  1 18:35:11 PST 2016

commit db664396b3892de45507d9c82eed7d070bdd82dc
Author: Eric Liang <[email protected]>
Date:   2016-12-02T02:37:23Z

    Thu Dec  1 18:37:23 PST 2016

----


