GitHub user ericl opened a pull request: https://github.com/apache/spark/pull/16112

    [SPARK-18769] [SQL] Fix regression in file listing performance for non-catalog tables

    ## What changes were proposed in this pull request?

    In Spark 2.1, ListingFileCatalog was significantly refactored (and renamed to InMemoryFileIndex). This introduced a regression where parallelism could only be introduced at the very top of the tree. However, in many cases (e.g. non-catalog tables), the top of the tree is only a single directory.

    This PR simplifies and fixes the parallel recursive listing code to allow parallelism to be introduced at any level during recursive descent (though note that once we decide to list a sub-tree in parallel, that sub-tree is listed in serial on executors).

    cc @mallman @cloud-fan

    ## How was this patch tested?

    Checked metrics in unit tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ericl/spark spark-18679

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16112.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16112

----
commit 3102aa3f3e05871ab11658d2f0d9f2b2451f3a40
Author: Eric Liang <e...@databricks.com>
Date:   2016-12-02T02:35:11Z

    Thu Dec 1 18:35:11 PST 2016

commit db664396b3892de45507d9c82eed7d070bdd82dc
Author: Eric Liang <e...@databricks.com>
Date:   2016-12-02T02:37:23Z

    Thu Dec 1 18:37:23 PST 2016

----
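
For readers who want a concrete picture of the listing strategy described in the PR description, here is a minimal sketch in plain Python. It is not the Spark implementation: `FAKE_FS`, `list_serial`, `list_files`, and the `threshold` parameter are hypothetical names made up for illustration, and a thread pool stands in for a distributed listing job. It only shows the idea that the "go parallel" decision is re-checked at every level of the descent, and that a sub-tree handed off for parallel listing is then walked in serial by whichever worker owns it.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "file system": directory -> (subdirectories, files).
# Illustrative only; none of these names are Spark APIs.
FAKE_FS = {
    "/t": (["/t/a=1", "/t/a=2", "/t/a=3"], []),
    "/t/a=1": ([], ["/t/a=1/f1", "/t/a=1/f2"]),
    "/t/a=2": (["/t/a=2/b=1"], ["/t/a=2/f3"]),
    "/t/a=2/b=1": ([], ["/t/a=2/b=1/f4"]),
    "/t/a=3": ([], ["/t/a=3/f5"]),
}

# Records each batch of paths that was handed off for parallel listing.
parallel_jobs = []


def list_serial(path):
    """Walk one whole sub-tree in serial (the worker-side behaviour)."""
    subdirs, files = FAKE_FS[path]
    result = list(files)
    for d in subdirs:
        result.extend(list_serial(d))
    return result


def list_files(paths, threshold):
    """List every file under `paths`, going parallel at any level that is wide enough."""
    if len(paths) > threshold:
        # Wide level: give each sub-tree to a worker. The thread pool is an
        # assumption standing in for a distributed job.
        parallel_jobs.append(list(paths))
        with ThreadPoolExecutor(max_workers=4) as pool:
            chunks = list(pool.map(list_serial, paths))
        return [f for chunk in chunks for f in chunk]
    # Narrow level: list locally, then re-check the threshold one level down.
    result = []
    for path in paths:
        subdirs, files = FAKE_FS[path]
        result.extend(files)
        if subdirs:
            result.extend(list_files(subdirs, threshold))
    return result


# A single root directory, as with a non-catalog table.
all_files = list_files(["/t"], threshold=2)
```

In this toy run the root level holds a single directory, so nothing is parallelized there. The hand-off happens one level down, where three partition directories exceed the threshold. A version that checked the threshold only at the top of the tree would have listed everything in serial, which is the regression the PR describes.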