GitHub user mallman opened a pull request:
https://github.com/apache/spark/pull/15998
[SPARK-18572][SQL] Add a method `listPartitionName` to `ExternalCatalog`
(Link to Jira issue: https://issues.apache.org/jira/browse/SPARK-18572)
## What changes were proposed in this pull request?
Currently Spark answers the `SHOW PARTITIONS` command by fetching all of
the table's partition metadata from the external catalog and constructing
partition names therefrom. The Hive client has a `getPartitionNames` method
which is many times faster for this purpose, with the performance improvement
scaling with the number of partitions in a table.
To test the performance impact of this PR, I ran the `SHOW PARTITIONS`
command on two Hive tables with large numbers of partitions. One table has
~17,800 partitions, and the other has ~95,000 partitions. For the purposes of
this PR, I'll call the former table `table1` and the latter table `table2`. I
ran 5 trials for each table with before-and-after versions of this PR. The
results are as follows:
Spark at bdc8153, `SHOW PARTITIONS table1`, times in seconds:
7.901
3.983
4.018
4.331
4.261
Spark at bdc8153, `SHOW PARTITIONS table2`
(Timed out after 10 minutes with a `SocketTimeoutException`.)
Spark at this PR, `SHOW PARTITIONS table1`, times in seconds:
3.801
0.449
0.395
0.348
0.336
Spark at this PR, `SHOW PARTITIONS table2`, times in seconds:
5.184
1.63
1.474
1.519
1.41
Taking the best times from each trial, we get a 12x performance improvement
for a table with ~17,800 partitions and at least a 426x improvement for a table
with ~95,000 partitions. More significantly, the latter command doesn't even
complete with the current code in master.
This is actually a patch we've been using in-house at VideoAmp since Spark
1.1. It's made all the difference in the practical usability of our largest
tables. Even with tables with about 1,000 partitions there's a performance
improvement of about 2-3x.
## How was this patch tested?
I added a unit test to `VersionsSuite` which tests that the Hive client's
`getPartitionNames` method returns the correct number of partitions.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/VideoAmp/spark-public
spark-18572-list_partition_names
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/15998.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #15998
----
commit f20ef8bd0bbc68396546c60bb816fe9caf02042f
Author: Michael Allman <[email protected]>
Date: 2016-11-23T21:50:14Z
[SPARK-18572][SQL] Add a method `listPartitionName` to `ExternalCatalog`
and an implementation in `HiveExternalCatalog` that calls the Hive
`getPartitionNames` method.
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]