[GitHub] spark pull request #15998: [SPARK-18572][SQL] Add a method `listPartitionNam...

mallman Wed, 23 Nov 2016 14:22:48 -0800

GitHub user mallman opened a pull request:

    https://github.com/apache/spark/pull/15998


    [SPARK-18572][SQL] Add a method `listPartitionName` to `ExternalCatalog`

    (Link to Jira issue: https://issues.apache.org/jira/browse/SPARK-18572)
    
    ## What changes were proposed in this pull request?
    
    Currently Spark answers the `SHOW PARTITIONS` command by fetching all of 
the table's partition metadata from the external catalog and constructing 
partition names therefrom. The Hive client has a `getPartitionNames` method 
which is many times faster for this purpose, with the performance improvement 
scaling with the number of partitions in a table.
    
    To test the performance impact of this PR, I ran the `SHOW PARTITIONS` 
command on two Hive tables with large numbers of partitions. One table has 
~17,800 partitions, and the other has ~95,000 partitions. For the purposes of 
this PR, I'll call the former table `table1` and the latter table `table2`. I 
ran 5 trials for each table with before-and-after versions of this PR. The 
results are as follows:
    
    Spark at bdc8153, `SHOW PARTITIONS table1`, times in seconds:
    7.901
    3.983
    4.018
    4.331
    4.261
    
    Spark at bdc8153, `SHOW PARTITIONS table2`
    (Timed out after 10 minutes with a `SocketTimeoutException`.)
    
    Spark at this PR, `SHOW PARTITIONS table1`, times in seconds:
    3.801
    0.449
    0.395
    0.348
    0.336
    
    Spark at this PR, `SHOW PARTITIONS table2`, times in seconds:
    5.184
    1.63
    1.474
    1.519
    1.41
    
    Taking the best times from each trial, we get a 12x performance improvement 
for a table with ~17,800 partitions and at least a 426x improvement for a table 
with ~95,000 partitions. More significantly, the latter command doesn't even 
complete with the current code in master.
    
    This is actually a patch we've been using in-house at VideoAmp since Spark 
1.1. It's made all the difference in the practical usability of our largest 
tables. Even with tables with about 1,000 partitions there's a performance 
improvement of about 2-3x.
    
    ## How was this patch tested?
    
    I added a unit test to `VersionsSuite` which tests that the Hive client's 
`getPartitionNames` method returns the correct number of partitions.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/VideoAmp/spark-public 
spark-18572-list_partition_names

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15998.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #15998
    
----
commit f20ef8bd0bbc68396546c60bb816fe9caf02042f
Author: Michael Allman <[email protected]>
Date:   2016-11-23T21:50:14Z

    [SPARK-18572][SQL] Add a method `listPartitionName` to `ExternalCatalog`
    and an implementation in `HiveExternalCatalog` that calls the Hive
    `getPartitionNames` method.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request #15998: [SPARK-18572][SQL] Add a method `listPartitionNam...

Reply via email to