[GitHub] spark pull request #15229: [SPARK-17654] [SQL] Propagate bucketing informati...

tejasapatil Fri, 23 Sep 2016 22:41:44 -0700

GitHub user tejasapatil opened a pull request:

    https://github.com/apache/spark/pull/15229


    [SPARK-17654] [SQL] Propagate bucketing information for Hive tables to / from Catalog

    ## What changes were proposed in this pull request?
    
    Currently Spark does not respect bucketing for Hive tables. This PR includes the following changes:
    
    - `HiveClientImpl` now extracts the table's bucketing information.
    - While writing table info to the metastore, `MetastoreRelation` now populates the bucketing information in the Hive `Table` object.
    - `HiveTableScanExec` now exposes `outputPartitioning` and `outputOrdering` as per the bucketing spec.
    - `InsertIntoHiveTable` now exposes `requiredChildDistribution` and `requiredChildOrdering` based on the target table's bucketing spec.
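
    The last two changes can be illustrated with a minimal sketch of how a bucketing spec might translate into a required distribution and ordering. All of the types below are hypothetical stand-ins written for illustration; they are not Spark's `BucketSpec`, `Distribution`, or `SortOrder` classes, and the actual patch may derive these differently.

```scala
// Hypothetical stand-ins for illustration only; these are NOT Spark's classes.
final case class BucketSpecSketch(
    numBuckets: Int,
    bucketColumnNames: Seq[String],
    sortColumnNames: Seq[String])

// What a writer would ask of its child plan (simplified).
sealed trait DistributionSketch
case object UnspecifiedSketch extends DistributionSketch
final case class ClusteredSketch(columns: Seq[String]) extends DistributionSketch

final case class SortOrderSketch(column: String, ascending: Boolean = true)

object BucketingRequirements {
  // Rows with equal bucket-column values must land in the same partition.
  // Without a bucketing spec there is no requirement.
  def requiredDistribution(spec: Option[BucketSpecSketch]): DistributionSketch =
    spec match {
      case Some(s) if s.bucketColumnNames.nonEmpty => ClusteredSketch(s.bucketColumnNames)
      case _ => UnspecifiedSketch
    }

  // Within each partition, rows must be sorted by the table's sort columns
  // (ascending is assumed here for simplicity).
  def requiredOrdering(spec: Option[BucketSpecSketch]): Seq[SortOrderSketch] =
    spec.toSeq.flatMap(_.sortColumnNames).map(c => SortOrderSketch(c))
}
```

    For a table bucketed by `user_id` and sorted by `ts`, this sketch yields a clustering requirement on `user_id` and an ascending sort requirement on `ts`; for a table with no bucketing spec it yields no requirement at all.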
    
    TODOs (which will be done in linked PRs and not this one):
    
    - [ ] `ClusteredDistribution` does not guarantee the number of partitions generated, which corresponds to the number of output bucket files created. Fixing this will require adding strict guarantees to `ClusteredDistribution`. I think it needs more thought and is better done incrementally than packed into this PR.
    - [ ] While writing to bucketed files, Hive's hashing function should be used. I have a PR open to implement Hive hashing natively in Spark: https://github.com/apache/spark/pull/15047
    - [ ] Allow creating Hive bucketed tables
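
    To make the first two TODOs concrete, here is a rough sketch of how a row could be assigned to a bucket file. It assumes Hive's convention of clearing the sign bit of the hash and taking it modulo the number of buckets, with INT values hashing to themselves and BIGINT values folding their upper and lower halves together. The object and method names are hypothetical, and the hashing rules should be checked against the Hive source before relying on them.

```scala
// Hypothetical helper for illustration; not part of Spark or Hive.
object HiveBucketSketch {
  // Assumed Hive-style hash of an INT column value: the value itself.
  def hashInt(v: Int): Int = v

  // Assumed Hive-style hash of a BIGINT column value:
  // XOR the high 32 bits into the low 32 bits.
  def hashLong(v: Long): Int = (v ^ (v >>> 32)).toInt

  // Clear the sign bit so the modulo result is never negative,
  // then map the hash onto one of numBuckets bucket files.
  def bucketId(hash: Int, numBuckets: Int): Int = {
    require(numBuckets > 0, "numBuckets must be positive")
    (hash & Int.MaxValue) % numBuckets
  }
}
```

    Because the bucket id depends on `numBuckets`, the writer has to produce exactly that many partitions for the output files to line up with buckets, which is why the missing guarantee in `ClusteredDistribution` matters.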
    
    ## How was this patch tested?
    
    Tested with Hive tables created locally. Adding a new test case will require implementing bucketed table creation, which is not supported :( Suggestions welcome.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tejasapatil/spark SPARK-17654_hive_extract_bucketing

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15229.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #15229
    
----
commit caef89a198dac2fee4afaad622e2ecc11f200836
Author: Tejas Patil <tej...@fb.com>
Date:   2016-08-23T20:45:00Z

    Support bucketing for Hive tables

commit ee79dd2ae1e174ab38fc5f6b10f5a9a2e2721533
Author: Tejas Patil <tej...@fb.com>
Date:   2016-08-23T20:45:00Z

    Support bucketing for Hive tables

commit 8726cc6430cbeaf8c2eebd7cef40199a7c563218
Author: Tejas Patil <tej...@fb.com>
Date:   2016-09-24T03:22:07Z

    Merge remote-tracking branch 'origin/SPARK-17654_hive_extract_bucketing' into SPARK-17654_hive_extract_bucketing_2
    
    # Conflicts:
    #   sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala

----


