[GitHub] spark pull request #17644: [SPARK-17729] [SQL] Enable creating hive bucketed...

tejasapatil Sat, 15 Apr 2017 16:46:38 -0700

GitHub user tejasapatil opened a pull request:

    https://github.com/apache/spark/pull/17644


    [SPARK-17729] [SQL] Enable creating hive bucketed tables

    ## What changes were proposed in this pull request?
    
    Hive allows inserting data into a bucketed table without guaranteeing that
    the output is bucketed and sorted, depending on two configs:
    `hive.enforce.bucketing` and `hive.enforce.sorting`.
    
    What does this PR achieve?
    - By default, Spark will disallow writes to Hive bucketed tables, because
      its output would not adhere to Hive's bucketing semantics.
    - If a user still wants to write to a Hive bucketed table, the only option
      is to set `hive.enforce.bucketing=false` and `hive.enforce.sorting=false`,
      which means the user does NOT care about bucketing guarantees.
    
    Changes done in this PR:
    - Extract the table's bucketing information in `HiveClientImpl`
    - While writing table info to the metastore, `HiveClientImpl` now populates
      the bucketing information in the Hive `Table` object
    - `InsertIntoHiveTable` allows inserts into a bucketed table only if both
      `hive.enforce.bucketing` and `hive.enforce.sorting` are `false`
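
The insert rule in the last bullet can be illustrated with a minimal sketch. This is not the Scala code from the PR: `BucketSpec`, `is_enforced` and `check_insert_allowed` are hypothetical names, the configuration is modelled as a plain dictionary, and treating an unset key as "enforced" is an assumption chosen to match the "disallow by default" behaviour described above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class BucketSpec:
    """Hypothetical stand-in for a table's bucketing metadata."""
    num_buckets: int
    bucket_columns: List[str]
    sort_columns: List[str] = field(default_factory=list)


def is_enforced(conf: Dict[str, str], key: str) -> bool:
    # Assumption for this sketch: an unset key counts as enforced, so that
    # writes to bucketed tables are rejected unless the user opts out.
    return conf.get(key, "true").strip().lower() == "true"


def check_insert_allowed(bucket_spec: Optional[BucketSpec],
                         conf: Dict[str, str]) -> None:
    """Raise ValueError if an insert into a bucketed table must be rejected."""
    if bucket_spec is None:
        # Non-bucketed tables are not affected by the two configs.
        return
    enforce_bucketing = is_enforced(conf, "hive.enforce.bucketing")
    enforce_sorting = is_enforced(conf, "hive.enforce.sorting")
    if enforce_bucketing or enforce_sorting:
        # The insert goes through only when BOTH configs are false.
        raise ValueError(
            "Output would not satisfy the table's bucketing/sorting; set "
            "hive.enforce.bucketing=false and hive.enforce.sorting=false "
            "to insert anyway."
        )
```

With this sketch, `check_insert_allowed(spec, {})` raises for any bucketed `spec`, while passing both keys set to `"false"` returns without error. Setting only one of the two keys to `"false"` is still rejected.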
    
    The ability to create bucketed tables will make it possible to add test
    cases to Spark as I add the pieces that make Spark support Hive bucketing
    (e.g. https://github.com/apache/spark/pull/15229,
    https://github.com/apache/spark/pull/15047,
    https://github.com/apache/spark/pull/15040).
    
    ## How was this patch tested?
    - Added a test for creating a bucketed and sorted table.
    - Added a test to ensure that INSERTs fail if strict bucketing / sorting is
      enforced.
    - Added a test to ensure that INSERTs go through if strict bucketing /
      sorting is NOT enforced.
    - Added a test to validate that bucketing information shows up in the
      output of `DESC FORMATTED`.
    - Added a test to ensure that `SHOW CREATE TABLE` works for Hive bucketed
      tables.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tejasapatil/spark SPARK-17729_create_bucketed_table

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17644.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17644
    
----
commit 0348a969a43872da484dd13f1f211b966839baec
Author: Tejas Patil <[email protected]>
Date:   2017-04-15T23:42:46Z

    [SPARK-17729] [SQL] Enable creating hive bucketed tables

----


