[GitHub] spark pull request #15228: [SPARK-17654] [SQL] Propagate bucketing informati...

tejasapatil Fri, 23 Sep 2016 19:45:07 -0700

GitHub user tejasapatil opened a pull request:

    https://github.com/apache/spark/pull/15228


    [SPARK-17654] [SQL] Propagate bucketing information for Hive tables to / from Catalog

    ## What changes were proposed in this pull request?
    
    Currently Spark does not respect bucketing for Hive tables. This PR includes the following changes:
    
    - `HiveClientImpl` now extracts the table's bucketing information.
    - While writing table info to the metastore, `MetastoreRelation` now populates the bucketing information in the Hive `Table` object.
    - `HiveTableScanExec` now exposes `outputPartitioning` and `outputOrdering` as per the bucketing spec.
    - `InsertIntoHiveTable` now exposes `requiredChildDistribution` and `requiredChildOrdering` based on the target table's bucketing spec.
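
To make the changes above concrete, here is a minimal, Spark-independent sketch in Python of how a bucketing spec could drive what a scan promises and what an insert requires. Everything in it is a simplified model for illustration: the class and function names are hypothetical and are not the Spark classes touched by this PR.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class BucketSpec:
    """Hypothetical stand-in for a table's bucketing metadata."""
    num_buckets: int
    bucket_columns: Tuple[str, ...]
    sort_columns: Tuple[str, ...] = ()


def scan_output_partitioning(spec: BucketSpec) -> Dict:
    # A scan over a bucketed table can report that its output is already
    # hash-partitioned on the bucket columns, one partition per bucket.
    return {"kind": "hash",
            "columns": spec.bucket_columns,
            "num_partitions": spec.num_buckets}


def scan_output_ordering(spec: BucketSpec) -> List[str]:
    # Within each bucket, rows are ordered by the sort columns (if any).
    return list(spec.sort_columns)


def insert_required_distribution(spec: BucketSpec) -> Dict:
    # An insert into a bucketed table asks its child to cluster rows
    # on the bucket columns of the target table.
    return {"kind": "clustered", "columns": spec.bucket_columns}


def insert_required_ordering(spec: BucketSpec) -> List[str]:
    return list(spec.sort_columns)


def shuffle_needed(child_partitioning: Dict, required: Dict) -> bool:
    # Simplified rule: no shuffle when the child is already
    # hash-partitioned on exactly the required clustering columns.
    satisfied = (child_partitioning["kind"] == "hash"
                 and child_partitioning["columns"] == required["columns"])
    return not satisfied


spec = BucketSpec(num_buckets=8,
                  bucket_columns=("user_id",),
                  sort_columns=("user_id",))
print(shuffle_needed(scan_output_partitioning(spec),
                     insert_required_distribution(spec)))
```

In this model, reading one bucketed table and writing into another with the same bucket columns needs no shuffle, which is the kind of saving that exposing the bucketing information to the planner is meant to allow.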
    
    TODOs (which will be done in linked PRs and not this one):
    
    - [ ] `ClusteredDistribution` does not guarantee the number of partitions generated, and that number determines how many output bucket files are created. Fixing this will require adding strict guarantees to `ClusteredDistribution`. I think it needs more thought and is better done incrementally rather than packed into this PR.
    - [ ] While writing to bucketed files, Hive's hashing function should be used. I have a PR open to implement Hive hashing natively in Spark: https://github.com/apache/spark/pull/15047
    - [ ] Allow creating Hive bucketed tables
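
The first two TODOs can be illustrated with a small Python sketch. It rests on stated assumptions: that Hive-style bucketing hashes an integer to its own value and masks the sign bit before taking the modulus, and that `other_int_hash` is a made-up stand-in for any different hash function. It is not code from either PR.

```python
def bucket_id(hash_code: int, num_buckets: int) -> int:
    # Mask the sign bit so the bucket id is never negative, then take the modulus.
    return (hash_code & 0x7FFFFFFF) % num_buckets


def hive_style_int_hash(value: int) -> int:
    # Assumption: an integer hashes to itself in Hive-style bucketing.
    return value


def other_int_hash(value: int) -> int:
    # Hypothetical stand-in for any other hash function.
    return value * 31 + 17


def files_written(keys, num_partitions: int, hash_fn) -> int:
    # One output file per non-empty partition.
    return len({bucket_id(hash_fn(k), num_partitions) for k in keys})


declared_buckets = 4
keys = range(16)

# TODO 1: if the clustering step picks 3 partitions for a table that declares
# 4 buckets, the number of files written no longer matches the bucket count.
print(files_written(keys, 3, hive_style_int_hash), "files for", declared_buckets, "buckets")

# TODO 2: two different hash functions place the same key in different buckets,
# so a reader assuming one function cannot trust files written with the other.
key = 5
print(bucket_id(hive_style_int_hash(key), declared_buckets),
      bucket_id(other_int_hash(key), declared_buckets))
```

Either mismatch would make the written files inconsistent with the table's declared bucketing, which is why both items are called out as follow-up work.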
    
    ## How was this patch tested?
    
    Tested with Hive tables created locally. Adding a new test case would require implementing bucketed table creation, which is not supported :( Suggestions welcome.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tejasapatil/spark SPARK-17654_hive_extract_bucketing

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15228.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #15228
    