GitHub user tejasapatil opened a pull request:
https://github.com/apache/spark/pull/15229
[SPARK-17654] [SQL] Propagate bucketing information for Hive tables to / from Catalog
## What changes were proposed in this pull request?
Currently, Spark does not respect bucketing for Hive tables. This PR
includes the following changes:
- `HiveClientImpl` now extracts the table's bucketing information
- while writing table info to the metastore, `MetastoreRelation` now populates
the bucketing information in the Hive `Table` object
- `HiveTableScanExec` now exposes `outputPartitioning` and `outputOrdering`
as per the table's bucketing spec.
- `InsertIntoHiveTable` now exposes `requiredChildDistribution` and
`requiredChildOrdering` based on the target table's bucketing spec.
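To make the scan-side change concrete, here is a minimal, self-contained sketch of how a bucketing spec could be turned into a reported partitioning and ordering. The types below (`BucketSpec`, `HashPartitioning`, `UnknownPartitioning`, `SortOrder`, `BucketedScan`) are simplified, hypothetical stand-ins written for this illustration; they are not Spark's actual classes and not the code in this PR.

```scala
// Hypothetical, simplified stand-ins. These are NOT Spark's actual classes.
final case class BucketSpec(
    numBuckets: Int,
    bucketColumns: Seq[String],
    sortColumns: Seq[String])

sealed trait Partitioning
final case class HashPartitioning(columns: Seq[String], numPartitions: Int) extends Partitioning
final case class UnknownPartitioning(numPartitions: Int) extends Partitioning

final case class SortOrder(column: String, ascending: Boolean)

object BucketedScan {
  // Report hash partitioning on the bucket columns only when the table is
  // bucketed and every bucket column is part of the scan's output.
  def outputPartitioning(spec: Option[BucketSpec], output: Seq[String]): Partitioning =
    spec match {
      case Some(s) if s.bucketColumns.forall(output.contains) =>
        HashPartitioning(s.bucketColumns, s.numBuckets)
      case _ =>
        UnknownPartitioning(0)
    }

  // Report an ordering only when every sort column is part of the scan's output.
  // Assumption: sort columns are treated as ascending in this sketch.
  def outputOrdering(spec: Option[BucketSpec], output: Seq[String]): Seq[SortOrder] =
    spec match {
      case Some(s) if s.sortColumns.forall(output.contains) =>
        s.sortColumns.map(c => SortOrder(c, ascending = true))
      case _ =>
        Seq.empty
    }
}
```

The write side would be the mirror image: instead of reporting what the data already looks like, the insert operator would require its child to be distributed and sorted according to the target table's spec.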
TODOs (which will be done in linked PRs and not this one):
- [ ] `ClusteredDistribution` does not guarantee the number of partitions
generated (which corresponds to the number of output bucket files created).
Fixing this will require adding strict guarantees to `ClusteredDistribution`.
I think this needs more thought and is better done incrementally rather than
packed into this PR.
- [ ] While writing to bucketed files, Hive's hashing function should be
used. I have a PR open to implement Hive hashing natively in Spark:
https://github.com/apache/spark/pull/15047
- [ ] Allow creating Hive bucketed tables
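To illustrate why the hashing function matters for the second TODO, the sketch below shows how a row could be assigned to a bucket file under a Hive-style scheme. It assumes, as Hive's behavior is commonly described, that an `INT` column hashes to its own value, that a `BIGINT` column folds its upper and lower 32 bits together, and that the bucket is the sign-masked hash modulo the bucket count. The object and method names are hypothetical, and none of this is taken from the linked PR.

```scala
// Hypothetical helper names; the hashing rules are assumptions about Hive's scheme.
object HiveStyleBucketing {
  // Assumed Hive-style hash of an INT column: the value itself.
  def hashInt(v: Int): Int = v

  // Assumed Hive-style hash of a BIGINT column: XOR of the upper and lower 32 bits.
  def hashLong(v: Long): Int = ((v >>> 32) ^ v).toInt

  // Mask off the sign bit so the modulo result is never negative.
  def bucketId(hash: Int, numBuckets: Int): Int = {
    require(numBuckets > 0, "numBuckets must be positive")
    (hash & Int.MaxValue) % numBuckets
  }
}
```

A writer that hashes rows with any other function would place them in different bucket files, so readers that rely on the Hive layout would look for a key in the wrong file.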
## How was this patch tested?
Tested with Hive tables created locally. Adding a new test case requires
implementing bucketed table creation, which is not yet supported :(
Suggestions welcome.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/tejasapatil/spark SPARK-17654_hive_extract_bucketing
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/15229.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #15229
----
commit caef89a198dac2fee4afaad622e2ecc11f200836
Author: Tejas Patil <[email protected]>
Date: 2016-08-23T20:45:00Z
Support bucketing for Hive tables
commit ee79dd2ae1e174ab38fc5f6b10f5a9a2e2721533
Author: Tejas Patil <[email protected]>
Date: 2016-08-23T20:45:00Z
Support bucketing for Hive tables
commit 8726cc6430cbeaf8c2eebd7cef40199a7c563218
Author: Tejas Patil <[email protected]>
Date: 2016-09-24T03:22:07Z
Merge remote-tracking branch 'origin/SPARK-17654_hive_extract_bucketing' into SPARK-17654_hive_extract_bucketing_2
# Conflicts:
# sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala
----