GitHub user tejasapatil opened a pull request:
https://github.com/apache/spark/pull/15228
[SPARK-17654] [SQL] Propagate bucketing information for Hive tables to /
from Catalog
## What changes were proposed in this pull request?
Currently Spark does not respect bucketing for Hive tables. This PR
includes the following changes:
- `HiveClientImpl` now extracts the table's bucketing information
- When writing table info to the metastore, `MetastoreRelation` now populates
the bucketing information in the Hive `Table` object
- `HiveTableScanExec` now exposes `outputPartitioning` and `outputOrdering`
as per the bucketing spec.
- `InsertIntoHiveTable` now exposes `requiredChildDistribution` and
`requiredChildOrdering` based on the target table's bucketing spec.
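
To make the last two bullets concrete, here is a minimal toy model in Python. It is not Spark code: the `BucketSpec` class only mirrors the idea of a bucketing spec (bucket count, bucket columns, sort columns), and `scan_output_properties` and `insert_child_requirements` are hypothetical helper names invented for this sketch. It only illustrates which properties a scan can advertise and which requirements an insert can place on its child.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class BucketSpec:
    """Toy stand-in for a table's bucketing spec (hypothetical, not Spark's class)."""
    num_buckets: int
    bucket_columns: Tuple[str, ...]
    sort_columns: Tuple[str, ...] = ()


def scan_output_properties(spec: BucketSpec) -> Dict[str, tuple]:
    """What a scan of a bucketed table could advertise to the planner."""
    return {
        # Rows are hash-partitioned on the bucket columns, one partition per bucket.
        "output_partitioning": ("hash", spec.bucket_columns, spec.num_buckets),
        # Within a bucket, rows are ordered by the sort columns (if any).
        "output_ordering": spec.sort_columns,
    }


def insert_child_requirements(spec: BucketSpec) -> Dict[str, tuple]:
    """What an insert into a bucketed table could require from its child plan."""
    return {
        # Clustered on the bucket columns; note that no partition count is carried here.
        "required_child_distribution": ("clustered", spec.bucket_columns),
        "required_child_ordering": spec.sort_columns,
    }


spec = BucketSpec(num_buckets=8, bucket_columns=("user_id",), sort_columns=("ts",))
print(scan_output_properties(spec))
print(insert_child_requirements(spec))
```

In this model the insert-side requirement carries no bucket count, which is the gap the first TODO item describes.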
TODOs (which will be done in linked PRs and not this one):
- [ ] `ClusteredDistribution` does not guarantee the number of partitions
generated (which corresponds to the number of output bucket files created).
Fixing this will require adding stricter guarantees to `ClusteredDistribution`.
I think it needs more thought and is better done incrementally than packed
into this PR.
- [ ] While writing to bucketed files, Hive's hashing function should be
used. I have a PR open to implement Hive hashing natively in Spark:
https://github.com/apache/spark/pull/15047
- [ ] Allow creating Hive bucketed tables
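
As a rough illustration of why the hashing function matters for the second TODO, the sketch below shows how a row could be assigned to a bucket under Hive-style hashing. It rests on two assumptions that should be checked against the Hive source: an `int` column hashes to its own value, and the bucket id is the non-negative hash modulo the bucket count. The function names are made up for this example. Spark's native bucketing uses a Murmur3-based hash instead, so the same value can land in a different bucket.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def hive_hash_int(value: int) -> int:
    """Assumed Hive hash of an int column: the value itself, as a signed 32-bit int."""
    value &= 0xFFFFFFFF  # wrap to 32 bits, like a Java int
    return value - (1 << 32) if value >= (1 << 31) else value


def hive_bucket_id(hash_code: int, num_buckets: int) -> int:
    """Assumed Hive bucket assignment: clear the sign bit, then take the modulo."""
    return (hash_code & 0x7FFFFFFF) % num_buckets


def assign_buckets(values: Iterable[int], num_buckets: int) -> Dict[int, List[int]]:
    """Group int keys by the bucket file they would be written to."""
    buckets: Dict[int, List[int]] = defaultdict(list)
    for v in values:
        buckets[hive_bucket_id(hive_hash_int(v), num_buckets)].append(v)
    return dict(buckets)


print(assign_buckets([0, 1, 8, 9, -1], num_buckets=8))
```

If the writer and the reader disagree on this function, rows end up in buckets the reader does not expect, and bucket-based optimizations would produce wrong results.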
## How was this patch tested?
Tested with Hive tables created locally. Adding a new test case would require
implementing bucketed table creation, which is not supported :( Suggestions
welcome.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/tejasapatil/spark SPARK-17654_hive_extract_bucketing
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/15228.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #15228