GitHub user cloud-fan opened a pull request:
https://github.com/apache/spark/pull/17187
[SPARK-19847][SQL] port hive read to FileFormat API
## What changes were proposed in this pull request?
Implement the read logic in `HiveFileFormat` to unify the table read path
between data source and Hive serde tables.
The major change: a Hive partition may have a different serde than the
table, so the planner should put more information in `PartitionedFile` and
send it to executors.
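
To illustrate the idea (not the actual patch), here is a minimal Python sketch of a per-file descriptor that carries its partition's serde, so the executor side can pick the deserializer per file rather than per table. `PartitionedFileSketch`, the `serde` field, the `DESERIALIZERS` registry, and the paths are all hypothetical names made up for this example; they are not Spark's `PartitionedFile` API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


# Hypothetical stand-in for the per-file descriptor the planner ships to executors.
@dataclass(frozen=True)
class PartitionedFileSketch:
    path: str
    partition_values: Dict[str, str]
    serde: str  # assumed extra field: serde of the partition that owns this file


# Hypothetical registry of deserializers keyed by serde name.
DESERIALIZERS: Dict[str, Callable[[str], List[str]]] = {
    "csv_like": lambda line: line.split(","),
    "tsv_like": lambda line: line.split("\t"),
}


def read_record(file: PartitionedFileSketch, raw_line: str) -> List[str]:
    """Executor side: choose the deserializer from the file descriptor."""
    return DESERIALIZERS[file.serde](raw_line)


# Two partitions of the same table, each written with a different serde.
files = [
    PartitionedFileSketch("/warehouse/t/ds=1/part-0", {"ds": "1"}, "csv_like"),
    PartitionedFileSketch("/warehouse/t/ds=2/part-0", {"ds": "2"}, "tsv_like"),
]
rows = [read_record(files[0], "a,b"), read_record(files[1], "a\tb")]
```

Both records decode to the same columns even though the partitions use different formats, because the serde choice travels with each file.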
Two things need to be improved in the future:
1. Due to the way we read Hive table files, we do not support reading a
partial file yet, which may reduce parallelism for large files.
2. Hive tables with a storage handler (non-file-based) still go through the
old code path.
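
As a rough illustration of the first limitation, the sketch below counts read tasks when each file must be read whole versus when files can be cut into fixed-size splits. The `task_count` helper, the file sizes, and the 128 MB split size are assumptions chosen for the example, and the model ignores details such as packing small files together.

```python
import math


def task_count(file_sizes, split_size=None):
    """Count read tasks: one per file if unsplittable, else one per chunk."""
    if split_size is None:
        return len(file_sizes)
    return sum(max(1, math.ceil(size / split_size)) for size in file_sizes)


MB = 1024 * 1024
sizes = [1024 * MB, 64 * MB]  # assumed: one large file and one small file

whole_file_tasks = task_count(sizes)                   # files read whole
split_tasks = task_count(sizes, split_size=128 * MB)   # files read in splits
```

Under these assumptions the large file is handled by a single task when it cannot be split, where splitting would spread it across several tasks.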
## How was this patch tested?
Existing tests.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/cloud-fan/spark hive-read
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/17187.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #17187
----
commit ad478870c652553b8b225a569e460fb6ccef0c36
Author: Wenchen Fan <[email protected]>
Date: 2017-03-02T07:15:42Z
port hive read to FileFormat API
----