GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/17187

    [SPARK-19847][SQL] port hive read to FileFormat API

    ## What changes were proposed in this pull request?
    
    Implement the read logic in `HiveFileFormat` to unify the table read
    path between data source tables and Hive serde tables.
    
    The major change: a Hive partition may use a different serde from its
    table's, so the planner needs to put more information in
    `PartitionedFile` and send it to the executors.
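
To illustrate why per-partition serde information has to travel with each file, here is a minimal, language-neutral sketch in Python. It is not Spark's actual `PartitionedFile` definition or planner code: `SerdeInfo`, `HivePartition`, `FileTask`, and `plan_file_tasks` are hypothetical names, and the fallback to the table-level serde is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SerdeInfo:
    """Hypothetical bundle of storage details needed to decode a file."""
    serde_class: str
    input_format: str
    properties: Dict[str, str] = field(default_factory=dict)


@dataclass
class HivePartition:
    """Hypothetical partition: its files, its values, and an optional serde."""
    files: List[str]
    values: Dict[str, str]
    serde: Optional[SerdeInfo] = None  # None means "use the table's serde"


@dataclass
class FileTask:
    """Hypothetical unit of work shipped to an executor."""
    path: str
    partition_values: Dict[str, str]
    serde: SerdeInfo


def plan_file_tasks(partitions: List[HivePartition],
                    table_serde: SerdeInfo) -> List[FileTask]:
    """Attach the effective serde to every file so executors can decode it."""
    tasks = []
    for partition in partitions:
        # Assumption: a partition-level serde overrides the table-level one.
        serde = partition.serde or table_serde
        for path in partition.files:
            tasks.append(FileTask(path, partition.values, serde))
    return tasks


text_serde = SerdeInfo(
    "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
    "org.apache.hadoop.mapred.TextInputFormat",
)
orc_serde = SerdeInfo(
    "org.apache.hadoop.hive.ql.io.orc.OrcSerde",
    "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat",
)
partitions = [
    HivePartition(["/t/ds=1/a", "/t/ds=1/b"], {"ds": "1"}),         # table serde
    HivePartition(["/t/ds=2/a"], {"ds": "2"}, serde=orc_serde),     # own serde
]
tasks = plan_file_tasks(partitions, table_serde=text_serde)
```

In this sketch each `FileTask` is self-describing, so an executor does not need to look up the partition's metadata again before reading the file.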
    
    Two things need to be improved in the future:
    1. Because of the way we read Hive table files, we do not yet support
       reading part of a file, which may reduce parallelism for large files.
    2. Hive tables with a storage handler (non-file-based) still go through
       the old code path.
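
The parallelism cost of the first limitation can be sketched as follows. This is an illustrative model only, not Spark's split-planning code: `plan_splits` is a hypothetical helper, and the 128 MiB split size and 1 GiB file size are assumed values chosen for the example.

```python
from typing import List, Tuple

Split = Tuple[str, int, int]  # (path, start offset, length in bytes)


def plan_splits(files: List[Tuple[str, int]],
                max_split_bytes: int,
                splittable: bool) -> List[Split]:
    """Return one read task per split; unsplittable files get a single task."""
    splits = []
    for path, size in files:
        if not splittable:
            # Whole-file read: one task regardless of how large the file is.
            splits.append((path, 0, size))
            continue
        offset = 0
        while offset < size:
            length = min(max_split_bytes, size - offset)
            splits.append((path, offset, length))
            offset += length
    return splits


MIB = 1024 ** 2
files = [("part-00000", 1024 * MIB)]  # a single 1 GiB file (assumed size)

partial_reads = plan_splits(files, 128 * MIB, splittable=True)
whole_file_reads = plan_splits(files, 128 * MIB, splittable=False)
print(len(partial_reads), len(whole_file_reads))
```

Under these assumptions the same file yields eight tasks when partial reads are allowed and only one when the file must be read whole.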
    
    ## How was this patch tested?
    
    Existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark hive-read

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17187.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17187
    
----
commit ad478870c652553b8b225a569e460fb6ccef0c36
Author: Wenchen Fan <[email protected]>
Date:   2017-03-02T07:15:42Z

    port hive read to FileFormat API

----


