GitHub user liancheng opened a pull request:
https://github.com/apache/spark/pull/6194
[SPARK-2883] [SQL] ORC data source for Spark SQL
This PR is an update of #6135 authored by @zhzhan from Hortonworks.
----
This PR implements a Spark SQL data source for accessing ORC files.
> **NOTE**
>
> Although ORC is now an Apache TLP, the codebase is still tightly coupled
> with Hive. That's why the new ORC data source is under the
> `org.apache.spark.sql.hive` package.
## New Features
1. New save/load methods provided:
- `df.saveAsOrcFile()`
Used to save the table in ORC format.
- `sqlContext.orcFile()`
Used to import an ORC file as a Spark SQL table.
To use these two methods, add the following import, which brings the
corresponding implicit conversions into scope:
```scala
import org.apache.spark.sql.hive.orc._
```
1. Support for complex data types (i.e. array, map, and struct)
1. Aware of common optimizations provided by Spark SQL:
- Column pruning
- Partition pruning
- Filter push-down
1. Saving/loading ORC files without contacting the Hive metastore
1. ORC files are accessed through `HiveContext`. The only reason is packaging:
we don't want to bring a Hive dependency into Spark SQL.
Note that ORC operations do not rely on the Hive metastore.
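
As a rough illustration of the save/load methods listed above, here is a minimal sketch of a round trip through an ORC file. It assumes that `saveAsOrcFile` and `orcFile` each take a single path string, which this description does not spell out; the `Person` case class, the `OrcRoundTrip` object, and the `adultNames` helper are hypothetical names used only for this example.

```scala
import org.apache.spark.sql.hive.HiveContext
// Brings the implicit conversions for saveAsOrcFile/orcFile into scope
// (as described in this PR).
import org.apache.spark.sql.hive.orc._

object OrcRoundTrip {
  // Hypothetical record type used only for this sketch.
  case class Person(name: String, age: Int)

  // Writes a small DataFrame as ORC under `path`, reads it back, and
  // returns the names of people aged 18 or over.
  // Assumption: both ORC methods accept a path string.
  def adultNames(sqlContext: HiveContext, path: String): Array[String] = {
    import sqlContext.implicits._

    val people = sqlContext.sparkContext
      .parallelize(Seq(Person("Alice", 30), Person("Bob", 17)))
      .toDF()

    // Save in ORC format; no Hive metastore table is involved.
    people.saveAsOrcFile(path)

    // Load the ORC file back as a DataFrame.
    val loaded = sqlContext.orcFile(path)

    // The filter is a candidate for filter push-down, and selecting a
    // single column is a candidate for column pruning.
    loaded.filter(loaded("age") >= 18)
      .select("name")
      .collect()
      .map(_.getString(0))
  }
}
```

Calling `OrcRoundTrip.adultNames(new HiveContext(sc), somePath)` with a path that does not exist yet should return only the adult's name. Whether the filter and projection are actually pushed into the ORC reader depends on the data source implementation in this PR.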
## Future Work
1. Schema evolution support
1. Hive metastore table conversion
## Acknowledgements
This PR also includes initial work done by @scwf from Huawei (PR #3753).
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/liancheng/spark polishing-orc
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/6194.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #6194
----
commit 62fef1829f78b15c5caf6fc825ebcdf045eecbe5
Author: Zhan Zhang <[email protected]>
Date: 2015-05-13T16:21:26Z
orc data source support
commit cd1b4340d35cb9ff9c820329e6d6e6dda094b2f0
Author: Zhan Zhang <[email protected]>
Date: 2015-05-13T19:17:02Z
minor change
commit aced00f8acb6f18f6f8644fa6dd99affa186513f
Author: Zhan Zhang <[email protected]>
Date: 2015-05-13T23:12:49Z
predicate fix
commit f156bf0af97a0ac11392c59c99e947eef04b96b7
Author: Zhan Zhang <[email protected]>
Date: 2015-05-14T00:01:06Z
reuse test suite
commit 22b8a58c548db143f3e5245993a4aaacfd0802ff
Author: Zhan Zhang <[email protected]>
Date: 2015-05-14T02:48:02Z
save mode fix
commit 00dd24c1a83796a6016aa2bb945c759587480f35
Author: Zhan Zhang <[email protected]>
Date: 2015-05-14T20:19:30Z
resolve review comments
commit 3501a9b70161ad41ef4b5718c2b57fb32188d5e9
Author: Zhan Zhang <[email protected]>
Date: 2015-05-14T20:22:07Z
resolve review comments
commit 4bc937fa37c2674c007726f6c9bb25911378049f
Author: Cheng Lian <[email protected]>
Date: 2015-05-15T16:17:51Z
Polishes the ORC data source
----