GitHub user tejasapatil opened a pull request:
https://github.com/apache/spark/pull/11891
Use ORC data source for SQL queries on ORC tables
## What changes were proposed in this pull request?
This patch enables the use of `OrcRelation` for SQL queries that read data from
ORC-backed Hive tables. Changes in this patch:
- Added a new rule `OrcConversions` that rewrites the plan to use
`OrcRelation`. In this patch, the conversion is done only for reads.
- Added a new config `spark.sql.hive.convertMetastoreOrc` to enable or disable
the conversion.
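As a rough, self-contained sketch of what such a rewrite rule does (the node types below are simplified illustrative stand-ins, not Spark's real Catalyst classes, and the warehouse path is hypothetical):

```scala
// Illustrative sketch only: simplified stand-ins for Spark's plan nodes.
sealed trait LogicalPlan
case class MetastoreRelation(db: String, table: String) extends LogicalPlan
case class OrcRelation(paths: Seq[String]) extends LogicalPlan
case class Project(columns: Seq[String], child: LogicalPlan) extends LogicalPlan

object OrcConversions {
  // Stands in for the spark.sql.hive.convertMetastoreOrc flag.
  var convertMetastoreOrc: Boolean = true

  // Rewrite metastore ORC relations to the ORC data source (reads only).
  def apply(plan: LogicalPlan): LogicalPlan = plan match {
    case MetastoreRelation(_, table) if convertMetastoreOrc =>
      OrcRelation(Seq(s"/user/hive/warehouse/$table")) // hypothetical path
    case Project(cols, child) => Project(cols, apply(child))
    case other => other
  }
}
```

With the flag on, applying the rule to a `Project` over a `MetastoreRelation` yields a `Project` over an `OrcRelation`, which mirrors the change visible in the plan dumps.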
BEFORE
```
scala> hqlContext.sql("SELECT * FROM orc_table").explain(true)
== Parsed Logical Plan ==
'Project [unresolvedalias(*, None)]
+- 'UnresolvedRelation `orc_table`, None
== Analyzed Logical Plan ==
key: string, value: string
Project [key#171,value#172]
+- MetastoreRelation default, orc_table, None
== Optimized Logical Plan ==
MetastoreRelation default, orc_table, None
== Physical Plan ==
HiveTableScan [key#171,value#172], MetastoreRelation default, orc_table, None
```
AFTER
```
scala> hqlContext.sql("SELECT * FROM orc_table").explain(true)
== Parsed Logical Plan ==
'Project [unresolvedalias(*, None)]
+- 'UnresolvedRelation `orc_table`, None
== Analyzed Logical Plan ==
key: string, value: string
Project [key#76,value#77]
+- SubqueryAlias orc_table
   +- Relation[key#76,value#77] ORC part: struct<>, data: struct<key:string,value:string>
== Optimized Logical Plan ==
Relation[key#76,value#77] ORC part: struct<>, data: struct<key:string,value:string>
== Physical Plan ==
WholeStageCodegen
:  +- Scan ORC part: struct<>, data: struct<key:string,value:string>[key#76,value#77] InputPaths: file:/user/hive/warehouse/orc_table
```
## How was this patch tested?
- Added a new unit test; ran existing unit tests.
- Ran with production-like data.
## Performance gains
Ran on a production table at Facebook (note that the data was in the DWRF file
format, which is similar to ORC).
Best case: no rows match the query predicate, so everything is filtered out.
```
                     CPU time      Wall time   Total wall time across all tasks
================================================================================
Without the change   541,515 sec   25.0 mins   165.8 hours
With the change          407 sec    1.5 mins   15 mins
```
Average case: a subset of the rows in the data matches the query predicate.
```
                     CPU time      Wall time   Total wall time across all tasks
================================================================================
Without the change   624,630 sec   31.0 mins   199.0 hours
With the change       14,769 sec    5.3 mins     7.7 hours
```
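For reference, the speedup factors implied by the numbers above work out roughly as follows (plain arithmetic on the reported figures, no Spark involved):

```scala
// Rough speedup factors derived from the benchmark tables above.
val bestCaseCpuSpeedup  = 541515.0 / 407.0    // CPU time, best case
val bestCaseWallSpeedup = 25.0 / 1.5          // wall time, best case
val avgCaseCpuSpeedup   = 624630.0 / 14769.0  // CPU time, average case
val avgCaseTaskSpeedup  = 199.0 / 7.7         // total task wall time, average case

println(f"best case: ~$bestCaseCpuSpeedup%.0fx CPU, ~$bestCaseWallSpeedup%.1fx wall")
println(f"avg case:  ~$avgCaseCpuSpeedup%.0fx CPU, ~$avgCaseTaskSpeedup%.0fx task wall time")
```

That is, roughly three orders of magnitude less CPU in the best case, and about a 40x CPU reduction in the average case.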
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/tejasapatil/spark orc_ppd
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/11891.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #11891
----
commit 0938c082cc1cb0b129f41340f1e75767aedbc3e1
Author: Tejas Patil <[email protected]>
Date: 2016-03-22T16:44:14Z
Use ORC data source for SQL queries on ORC tables
----