GitHub user tejasapatil opened a pull request:
https://github.com/apache/spark/pull/11891
Use ORC data source for SQL queries on ORC tables
## What changes were proposed in this pull request?
This patch enables the use of `OrcRelation` for SQL queries that read data from
ORC-backed Hive tables. Changes in this patch:
- Added a new rule `OrcConversions` that rewrites the plan to use
`OrcRelation`. In this patch, the conversion is done only for reads.
- Added a new config `spark.sql.hive.convertMetastoreOrc` to enable or disable
the conversion.
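As a rough, self-contained sketch of what such a rewrite rule does (the node types below are simplified illustrative stand-ins, not Spark's real Catalyst classes, and the warehouse path is hypothetical):

```scala
// Illustrative sketch only: simplified stand-ins for Spark's plan nodes.
sealed trait LogicalPlan
case class MetastoreRelation(db: String, table: String) extends LogicalPlan
case class OrcRelation(paths: Seq[String]) extends LogicalPlan
case class Project(columns: Seq[String], child: LogicalPlan) extends LogicalPlan

object OrcConversions {
  // Stands in for the spark.sql.hive.convertMetastoreOrc flag.
  var convertMetastoreOrc: Boolean = true

  // Rewrite metastore ORC relations to the ORC data source (reads only).
  def apply(plan: LogicalPlan): LogicalPlan = plan match {
    case MetastoreRelation(_, table) if convertMetastoreOrc =>
      OrcRelation(Seq(s"/user/hive/warehouse/$table")) // hypothetical path
    case Project(cols, child) => Project(cols, apply(child))
    case other => other
  }
}
```

With the flag on, applying the rule to a `Project` over a `MetastoreRelation` yields a `Project` over an `OrcRelation`, which mirrors the change visible in the plan dumps.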
BEFORE
```
scala> hqlContext.sql("SELECT * FROM orc_table").explain(true)
== Parsed Logical Plan ==
'Project [unresolvedalias(*, None)]
+- 'UnresolvedRelation `orc_table`, None
== Analyzed Logical Plan ==
key: string, value: string
Project [key#171,value#172]
+- MetastoreRelation default, orc_table, None
== Optimized Logical Plan ==
MetastoreRelation default, orc_table, None
== Physical Plan ==
HiveTableScan [key#171,value#172], MetastoreRelation default, orc_table, None
```
AFTER
```
scala> hqlContext.sql("SELECT * FROM orc_table").explain(true)
== Parsed Logical Plan ==
'Project [unresolvedalias(*, None)]
+- 'UnresolvedRelation `orc_table`, None
== Analyzed Logical Plan ==
key: string, value: string
Project [key#76,value#77]
+- SubqueryAlias orc_table
   +- Relation[key#76,value#77] ORC part: struct<>, data: struct<key:string,value:string>
== Optimized Logical Plan ==
Relation[key#76,value#77] ORC part: struct<>, data: struct<key:string,value:string>
== Physical Plan ==
WholeStageCodegen
:  +- Scan ORC part: struct<>, data: struct<key:string,value:string>[key#76,value#77] InputPaths: file:/user/hive/warehouse/orc_table
```
## How was this patch tested?
- Added a new unit test; ran existing unit tests.
- Ran with production-like data.
## Performance gains
Ran on a production table at Facebook (note that the data was in the DWRF file
format, which is similar to ORC).
Best case: no rows match the query predicate, so everything is filtered out.
```
                     CPU time      Wall time   Total wall time across all tasks
================================================================================
Without the change   541,515 sec   25.0 mins   165.8 hours
With the change          407 sec    1.5 mins   15 mins
```
Average case: a subset of the rows in the data matches the query predicate.
```
                     CPU time      Wall time   Total wall time across all tasks
================================================================================
Without the change   624,630 sec   31.0 mins   199.0 hours
With the change       14,769 sec    5.3 mins     7.7 hours
```
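For reference, the speedup factors implied by the numbers above work out roughly as follows (plain arithmetic on the reported figures, no Spark involved):

```scala
// Rough speedup factors derived from the benchmark tables above.
val bestCaseCpuSpeedup  = 541515.0 / 407.0    // CPU time, best case
val bestCaseWallSpeedup = 25.0 / 1.5          // wall time, best case
val avgCaseCpuSpeedup   = 624630.0 / 14769.0  // CPU time, average case
val avgCaseTaskSpeedup  = 199.0 / 7.7         // total task wall time, average case

println(f"best case: ~$bestCaseCpuSpeedup%.0fx CPU, ~$bestCaseWallSpeedup%.1fx wall")
println(f"avg case:  ~$avgCaseCpuSpeedup%.0fx CPU, ~$avgCaseTaskSpeedup%.0fx task wall time")
```

That is, roughly three orders of magnitude less CPU in the best case, and about a 40x CPU reduction in the average case.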
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/tejasapatil/spark orc_ppd
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/11891.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #11891
----
commit 0938c082cc1cb0b129f41340f1e75767aedbc3e1
Author: Tejas Patil <[email protected]>
Date: 2016-03-22T16:44:14Z
Use ORC data source for SQL queries on ORC tables
----