[GitHub] spark pull request #17943: [SPARK-20682][SQL] Implement new ORC data source ...

dongjoon-hyun Wed, 10 May 2017 19:00:13 -0700

GitHub user dongjoon-hyun opened a pull request:

    https://github.com/apache/spark/pull/17943


    [SPARK-20682][SQL] Implement new ORC data source based on Apache ORC

    ## What changes were proposed in this pull request?
    
    Since [SPARK-2883](https://issues.apache.org/jira/browse/SPARK-2883), 
Apache Spark has supported Apache ORC inside the `sql/hive` module, with a Hive 
dependency. This issue aims to add a new, faster ORC data source inside 
`sql/core` and eventually to replace the old ORC data source. This PR uses the 
latest [Apache ORC 1.4.0](https://orc.apache.org/news/2017/05/08/ORC-1.4.0/) 
(released yesterday).
    
    There are four key benefits.
    
    - **Speed**: The goal is to use Spark `ColumnarBatch` and ORC `RowBatch` 
together. This PR uses only `RowBatch`, which is already faster than the current 
implementation in Spark. For `ColumnarBatch`, we need to benchmark and choose 
the fastest way to use it later. (Please refer to the discussion on #17924.)
    - **Stability**: Apache ORC 1.4.0 has many fixes, and we can rely more on 
the ORC community.
    - **Usability**: Users can use `ORC` data sources without the hive module, 
i.e., `-Phive`.
    - **Maintainability**: This reduces the Hive dependency and lets us remove 
old legacy code later.
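    
    As a rough illustration of the usability point, the sketch below writes and
    reads ORC from a session built without `enableHiveSupport()`. It assumes the
    new data source is reachable through the existing `orc` reader and writer
    methods; this PR may expose it under a different name. The object name and
    the path are hypothetical. On a build where ORC still lives only in
    `sql/hive`, the same code is expected to fail without Hive support.
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    object OrcWithoutHiveSketch {
      def main(args: Array[String]): Unit = {
        // No enableHiveSupport(): the point is that only sql/core is involved.
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("orc-without-hive-sketch")
          .getOrCreate()
    
        // Hypothetical scratch location, used only for this illustration.
        val path = "/tmp/orc-without-hive-sketch"
    
        // Write a small single-column table as ORC, then read it back.
        spark.range(0, 1000).toDF("id").write.mode("overwrite").orc(path)
        val df = spark.read.orc(path)
        df.filter("id % 2 = 0").show(5)
    
        spark.stop()
      }
    }
    ```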
    
    The following are two examples of comparisons from `OrcReadBenchmark.scala`.
    ```
    Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.12.4
    Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz
    
    SQL Single Int Column Scan:              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    SQL ORC Vectorized Reader                      278 /  320         56.5          17.7       1.0X
    SQL ORC MR Reader                              348 /  358         45.2          22.1       0.8X
    HIVE ORC MR Reader                             418 /  430         37.6          26.6       0.7X
    
    Partitioned Table:                       Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    SQL Read data column                           273 /  283         57.6          17.4       1.0X
    SQL Read partition column                      252 /  266         62.5          16.0       1.1X
    SQL Read both columns                          283 /  293         55.5          18.0       1.0X
    HIVE Read data column                          510 /  520         30.8          32.4       0.5X
    HIVE Read partition column                     420 /  425         37.5          26.7       0.7X
    HIVE Read both columns                         527 /  538         29.9          33.5       0.5X
    ```
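    
    To help read the columns, here is a small sketch of the relationships the
    published numbers appear to satisfy: `Per Row(ns)` is 1000 divided by
    `Rate(M/s)`, and `Relative` is the best time of the first row divided by the
    best time of the given row. This is inferred from the figures above and is
    not taken from the benchmark's source; the object and function names are
    hypothetical.
    
    ```scala
    object BenchmarkColumnsSketch {
      // Nanoseconds per row implied by a rate given in millions of rows per second.
      def perRowNs(rateMPerSec: Double): Double = 1000.0 / rateMPerSec
    
      // Speed relative to the baseline (first) row; values above 1.0 are faster.
      def relative(baselineBestMs: Double, bestMs: Double): Double =
        baselineBestMs / bestMs
    
      def main(args: Array[String]): Unit = {
        // (case name, best time in ms, rate in M rows/s) from the first table.
        val rows = Seq(
          ("SQL ORC Vectorized Reader", 278.0, 56.5),
          ("SQL ORC MR Reader", 348.0, 45.2),
          ("HIVE ORC MR Reader", 418.0, 37.6))
        val baselineBestMs = rows.head._2
    
        rows.foreach { case (name, bestMs, rate) =>
          val ns = perRowNs(rate)
          val rel = relative(baselineBestMs, bestMs)
          println(f"$name%-28s $ns%6.1f ns/row $rel%4.1fX")
        }
      }
    }
    ```
    
    Read this way, the 0.7X for the Hive reader means it takes roughly 1.5 times
    as long as the vectorized reader in the best case (418 ms against 278 ms).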
    
    ## How was this patch tested?
    
    Pass the Jenkins tests with newly added test suites in `sql/core`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongjoon-hyun/spark SPARK-20682-2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17943.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17943
    
----
commit 70bc00e3695ddec164ce626602a5e7f4b425f780
Author: Dongjoon Hyun <dongj...@apache.org>
Date:   2017-04-25T19:30:24Z

    [SPARK-20682][SQL] Implement new ORC data source based on Apache ORC

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
