[GitHub] spark pull request #13775: [SPARK-16060][SQL] Vectorized Orc reader

viirya Sun, 19 Jun 2016 20:05:17 -0700

GitHub user viirya opened a pull request:

    https://github.com/apache/spark/pull/13775


    [SPARK-16060][SQL] Vectorized Orc reader

    ## What changes were proposed in this pull request?
    
    Currently, the Orc reader in Spark SQL doesn't support vectorized reading. Since Hive's Orc reader already supports vectorization, we can add this support to improve Orc reading performance.
    
    ### Benchmark
    
    Benchmark code:
    
        test("Benchmark for Orc") {
          val N = 500 << 12
          withOrcTable((0 until N).map(i => (i, i.toString, i.toLong, i.toDouble)), "t") {
            val benchmark = new Benchmark("Orc reader", N)
            benchmark.addCase("reading Orc file", 10) { iter =>
              sql("SELECT * FROM t").collect()
            }
            benchmark.run()
          }
        }
    
    Before this patch:
    
        Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic
        Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
        Orc reader:                              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
        ------------------------------------------------------------------------------------------------
        reading Orc file                              4750 / 5266          0.4        2319.1       1.0X
    
    After this patch:
    
        Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic
        Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
        Orc reader:                              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
        ------------------------------------------------------------------------------------------------
        reading Orc file                              3550 / 3824          0.6        1733.2       1.0X
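
Because each run has a single benchmark case, the "Relative" column reads 1.0X in both tables, so the before/after speedup has to be worked out by hand. The following is a minimal sketch of that arithmetic using only the numbers reported above; the object and method names are illustrative and are not part of the patch. Per-row times derived from the rounded millisecond values (about 2319.3 ns and 1733.4 ns) differ slightly from the reported 2319.1 and 1733.2, presumably because the benchmark computes them from unrounded timings.

```scala
object OrcBenchmarkMath {
  // Row count used by the benchmark: 500 << 12 = 500 * 4096 = 2,048,000.
  val numRows: Long = 500L << 12

  // Nanoseconds spent per row for a run that took `millis` milliseconds.
  def perRowNanos(millis: Double, rows: Long): Double = millis * 1e6 / rows

  // Speedup of the patched reader relative to the unpatched one.
  def speedup(beforeMillis: Double, afterMillis: Double): Double =
    beforeMillis / afterMillis

  def main(args: Array[String]): Unit = {
    // Best/Avg times in milliseconds, copied from the two tables above.
    val (bestBefore, avgBefore) = (4750.0, 5266.0)
    val (bestAfter, avgAfter) = (3550.0, 3824.0)

    println(f"rows:            $numRows%d")
    println(f"per row before:  ${perRowNanos(bestBefore, numRows)}%.1f ns")
    println(f"per row after:   ${perRowNanos(bestAfter, numRows)}%.1f ns")
    println(f"speedup (best):  ${speedup(bestBefore, bestAfter)}%.2fx")
    println(f"speedup (avg):   ${speedup(avgBefore, avgAfter)}%.2fx")
  }
}
```

On these figures the patched reader is roughly 1.34x faster by best time and roughly 1.38x faster by average time.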
    
    
    
    ## How was this patch tested?
    Existing tests.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 vectorized-orc-reader3

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13775.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13775
    
----
commit 2861ac2a5136c065ec38cfc24bf9f979d5b7ae07
Author: Liang-Chi Hsieh <[email protected]>
Date:   2016-06-16T02:31:23Z

    Add vectorized Orc reader support.

commit eee8eca70920d624becb43c8510d217ce4d9820b
Author: Liang-Chi Hsieh <[email protected]>
Date:   2016-06-17T09:44:11Z

    import.

commit b753d09e3e369fc91a17d9632123dbe40d7d9dfb
Author: Liang-Chi Hsieh <[email protected]>
Date:   2016-06-18T10:00:00Z

    If column is repeating, always using row id 0.

commit 7d26f5ed785269299b324df8bfc1c64c2d4a2b48
Author: Liang-Chi Hsieh <[email protected]>
Date:   2016-06-19T04:16:49Z

    Fix bugs of getBinary and numFields.

commit 74fe936e522a827384461e445b9ba44f96ce29fe
Author: Liang-Chi Hsieh <[email protected]>
Date:   2016-06-20T02:44:07Z

    Remove unnecessary change.

commit 7e7bb6c57860187f391f66ca82cdd715d0b2be43
Author: Liang-Chi Hsieh <[email protected]>
Date:   2016-06-20T02:48:11Z

    Remove unnecessary change.

----
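
The commit "If column is repeating, always using row id 0." refers to a common convention in vectorized readers: when every row in a batch column holds the same value, only slot 0 of the backing array is populated and a flag marks the column as repeating. The sketch below illustrates that convention under stated assumptions. `LongColumn` and `RepeatingColumnDemo` are hypothetical names, loosely modelled on the `isRepeating` flag of Hive's column vectors, and are not the classes touched by this pull request.

```scala
// Hypothetical, simplified stand-in for one long-typed column of a batch.
// Assumption: mirrors the `isRepeating` convention only; it is not the
// column class used by the patch.
final case class LongColumn(values: Array[Long], isRepeating: Boolean) {
  // When the column is repeating, only slot 0 holds a valid value,
  // so every row id is redirected to index 0.
  def get(rowId: Int): Long = if (isRepeating) values(0) else values(rowId)
}

object RepeatingColumnDemo {
  // Sums the first `numRows` rows of a column through the accessor.
  def sum(col: LongColumn, numRows: Int): Long =
    (0 until numRows).map(col.get).sum

  def main(args: Array[String]): Unit = {
    val plain = LongColumn(Array(1L, 2L, 3L, 4L), isRepeating = false)
    val repeated = LongColumn(Array(7L), isRepeating = true)

    println(sum(plain, 4))    // 1 + 2 + 3 + 4 = 10
    println(sum(repeated, 4)) // 7 read four times from slot 0 = 28
  }
}
```

Indexing a repeating column by the actual row id would read past the single populated slot, which is the kind of bug that redirecting to row id 0 avoids.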


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

