GitHub user viirya opened a pull request:
https://github.com/apache/spark/pull/13775
[SPARK-16060][SQL] Vectorized Orc reader
## What changes were proposed in this pull request?
Currently, the Orc reader in Spark SQL does not support vectorized reading.
Since the Hive Orc reader already supports vectorization, we can build on it
to improve Orc read performance in Spark SQL.
### Benchmark
Benchmark code:
test("Benchmark for Orc") {
  val N = 500 << 12
  withOrcTable((0 until N).map(i => (i, i.toString, i.toLong, i.toDouble)), "t") {
    val benchmark = new Benchmark("Orc reader", N)
    benchmark.addCase("reading Orc file", 10) { iter =>
      sql("SELECT * FROM t").collect()
    }
    benchmark.run()
  }
}
Before this patch:
Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic
Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
Orc reader:                              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
reading Orc file                               4750 / 5266          0.4        2319.1       1.0X
After this patch:
Java HotSpot(TM) 64-Bit Server VM 1.8.0_71-b15 on Linux 3.19.0-25-generic
Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
Orc reader:                              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
reading Orc file                               3550 / 3824          0.6        1733.2       1.0X
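Because each run reports its own baseline, both tables show 1.0X in the Relative column. As a rough cross-check, the per-row cost and the before/after speedup can be recomputed from the reported best times. The following is only a sketch: the object `OrcBenchmarkSummary` and its members are hypothetical names for this illustration and are not part of the patch. The inputs are the row count from the benchmark code and the rounded best times from the two tables.

```scala
// Hypothetical helper, not part of the patch: recomputes figures
// from the rounded numbers printed in the two benchmark tables.
object OrcBenchmarkSummary {
  // Row count used by the benchmark code: 500 << 12 = 2,048,000.
  val n: Long = 500L << 12

  // Best times in milliseconds, copied from the result tables.
  val bestBeforeMs: Double = 4750.0
  val bestAfterMs: Double = 3550.0

  /** Nanoseconds spent per row for a given total time in milliseconds. */
  def perRowNs(totalMs: Double): Double = totalMs * 1e6 / n

  /** Ratio of the baseline best time to the patched best time. */
  def speedup: Double = bestBeforeMs / bestAfterMs

  def main(args: Array[String]): Unit = {
    println(s"rows: $n")
    println(f"per row before: ${perRowNs(bestBeforeMs)}%.1f ns")
    println(f"per row after:  ${perRowNs(bestAfterMs)}%.1f ns")
    println(f"speedup: $speedup%.2f")
  }
}
```

By this calculation the patched reader is roughly 1.34 times faster on this benchmark. The recomputed per-row values differ slightly from the tables because the printed best times are rounded to whole milliseconds.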
## How was this patch tested?
Existing tests.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/viirya/spark-1 vectorized-orc-reader3
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/13775.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #13775
----
commit 2861ac2a5136c065ec38cfc24bf9f979d5b7ae07
Author: Liang-Chi Hsieh <[email protected]>
Date: 2016-06-16T02:31:23Z
Add vectorized Orc reader support.
commit eee8eca70920d624becb43c8510d217ce4d9820b
Author: Liang-Chi Hsieh <[email protected]>
Date: 2016-06-17T09:44:11Z
import.
commit b753d09e3e369fc91a17d9632123dbe40d7d9dfb
Author: Liang-Chi Hsieh <[email protected]>
Date: 2016-06-18T10:00:00Z
If column is repeating, always using row id 0.
commit 7d26f5ed785269299b324df8bfc1c64c2d4a2b48
Author: Liang-Chi Hsieh <[email protected]>
Date: 2016-06-19T04:16:49Z
Fix bugs of getBinary and numFields.
commit 74fe936e522a827384461e445b9ba44f96ce29fe
Author: Liang-Chi Hsieh <[email protected]>
Date: 2016-06-20T02:44:07Z
Remove unnecessary change.
commit 7e7bb6c57860187f391f66ca82cdd715d0b2be43
Author: Liang-Chi Hsieh <[email protected]>
Date: 2016-06-20T02:48:11Z
Remove unnecessary change.
----