GitHub user dongjoon-hyun opened a pull request: https://github.com/apache/spark/pull/17943
[SPARK-20682][SQL] Implement new ORC data source based on Apache ORC

## What changes were proposed in this pull request?

Since [SPARK-2883](https://issues.apache.org/jira/browse/SPARK-2883), Apache Spark has supported Apache ORC inside the `sql/hive` module, with a Hive dependency. This issue aims to add a new and faster ORC data source inside `sql/core` and to eventually replace the old ORC data source. In this issue, the latest [Apache ORC 1.4.0](https://orc.apache.org/news/2017/05/08/ORC-1.4.0/) (released yesterday) is used.

There are four key benefits.

- **Speed**: The plan is to eventually use Spark `ColumnarBatch` and ORC `RowBatch` together. In this PR, only `RowBatch` is used, which is already faster than the current implementation in Spark. For `ColumnarBatch`, we still need to benchmark and choose the fastest way to use it. (Please refer to the discussion in #17924.)
- **Stability**: Apache ORC 1.4.0 has many fixes, and we can rely more on the ORC community.
- **Usability**: Users can use `ORC` data sources without the Hive module, i.e., without `-Phive`.
- **Maintainability**: This reduces the Hive dependency and lets us remove old legacy code later.

The following are two example comparisons from `OrcReadBenchmark.scala`.
```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.12.4
Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz

SQL Single Int Column Scan:              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
SQL ORC Vectorized Reader                        278 / 320         56.5          17.7       1.0X
SQL ORC MR Reader                                348 / 358         45.2          22.1       0.8X
HIVE ORC MR Reader                               418 / 430         37.6          26.6       0.7X

Partitioned Table:                       Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
SQL Read data column                             273 / 283         57.6          17.4       1.0X
SQL Read partition column                        252 / 266         62.5          16.0       1.1X
SQL Read both columns                            283 / 293         55.5          18.0       1.0X
HIVE Read data column                            510 / 520         30.8          32.4       0.5X
HIVE Read partition column                       420 / 425         37.5          26.7       0.7X
HIVE Read both columns                           527 / 538         29.9          33.5       0.5X
```

## How was this patch tested?

Pass the Jenkins tests with newly added test suites in `sql/core`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongjoon-hyun/spark SPARK-20682-2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17943.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17943

----
commit 70bc00e3695ddec164ce626602a5e7f4b425f780
Author: Dongjoon Hyun <dongj...@apache.org>
Date:   2017-04-25T19:30:24Z

    [SPARK-20682][SQL] Implement new ORC data source based on Apache ORC

----

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.