Awesome, Dong Joon. It's a great improvement. Looking forward to its merge.

Dong Joon Hyun <dh...@hortonworks.com> wrote on Wednesday, July 12, 2017 at 6:53 AM:

> Hi, All.
>
> Since the Apache Spark 2.2 vote passed successfully last week,
> I think it’s a good time for me to ask your opinions again about the
> following PR.
>
> https://github.com/apache/spark/pull/17980 (+3,887, −86)
>
> It’s for the following issues.
>
> - SPARK-20728: Make ORCFileFormat configurable between sql/hive and sql/core
> - SPARK-20682: Support a new faster ORC data source based on Apache ORC
>
> Basically, the approach is to officially use the latest Apache ORC 1.4.0.
> You can switch between the legacy ORC data source and the new ORC data source.
>
> Could you help me to progress this in order to improve Apache Spark 2.3?
>
> Bests,
> Dongjoon.
>
> *From: *Dong Joon Hyun <dh...@hortonworks.com>
> *Date: *Tuesday, May 9, 2017 at 6:15 PM
> *To: *"dev@spark.apache.org" <dev@spark.apache.org>
> *Subject: *Faster Spark on ORC with Apache ORC
>
> Hi, All.
>
> Apache Spark has always been a fast and general engine, and
> since SPARK-2883, Spark supports Apache ORC inside the `sql/hive` module
> with a Hive dependency.
>
> With Apache ORC 1.4.0 (released yesterday), we can make Spark on ORC
> faster and get some benefits.
>
> - Speed: Use both Spark `ColumnarBatch` and ORC `RowBatch` together,
>   which means full vectorization support.
>
> - Stability: Apache ORC 1.4.0 already has many fixes, and we can depend
>   on ORC community effort in the future.
>
> - Usability: Users can use `ORC` data sources without the hive module
>   (-Phive).
>
> - Maintainability: Reduce the Hive dependency and eventually remove
>   some old legacy code from the `sql/hive` module.
>
> As a first step, I made a PR adding a new ORC data source into the
> `sql/core` module.
>
> https://github.com/apache/spark/pull/17924 (+3,691 lines, −0)
>
> Could you give some opinions on this approach?
>
> Bests,
> Dongjoon.