Re: Use Apache ORC in Apache Spark 2.3

Owen O'Malley Fri, 04 Aug 2017 10:01:56 -0700

The ORC community is really eager to get this work integrated into Spark
so that Spark users can have fast access to their ORC data. Let us know if
we can help with the integration.


Thanks,
   Owen

On Fri, Aug 4, 2017 at 8:05 AM, Dong Joon Hyun <[email protected]>
wrote:

> Hi, All.
>
>
>
> Apache Spark has always been a fast and general engine, and it has
> supported Apache ORC inside the `sql/hive` module, with a Hive dependency,
> since Spark 1.4.X (SPARK-2883).
>
> However, there are many open issues about `Feature parity for ORC with
> Parquet (SPARK-20901)` as of today.
>
>
>
> With the new Apache ORC 1.4 (released 8th May), Apache Spark can gain
> the following benefits.
>
>
>
>     - Usability:
>
>         * Users can use `ORC` data sources without the Hive module (-Phive),
> just like the `Parquet` format.
>
>
>
>     - Stability & Maintainability:
>
>         * ORC 1.4 already has many fixes.
>
>         * In the future, Spark can upgrade the ORC library independently of
> Hive (similar to the Parquet library).
>
>         * Eventually, reduce the dependency on the old Hive 1.2.1.
>
>
>
>     - Speed:
>
>         * Last but not least, Spark can use both Spark `ColumnarBatch` and
> ORC `RowBatch` together, which means full vectorization support.
>
>
>
> First of all, I'd love to improve Apache Spark through the following steps
> within the Spark 2.3 time frame.
>
>
>
>     - SPARK-21422: Depend on Apache ORC 1.4.0
>
>     - SPARK-20682: Add a new faster ORC data source based on Apache ORC
>
>     - SPARK-20728: Make ORCFileFormat configurable between sql/hive and
> sql/core
>
>     - SPARK-16060: Vectorized Orc Reader
>
>
>
> I have been opening the above PRs since 9th May, the day after the Apache
> ORC 1.4 release, but the PRs seem to need more attention from the PMC since
> this is an important change.
>
> Since the discussion on the Apache Spark 2.3 cadence has already started
> this week, I thought this would be the best time to ask you about this.
>
>
>
> Could any of you help me move the ORC improvements forward in the Apache
> Spark community?
>
>
>
> Please visit the minimal PR and JIRA issue as a starting point.
>
>
>
>    - https://github.com/apache/spark/pull/18640
>    - https://issues.apache.org/jira/browse/SPARK-21422
>
>
>
> Thank you in advance.
>
>
>
> Bests,
>
> Dongjoon Hyun.
>
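
For readers unfamiliar with the "full vectorization" point in the quoted
message, here is a minimal sketch of the idea in plain Python. It is a toy
model only: it uses neither Spark's `ColumnarBatch` nor the ORC reader API,
and the names `to_batches`, `sum_row_at_a_time`, and `sum_vectorized` are
hypothetical helpers invented for this illustration. It contrasts visiting
one record at a time with handing a whole column of a batch to one operation.

```python
from typing import Dict, List

Row = Dict[str, int]
ColumnBatch = Dict[str, List[int]]


def sum_row_at_a_time(rows: List[Row], column: str) -> int:
    """Row-oriented scan: visit one whole record per step."""
    total = 0
    for row in rows:
        total += row[column]
    return total


def to_batches(rows: List[Row], batch_size: int) -> List[ColumnBatch]:
    """Pivot rows into fixed-size columnar batches (one list per column)."""
    batches = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        names = list(chunk[0].keys())
        batches.append({name: [r[name] for r in chunk] for name in names})
    return batches


def sum_vectorized(batches: List[ColumnBatch], column: str) -> int:
    """Columnar scan: process one column of a whole batch per step."""
    return sum(sum(batch[column]) for batch in batches)


# Toy data standing in for records read from a file (assumed schema).
rows = [{"id": i, "value": i * 10} for i in range(10)]
batches = to_batches(rows, batch_size=4)
```

Both functions return the same total; the difference is that the columnar
version hands a contiguous column to a single operation per batch, which is
the access pattern that batch-oriented readers are designed to exploit.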
