Github user dongjoon-hyun commented on a diff in the pull request:
https://github.com/apache/spark/pull/20484#discussion_r165568830
--- Diff: docs/sql-programming-guide.md ---
@@ -1776,6 +1776,77 @@ working with timestamps in `pandas_udf`s to get the best performance, see
## Upgrading From Spark SQL 2.2 to 2.3
+ - Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files and Hive ORC tables. To do that, the following configurations are newly added or have their default values changed.
+
+ <table class="table">
+ <tr>
+ <th>
+ <b>Property Name</b>
+ </th>
+ <th>
+ <b>Default</b>
+ </th>
+ <th>
+ <b>Meaning</b>
+ </th>
+ </tr>
+ <tr>
+ <td>
+ spark.sql.orc.impl
+ </td>
+ <td>
+ native
+ </td>
+ <td>
+ The name of the ORC implementation: 'native' means the native ORC support built on Apache ORC 1.4.1, and 'hive' means the ORC library in Hive 1.2.1. It is 'hive' by default prior to Spark 2.3.
+ </td>
+ </tr>
+ <tr>
+ <td>
+ spark.sql.orc.enableVectorizedReader
+ </td>
+ <td>
+ true
+ </td>
+ <td>
+ Enables vectorized ORC decoding in the 'native' implementation. If 'false', a new non-vectorized ORC reader is used in the 'native' implementation.
+ </td>
+ </tr>
+ <tr>
+ <td>
+ spark.sql.orc.columnarReaderBatchSize
+ </td>
+ <td>
+ 4096
+ </td>
+ <td>
+ The number of rows to include in an ORC vectorized reader batch. The number should be carefully chosen to minimize overhead and avoid OOMs when reading data.
+ </td>
+ </tr>
+ <tr>
+ <td>
+ spark.sql.orc.filterPushdown
+ </td>
+ <td>
+ true
+ </td>
+ <td>
+ Enables filter pushdown for ORC files. It is 'false' by default prior to Spark 2.3.
+ </td>
+ </tr>
+ <tr>
+ <td>
+ spark.sql.hive.convertMetastoreOrc
+ </td>
+ <td>
+ true
+ </td>
+ <td>
+ Enables the built-in ORC reader and writer to process Hive ORC tables, instead of the Hive SerDe. It is 'false' by default prior to Spark 2.3.
--- End diff --
I borrowed the wording from the following passage in the same doc.
> When reading from and writing to Hive metastore Parquet tables, Spark SQL will try to use its own Parquet support instead of Hive SerDe for better performance. This behavior is controlled by the spark.sql.hive.convertMetastoreParquet configuration, and is turned on by default.
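For reference, here is a minimal Scala sketch of how these settings could be applied at runtime through the standard `SparkSession` conf API. The file path and filter column are hypothetical, just for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("OrcConfigExample")
  .getOrCreate()

// Select the native ORC implementation (Apache ORC 1.4.1);
// "hive" falls back to the ORC library in Hive 1.2.1.
spark.conf.set("spark.sql.orc.impl", "native")

// Vectorized decoding and filter pushdown for ORC files.
spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
spark.conf.set("spark.sql.orc.filterPushdown", "true")

// Process Hive ORC tables with the built-in reader/writer
// instead of the Hive SerDe.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")

// Hypothetical ORC file; the filter below can be pushed down to the reader.
val df = spark.read.orc("/tmp/example.orc")
df.filter("id > 100").show()
```

The same keys can also be passed via `--conf` on `spark-submit`, which avoids hard-coding them in application code.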
---