Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20484#discussion_r165791487
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1776,6 +1776,42 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
     
     ## Upgrading From Spark SQL 2.2 to 2.3
     
    +  - Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files and Hive ORC tables. To support this, the following configurations are newly added or have changed default values.
    +
    +    - New configurations
    +
    +    <table class="table">
    +      <tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th></tr>
    +      <tr>
    +        <td><code>spark.sql.orc.impl</code></td>
    +        <td><code>native</code></td>
    +        <td>The name of the ORC implementation. It can be one of <code>native</code> and <code>hive</code>. <code>native</code> means the native ORC support that is built on Apache ORC 1.4.1. <code>hive</code> means the ORC library in Hive 1.2.1, which was used prior to Spark 2.3.</td>
    +      </tr>
    +      <tr>
    +        <td><code>spark.sql.orc.enableVectorizedReader</code></td>
    +        <td><code>true</code></td>
    +        <td>Enables vectorized ORC decoding in the <code>native</code> implementation. If <code>false</code>, a new non-vectorized ORC reader is used in the <code>native</code> implementation. For the <code>hive</code> implementation, this is ignored.</td>
    +      </tr>
    +    </table>
    +
    +    - Changed configurations
    +
    +    <table class="table">
    +      <tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th></tr>
    +      <tr>
    +        <td><code>spark.sql.orc.filterPushdown</code></td>
    +        <td><code>true</code></td>
    +        <td>Enables filter pushdown for ORC files. It was <code>false</code> by default prior to Spark 2.3.</td>
    +      </tr>
    +      <tr>
    +        <td><code>spark.sql.hive.convertMetastoreOrc</code></td>
    +        <td><code>true</code></td>
    +        <td>Enables Spark's native ORC support, which can be configured by <code>spark.sql.orc.impl</code>, instead of Hive SerDe when reading from and writing to Hive ORC tables. It was <code>false</code> by default prior to Spark 2.3.</td>
    +      </tr>
    +    </table>
    +
    +    - Since Apache ORC 1.4.1 is a standalone library that supports a subset of the Hive ORC-related configurations, you can use either the ORC configuration names or the corresponding Hive configuration names. For the full list of supported ORC configurations, see <a href="https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/OrcConf.java">OrcConf.java</a>.
    --- End diff ---
    
    We might need to explicitly mention that users have to specify the corresponding ORC configuration names when they explicitly or implicitly use the native reader.
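
    For instance, the migration note could carry a short, hedged snippet like the following sketch, using only the property names and defaults from the tables above, to show how a user would opt back into the pre-2.3 behavior:

    ```sql
    -- Sketch: revert to the pre-Spark-2.3 ORC behavior for a session.
    -- `hive` selects the Hive 1.2.1 ORC library instead of the native reader.
    SET spark.sql.orc.impl=hive;
    -- Read/write Hive ORC tables via Hive SerDe, as before Spark 2.3.
    SET spark.sql.hive.convertMetastoreOrc=false;
    ```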


---
