[GitHub] spark pull request #20484: [SPARK-23313][DOC] Add a migration guide for ORC

gatorsmile Sat, 03 Feb 2018 09:35:09 -0800

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20484#discussion_r165820198
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1776,6 +1776,42 @@ working with timestamps in `pandas_udf`s to get the best performance, see
     
     ## Upgrading From Spark SQL 2.2 to 2.3
     
    +  - Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files and Hive ORC tables. To do that, the following configurations are newly added or change their default values.
    +
    +    - New configurations
    +
    +    <table class="table">
    +      <tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th></tr>
    +      <tr>
    +        <td><code>spark.sql.orc.impl</code></td>
    +        <td><code>native</code></td>
    +        <td>The name of ORC implementation. It can be one of <code>native</code> and <code>hive</code>. <code>native</code> means the native ORC support that is built on Apache ORC 1.4.1. `hive` means the ORC library in Hive 1.2.1 which is used prior to Spark 2.3.</td>
    +      </tr>
    +      <tr>
    +        <td><code>spark.sql.orc.enableVectorizedReader</code></td>
    +        <td><code>true</code></td>
    +        <td>Enables vectorized orc decoding in <code>native</code> implementation. If <code>false</code>, a new non-vectorized ORC reader is used in <code>native</code> implementation. For <code>hive</code> implementation, this is ignored.</td>
    +      </tr>
    +    </table>
    +
    +    - Changed configurations
    +
    +    <table class="table">
    +      <tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th></tr>
    +      <tr>
    +        <td><code>spark.sql.orc.filterPushdown</code></td>
    +        <td><code>true</code></td>
    +        <td>Enables filter pushdown for ORC files. It is <code>false</code> by default prior to Spark 2.3.</td>
    +      </tr>
    +      <tr>
    +        <td><code>spark.sql.hive.convertMetastoreOrc</code></td>
    +        <td><code>true</code></td>
    +        <td>Enable the Spark's ORC support, which can be configured by <code>spark.sql.orc.impl</code>, instead of Hive SerDe when reading from and writing to Hive ORC tables. It is <code>false</code> by default prior to Spark 2.3.</td>
    +      </tr>
    +    </table>
    +
    +    - Since Apache ORC 1.4.1 is a standalone library providing a subset of Hive ORC related configurations, you can use ORC configuration name and Hive configuration name. To see a full list of supported ORC configurations, see <a href="https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/OrcConf.java">OrcConf.java</a>.
    --- End diff --
    
    Yes. We can check whether some important confs work.
    
    For example,
    ```SQL
    create table if not exists vectororc (s1 string, s2 string)
    stored as ORC tblproperties(
      "orc.row.index.stride"="1000",
      "hive.exec.orc.default.stripe.size"="100000",
      "orc.compress.size"="10000");
    ```
    
    After auto conversion, are these confs in tblproperties still used by our native readers?
    
    We also need to check whether these confs work correctly, and whether the confs set in the configuration are also recognized by our native readers.
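    
    One possible starting point for that check, as a rough sketch rather than a definitive test: switch between the converted path and the Hive SerDe path using the two confs named in the diff, and run the same statements against the `vectororc` table from the example. The inserted rows and the side-by-side comparison are assumptions made for illustration only.
    
    ```SQL
    -- Sketch only: the sample rows and the comparison steps are illustrative assumptions.
    -- Converted path: Spark's ORC support with the native implementation.
    SET spark.sql.hive.convertMetastoreOrc=true;
    SET spark.sql.orc.impl=native;
    
    -- Confirm the properties are recorded on the table.
    SHOW TBLPROPERTIES vectororc;
    
    -- Write and read through the converted path.
    INSERT INTO vectororc VALUES ('a', 'b');
    SELECT * FROM vectororc WHERE s1 = 'a';
    
    -- Hive SerDe path, for comparison.
    SET spark.sql.hive.convertMetastoreOrc=false;
    INSERT INTO vectororc VALUES ('c', 'd');
    SELECT * FROM vectororc WHERE s1 = 'c';
    ```
    
    These statements alone do not show whether the stripe size, row index stride, or compression size took effect. That would still require inspecting the metadata of the ORC files written by each path and comparing them.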


