Github user dongjoon-hyun commented on a diff in the pull request:
https://github.com/apache/spark/pull/20484#discussion_r166231000
--- Diff: docs/sql-programming-guide.md ---
@@ -1776,6 +1776,42 @@ working with timestamps in `pandas_udf`s to get the
best performance, see
## Upgrading From Spark SQL 2.2 to 2.3
+ - Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC
file format for ORC files and Hive ORC tables. To do that, the following
configurations are newly added or change their default values.
+
+ - New configurations
+
+ <table class="table">
+ <tr><th><b>Property
Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th></tr>
+ <tr>
+ <td><code>spark.sql.orc.impl</code></td>
+ <td><code>native</code></td>
+ <td>The name of ORC implementation. It can be one of
<code>native</code> and <code>hive</code>. <code>native</code> means the native
ORC support that is built on Apache ORC 1.4.1. `hive` means the ORC library in
Hive 1.2.1 which is used prior to Spark 2.3.</td>
+ </tr>
+ <tr>
+ <td><code>spark.sql.orc.enableVectorizedReader</code></td>
+ <td><code>true</code></td>
+ <td>Enables vectorized orc decoding in <code>native</code>
implementation. If <code>false</code>, a new non-vectorized ORC reader is used
in <code>native</code> implementation. For <code>hive</code> implementation,
this is ignored.</td>
+ </tr>
+ </table>
+
+ - Changed configurations
+
+ <table class="table">
+ <tr><th><b>Property
Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th></tr>
+ <tr>
+ <td><code>spark.sql.orc.filterPushdown</code></td>
+ <td><code>true</code></td>
+ <td>Enables filter pushdown for ORC files. It is
<code>false</code> by default prior to Spark 2.3.</td>
+ </tr>
+ <tr>
+ <td><code>spark.sql.hive.convertMetastoreOrc</code></td>
+ <td><code>true</code></td>
+ <td>Enable the Spark's ORC support, which can be configured by
<code>spark.sql.orc.impl</code>, instead of Hive SerDe when reading from and
writing to Hive ORC tables. It is <code>false</code> by default prior to Spark
2.3.</td>
+ </tr>
+ </table>
+
+ - Since Apache ORC 1.4.1 is a standalone library providing a subset of
Hive ORC related configurations, you can use ORC configuration name and Hive
configuration name. To see a full list of supported ORC configurations, see <a
href="https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/OrcConf.java">OrcConf.java</a>.
--- End diff --
Sorry for the late response, @gatorsmile.
Here, your example is a mixed scenario. First of all, I opened a PR,
https://github.com/apache/spark/pull/20517, titled "Add ORC configuration tests
for ORC data source". It adds test coverage for ORC and Hive configuration
names for the `native` and `hive` OrcFileFormat. The PR focuses on name
compatibility for those important confs.
For `convertMetastoreOrc`, the table properties are retained when we check
by using
`spark.sessionState.catalog.getTableMetadata(TableIdentifier(tableName))`.
However, they seem to be ignored in some cases. I guess the same happens with Parquet.
I'm working on it separately.
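
For reference, the metadata check mentioned above could look roughly like this in `spark-shell`. This is only a sketch, assuming a Hive-enabled session bound to `spark`; the table name `orc_tbl` and the `orc.compress` table property are hypothetical and chosen purely for illustration.

```scala
import org.apache.spark.sql.catalyst.TableIdentifier

// Hypothetical table name, used only for illustration.
val tableName = "orc_tbl"

// Create a Hive ORC table that carries an ORC table property (assumed example).
spark.sql(
  s"CREATE TABLE $tableName (id INT) STORED AS ORC " +
    "TBLPROPERTIES ('orc.compress'='SNAPPY')")

// Use Spark's ORC support for Hive ORC tables, with the native implementation.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")
spark.conf.set("spark.sql.orc.impl", "native")

// Look up the catalog metadata and print the stored table property, if any.
val metadata =
  spark.sessionState.catalog.getTableMetadata(TableIdentifier(tableName))
println(metadata.properties.get("orc.compress"))
```

This only shows whether the property is kept in the catalog metadata; it does not show whether the property is honored when writing through the converted code path.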