Github user dongjoon-hyun commented on a diff in the pull request:
https://github.com/apache/spark/pull/20484#discussion_r165568830
--- Diff: docs/sql-programming-guide.md ---
@@ -1776,6 +1776,77 @@ working with timestamps in `pandas_udf`s to get the best performance, see
## Upgrading From Spark SQL 2.2 to 2.3
+ - Since Spark 2.3, Spark supports a vectorized ORC reader with a new ORC file format for ORC files and Hive ORC tables. To do that, the following configurations are newly added or have their default values changed.
+
+ <table class="table">
+ <tr>
+ <th>
+ <b>Property Name</b>
+ </th>
+ <th>
+ <b>Default</b>
+ </th>
+ <th>
+ <b>Meaning</b>
+ </th>
+ </tr>
+ <tr>
+ <td>
+ spark.sql.orc.impl
+ </td>
+ <td>
+ native
+ </td>
+ <td>
+ The name of the ORC implementation: 'native' means the native ORC support built on Apache ORC 1.4.1, and 'hive' means the ORC library in Hive 1.2.1. It is 'hive' by default prior to Spark 2.3.
+ </td>
+ </tr>
+ <tr>
+ <td>
+ spark.sql.orc.enableVectorizedReader
+ </td>
+ <td>
+ true
+ </td>
+ <td>
+ Enables vectorized ORC decoding in the 'native' implementation. If 'false', a new non-vectorized ORC reader is used in the 'native' implementation.
+ </td>
+ </tr>
+ <tr>
+ <td>
+ spark.sql.orc.columnarReaderBatchSize
+ </td>
+ <td>
+ 4096
+ </td>
+ <td>
+ The number of rows to include in an ORC vectorized reader batch. The number should be carefully chosen to minimize overhead and avoid OOMs when reading data.
+ </td>
+ </tr>
+ <tr>
+ <td>
+ spark.sql.orc.filterPushdown
+ </td>
+ <td>
+ true
+ </td>
+ <td>
+ Enables filter pushdown for ORC files. It is 'false' by default prior to Spark 2.3.
+ </td>
+ </tr>
+ <tr>
+ <td>
+ spark.sql.hive.convertMetastoreOrc
+ </td>
+ <td>
+ true
+ </td>
+ <td>
+ Enables the built-in ORC reader and writer to process Hive ORC tables, instead of the Hive SerDe. It is 'false' by default prior to Spark 2.3.
--- End diff --
I borrowed the wording from the following passage in the same doc.
> When reading from and writing to Hive metastore Parquet tables, Spark SQL will try to use its own Parquet support instead of Hive SerDe for better performance. This behavior is controlled by the spark.sql.hive.convertMetastoreParquet configuration, and is turned on by default.
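For reference, here is a minimal Scala sketch of how these settings could be applied at runtime through the standard `SparkSession` conf API. The file path and filter column are hypothetical, just for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("OrcConfigExample")
  .getOrCreate()

// Select the native ORC implementation (Apache ORC 1.4.1);
// "hive" falls back to the ORC library in Hive 1.2.1.
spark.conf.set("spark.sql.orc.impl", "native")

// Vectorized decoding and filter pushdown for ORC files.
spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
spark.conf.set("spark.sql.orc.filterPushdown", "true")

// Process Hive ORC tables with the built-in reader/writer
// instead of the Hive SerDe.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")

// Hypothetical ORC file; the filter below can be pushed down to the reader.
val df = spark.read.orc("/tmp/example.orc")
df.filter("id > 100").show()
```

The same keys can also be passed via `--conf` on `spark-submit`, which avoids hard-coding them in application code.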
---