Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/22453#discussion_r219895047
--- Diff: docs/sql-programming-guide.md ---
@@ -1002,6 +1002,21 @@ Configuration of Parquet can be done using the `setConf` method on `SparkSession
</p>
</td>
</tr>
+<tr>
+ <td><code>spark.sql.parquet.writeLegacyFormat</code></td>
+ <td>false</td>
+ <td>
+ This configuration indicates whether Spark should write Parquet files in the legacy format
+ adopted by Spark 1.4 and prior versions, or in the standard format defined in the parquet-format
+ specification. This matters not only for compatibility with older Spark versions, but also for
+ other systems such as Hive, Impala, and Presto. It is especially important for decimals: if this
+ configuration is not enabled, Spark 1.5 and above write decimals in an int-based format, and
+ other systems that only support the legacy decimal format (fixed-length byte array) will not be
+ able to read what Spark has written. Note that other systems may have added support for the
+ standard format in more recent versions, which would make this configuration unnecessary. Please
--- End diff --
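
For reference, on the write side the flag would be used something like this (this assumes a `SparkSession` named `spark`, e.g. in spark-shell; the path is just for illustration):

```scala
// Sketch: enable the legacy format so that readers which only understand the
// old decimal encoding (fixed-length byte array) can consume the files.
// Assumes a SparkSession named `spark`; the output path is made up.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")

spark.sql("SELECT CAST(1.23 AS DECIMAL(10, 2)) AS amount")
  .write
  .parquet("/tmp/legacy_decimal_output")
```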
I haven't checked closely but I think Hive still uses binary for decimals
(https://github.com/apache/hive/blob/ae008b79b5d52ed6a38875b73025a505725828eb/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java#L503-L541).
Given my past investigation, the thing is, Parquet supports both ways of writing decimals out
(https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal)
IIRC. They deprecated timestamps based on int96
(https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L782)
but not decimals.
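
To make the decimal part concrete, here is roughly what I would expect the physical types to be with and without the flag (based on my reading of Spark's Parquet writer, so worth double-checking; paths and column names are made up):

```scala
// Assumes a SparkSession named `spark` (e.g. spark-shell); paths are illustrative.
val df = spark.sql(
  "SELECT CAST(123.45 AS DECIMAL(9, 2)) AS small_dec, CAST(123.45 AS DECIMAL(18, 2)) AS big_dec")

// Standard format (the parquet-format spec way): small-precision decimals
// should come out as INT32/INT64 columns annotated as DECIMAL.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "false")
df.write.parquet("/tmp/decimals_standard")

// Legacy format (Spark 1.4 and earlier): decimals should come out as
// FIXED_LEN_BYTE_ARRAY, which is what the Hive writer linked above produces too.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")
df.write.parquet("/tmp/decimals_legacy")
```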
---