Github user seancxmao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22453#discussion_r220407692
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1002,6 +1002,21 @@ Configuration of Parquet can be done using the `setConf` method on `SparkSession
         </p>
       </td>
     </tr>
    +<tr>
    +  <td><code>spark.sql.parquet.writeLegacyFormat</code></td>
    +  <td>false</td>
    +  <td>
    +    This configuration indicates whether we should use legacy Parquet format adopted by Spark 1.4
    +    and prior versions or the standard format defined in parquet-format specification to write
    +    Parquet files. This is not only related to compatibility with old Spark ones, but also other
    +    systems like Hive, Impala, Presto, etc. This is especially important for decimals. If this
    +    configuration is not enabled, decimals will be written in int-based format in Spark 1.5 and
    +    above, other systems that only support legacy decimal format (fixed length byte array) will not
    +    be able to read what Spark has written. Note other systems may have added support for the
    +    standard format in more recent versions, which will make this configuration unnecessary. Please
    --- End diff --
    
    If we must call it "legacy", I'd think of it as a legacy implementation on the Spark side, rather than a legacy format on the Parquet side.
    As commented in [SPARK-20297](https://issues.apache.org/jira/browse/SPARK-20297?focusedCommentId=15975559&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15975559):
    > The standard doesn't say that smaller decimals have to be stored in int32/int64, it just is an option for subset of decimal types. int32 and int64 are valid representations for a subset of decimal types. fixed_len_byte_array and binary are a valid representation of any decimal type.
    >
    > The int32/int64 options were present in the original version of the decimal spec, they just weren't widely implemented. So its not a new/old version thing, it was just an alternative representation that many systems didn't implement.
    
    Anyway, it really leads to confusion.
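
    To make the point about decimals concrete, here is a rough sketch (run in spark-shell; the column name and output paths are made-up examples) of writing the same small decimal with the option off and on, so the resulting physical types can be compared with e.g. `parquet-tools schema`:

    ```scala
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    // Made-up example: DecimalType(9, 2) has precision <= 9, i.e. one of the
    // "smaller decimals" that the spec optionally allows to be stored as int32.
    val schema = StructType(Seq(StructField("price", DecimalType(9, 2))))
    val df = spark.createDataFrame(
      spark.sparkContext.parallelize(Seq(Row(new java.math.BigDecimal("12.34")))),
      schema)

    // Standard format (the default, writeLegacyFormat=false).
    spark.conf.set("spark.sql.parquet.writeLegacyFormat", "false")
    df.write.mode("overwrite").parquet("/tmp/decimal_standard")

    // Legacy format: decimals written as fixed-length byte arrays, which readers
    // that only understand the legacy decimal representation can still consume.
    spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")
    df.write.mode("overwrite").parquet("/tmp/decimal_legacy")
    ```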
    
    I really appreciate your suggestion, @srowen, to make the doc shorter; the doc you suggested is more concise and to the point.
    
    One more thing I want to discuss: after investigating the usage of this option, I found it is related not only to decimals but also to complex types (Array, Map); see the source code below. Should we mention this in the doc?
    
    
https://github.com/apache/spark/blob/473d0d862de54ec1c7a8f0354fa5e06f3d66e455/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L450-L458
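
    For illustration, a quick sketch (again in spark-shell, with made-up column names and paths) of writing ARRAY/MAP columns under both settings, which exercises the converter code linked above:

    ```scala
    // Needs the implicits from the active SparkSession (available in spark-shell).
    import spark.implicits._

    // Made-up example with one array column and one map column.
    val complexDf = Seq(
      (Seq(1, 2, 3), Map("a" -> 1, "b" -> 2))
    ).toDF("xs", "kv")

    // Legacy layout for the nested LIST/MAP groups (what Spark 1.4 and prior wrote).
    spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")
    complexDf.write.mode("overwrite").parquet("/tmp/complex_legacy")

    // Standard layout as defined in the parquet-format specification.
    spark.conf.set("spark.sql.parquet.writeLegacyFormat", "false")
    complexDf.write.mode("overwrite").parquet("/tmp/complex_standard")

    // Comparing the two outputs with `parquet-tools schema` shows how the
    // repeated groups are nested differently.
    ```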



---
