bbraams commented on a change in pull request #26804:
URL: https://github.com/apache/spark/pull/26804#discussion_r564097098
##########
File path:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
##########
@@ -127,6 +127,9 @@ class ParquetFileFormat
conf.setEnum(ParquetOutputFormat.JOB_SUMMARY_LEVEL, JobSummaryLevel.NONE)
}
+ // PARQUET-1746: Disables page-level CRC checksums by default.
+ conf.setBooleanIfUnset(ParquetOutputFormat.PAGE_WRITE_CHECKSUM_ENABLED, false)
Review comment:
@wangyum Could you elaborate on this a bit more? Are we convinced that the
issue you pointed out in
https://github.com/apache/spark/pull/26804#discussion_r561044576 is actually a
regression caused by Parquet, and not a problem with the test itself (e.g.
one caused by non-trivial assumptions it makes about the output files)?
Considering the benefit of having checksums enabled by default (e.g. much
better visibility into hard-to-debug data corruption issues), I'd propose
investigating further before disabling the feature entirely and having Spark
diverge from the `parquet-mr` defaults.
Regarding the defaults, support for checksums was added back in
[PARQUET-1580](https://github.com/apache/parquet-mr/pull/647). These changes
were included and released with `parquet-mr` 1.11.0 (see
[CHANGES](https://github.com/apache/parquet-mr/blob/apache-parquet-1.11.0/CHANGES.md#version-1110)),
and writing out checksums has been enabled by default since that release; see
`ParquetProperties.java` in:
* [master](https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L61)
* [apache-parquet-1.11.0](https://github.com/apache/parquet-mr/blob/apache-parquet-1.11.0/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L54)
* [apache-parquet-1.11.1](https://github.com/apache/parquet-mr/blob/apache-parquet-1.11.1/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L54)
I also noticed that
[PARQUET-1746](https://issues.apache.org/jira/browse/PARQUET-1746) was raised
and [a PR](https://github.com/apache/parquet-mr/pull/857) was opened for it to
set the default to `false`, but that the issue has already been marked as
resolved and the PR closed without merging the changes.
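To make the corruption-visibility argument concrete, here is a minimal sketch of what a page-level checksum buys us. It is not the `parquet-mr` code path: it assumes a standard CRC32 (via `java.util.zip.CRC32`) computed over a page's bytes at write time and recomputed at read time. The class name `PageChecksumSketch`, its methods, and the payload are hypothetical and exist only for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical illustration only; not how parquet-mr structures its checksum code.
public class PageChecksumSketch {

    // Computes a CRC32 over the given bytes, which stand in for a serialized page.
    static long crc32(byte[] page) {
        CRC32 crc = new CRC32();
        crc.update(page, 0, page.length);
        return crc.getValue();
    }

    // Returns true when the checksum stored at write time matches the bytes read back.
    static boolean verify(byte[] page, long storedChecksum) {
        return crc32(page) == storedChecksum;
    }

    public static void main(String[] args) {
        // Assumed stand-in for page data; real pages are binary, not text.
        byte[] page = "hypothetical page payload".getBytes(StandardCharsets.UTF_8);
        long stored = crc32(page); // what a writer would persist next to the page

        // Simulate a single flipped bit somewhere between write and read.
        byte[] corrupted = page.clone();
        corrupted[3] ^= 0x01;

        System.out.println("intact page verifies:    " + verify(page, stored));
        System.out.println("corrupted page verifies: " + verify(corrupted, stored));
    }
}
```

Without a stored checksum, the flipped bit above would either surface as an unrelated decoding error or pass through silently as wrong data, which is the class of issue that is hard to trace back to its cause.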