gszadovszky commented on a change in pull request #26804:
URL: https://github.com/apache/spark/pull/26804#discussion_r561694102
##########
File path:
sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamSuite.scala
##########
@@ -225,7 +226,9 @@ class StreamSuite extends StreamTest {
val df = spark.readStream.format(classOf[FakeDefaultSource].getName).load()
Seq("", "parquet").foreach { useV1Source =>
- withSQLConf(SQLConf.USE_V1_SOURCE_LIST.key -> useV1Source) {
+ withSQLConf(
+ SQLConf.USE_V1_SOURCE_LIST.key -> useV1Source,
+ ParquetOutputFormat.PAGE_WRITE_CHECKSUM_ENABLED -> "false") {
Review comment:
@wangyum, I've checked the code change of PARQUET-1580 (again) and still
don't understand why it would cause such an issue. Disabling the CRC write
only skips writing an optional field in the page headers; it should not
affect any kind of ordering. If it really does, then that ordering relies on
parameters it shouldn't rely on, and any other potential change in the file
metadata might break it as well.
Maybe I'm overlooking something in our code base, so any comment is welcome,
but if not I would suggest revisiting these unit tests.
Meanwhile, I am not experienced with the Spark code, so if you are fine with
this workaround in a unit test I am not against it.
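For reference, a minimal sketch of the workaround discussed above. This assumes a running `SparkSession` named `spark` and a DataFrame `df`; the config key is the value of `ParquetOutputFormat.PAGE_WRITE_CHECKSUM_ENABLED` from parquet-hadoop (the page-level CRC feature introduced by PARQUET-1580):

```scala
// Sketch only: disable page-level CRC checksums for Parquet writes.
// The CRC is an optional field in each page header; omitting it should
// not change the data layout, which is the point of the comment above.
spark.conf.set("parquet.page.write-checksum.enabled", "false")

// Subsequent Parquet writes in this session omit the per-page CRC field.
df.write.parquet("/tmp/out")
```

In the test itself, `withSQLConf(...)` is preferable since it restores the previous value after the enclosed block, keeping the override scoped to the test.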
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]