yujhe commented on PR #40091:
URL: https://github.com/apache/spark/pull/40091#issuecomment-1786538476

   We found that this happens when reading a Parquet file with nested 
columns in its schema.
   
   ```scala
   // implicit conversions needed for .toDF
   import spark.implicits._

   val path = "/tmp/parquet_zstd"
   (1 to 100).map(i => (i, Seq(i)))
     .toDF("id", "value")
     .repartition(1)
     .write
     .mode("overwrite")
     .parquet(path)
   
   val df = spark.read.parquet(path)
   df.write.mode("overwrite").parquet("/tmp/dummy")
   ```
   
   After tracing the code, we found that 
[ParquetCodecFactory](https://github.com/apache/spark/blob/branch-3.3/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetCodecFactory.java#L40)
 is only used by 
[VectorizedParquetRecordReader](https://github.com/apache/spark/blob/branch-3.3/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L171),
 which requires the [vectorized reader to be 
enabled](https://github.com/apache/spark/blob/branch-3.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L355).
   
   However, for schemas with nested columns, the vectorized reader is disabled 
by default in Spark 3.3 
(`spark.sql.parquet.enableNestedColumnVectorizedReader=false`), so this 
workaround does not apply in that case.
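   
   A minimal sketch of a way to check this, assuming a Spark 3.3+ session where 
this config exists: enabling the nested-column vectorized reader should route 
the read through `VectorizedParquetRecordReader`, so `ParquetCodecFactory` is 
applied again.
   
   ```scala
   // Assumption: Spark 3.3+, where this config is available (default: false).
   // With it enabled, reads of schemas with nested columns go through
   // VectorizedParquetRecordReader, which uses ParquetCodecFactory.
   spark.conf.set("spark.sql.parquet.enableNestedColumnVectorizedReader", "true")

   val df = spark.read.parquet("/tmp/parquet_zstd")
   df.write.mode("overwrite").parquet("/tmp/dummy")
   ```
   
   Note that enabling this config may have other performance implications for 
nested-column reads, so it is a diagnostic rather than a general fix.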


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
