yujhe commented on PR #40091:
URL: https://github.com/apache/spark/pull/40091#issuecomment-1786538476
We found that this happens when reading a Parquet file with nested
columns in its schema.
```scala
val path = "/tmp/parquet_zstd"
(1 to 100).map(i => (i, Seq(i)))
.toDF("id", "value")
.repartition(1)
.write
.mode("overwrite")
.parquet(path)
val df = spark.read.parquet(path)
df.write.mode("overwrite").parquet("/tmp/dummy")
```
After tracing the code, we found that
[ParquetCodecFactory](https://github.com/apache/spark/blob/branch-3.3/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetCodecFactory.java#L40)
is only used by
[VectorizedParquetRecordReader](https://github.com/apache/spark/blob/branch-3.3/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L171),
i.e. only when the [vectorized reader is
enabled](https://github.com/apache/spark/blob/branch-3.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L355).
However, for a schema with nested columns, the vectorized reader is disabled
by default in Spark 3.3
(`spark.sql.parquet.enableNestedColumnVectorizedReader=false`), so this
workaround does not take effect in that case.
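As a sketch of a possible mitigation (assuming the Spark 3.3 config key quoted above, and that the vectorized reader handles the nested schema in question), one could enable the nested-column vectorized reader before reading, so that the `VectorizedParquetRecordReader` path, and hence `ParquetCodecFactory`, is used:

```scala
// Sketch: enable the nested-column vectorized reader (off by default
// in Spark 3.3) so reads of nested schemas go through
// VectorizedParquetRecordReader and pick up ParquetCodecFactory.
spark.conf.set("spark.sql.parquet.enableNestedColumnVectorizedReader", "true")

// Re-run the repro above; the read should now use the vectorized reader.
val df = spark.read.parquet("/tmp/parquet_zstd")
df.write.mode("overwrite").parquet("/tmp/dummy")
```

This is a session-level configuration change, not a code fix; it only sidesteps the issue for workloads where the nested-column vectorized reader is acceptable.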
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]