Github user mallman commented on a diff in the pull request:
https://github.com/apache/spark/pull/16578#discussion_r148722822
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala ---
@@ -127,8 +127,8 @@ private[parquet] class ParquetRowConverter(
extends ParquetGroupConverter(updater) with Logging {
assert(
- parquetType.getFieldCount == catalystType.length,
- s"""Field counts of the Parquet schema and the Catalyst schema don't match:
+ parquetType.getFieldCount <= catalystType.length,
--- End diff ---
In `ParquetReadSupport.scala`, when `parquetMrCompatibility` is `true`, we
intersect the clipped parquet schema with the underlying parquet file's schema.
This can result in a requested parquet schema with fewer fields than the
requested catalyst schema.
For example, in the case of a partitioned table where we select a column
that doesn't exist in the schema of one partition's files, we will remove that
missing column from the requested parquet schema.
This scenario is illustrated and tested by the "partial schema intersection
- select missing subfield" test in `ParquetSchemaPruningSuite.scala`.
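To make the shape of this concrete, here is a minimal standalone sketch (the object, method, and field names are illustrative, not Spark's actual API): intersecting the requested fields with one file's fields drops any column the file lacks, so the resulting field count can only be less than or equal to the Catalyst schema's, which is exactly why the assertion is relaxed from `==` to `<=`.

```scala
// Hypothetical sketch of the schema intersection, with schemas modeled as
// plain field-name sequences. Names here are illustrative only.
object SchemaIntersectionSketch {
  // Catalyst (requested) schema: the columns the query selects.
  val catalystFields: Seq[String] = Seq("id", "name", "address")

  // One partition's Parquet file is missing the "address" column.
  val parquetFileFields: Seq[String] = Seq("id", "name")

  // Keep only the requested fields that the file actually contains.
  def intersect(requested: Seq[String], fileFields: Seq[String]): Seq[String] =
    requested.filter(fileFields.contains)

  def main(args: Array[String]): Unit = {
    val requestedParquetFields = intersect(catalystFields, parquetFileFields)
    // The relaxed invariant from the diff: <= rather than ==.
    assert(requestedParquetFields.length <= catalystFields.length)
    println(requestedParquetFields.mkString(","))  // id,name
  }
}
```

Under this sketch, the missing `address` column is later filled with nulls on the Catalyst side rather than read from the file, which is what the "select missing subfield" test exercises.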
---