HyukjinKwon commented on a change in pull request #24307: [SPARK-25407][SQL]
Ensure we pass a compatible pruned schema to ParquetRowConverter
URL: https://github.com/apache/spark/pull/24307#discussion_r272869533
##########
File path:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetReadSupport.scala
##########
@@ -322,6 +341,32 @@ private[parquet] object ParquetReadSupport {
}
}
+ /**
+ * Computes the structural intersection between two Parquet group types.
+ */
+ private def intersectParquetGroups(
Review comment:
So, am I understanding it correctly as below?
```
Parquet file schema:
message spark_schema {
required int32 id;
optional group name {
optional binary first (UTF8);
optional binary last (UTF8);
}
optional binary address (UTF8);
}
```
```
Parquet clipped schema:
message spark_schema {
optional group name {
optional binary middle (UTF8);
}
optional binary address (UTF8);
}
```
```
Parquet requested schema:
message spark_schema {
optional binary address (UTF8);
}
```
```
Catalyst requested schema:
root
 |-- name: struct (nullable = true)
 |    |-- middle: string (nullable = true)
 |-- address: string (nullable = true)
```
Parquet MR does not support access to a nested non-existent field
(`name.middle`), so we do not request `name` at all and permissively produce
`null` on the Spark side?
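To check my understanding of the intersection step, here is a minimal,
self-contained sketch (using hypothetical `Primitive`/`Group` types, not
Spark's or Parquet's actual classes) of the idea: keep a clipped field only if
it also exists in the file schema, recurse into nested groups, and drop any
group that becomes empty.

```scala
// Hypothetical toy model of a Parquet schema tree, for illustration only.
sealed trait PType
case class Primitive(name: String) extends PType
case class Group(name: String, fields: List[PType]) extends PType

def fieldName(t: PType): String = t match {
  case Primitive(n) => n
  case Group(n, _)  => n
}

// Intersect `clipped` with `file`: drop clipped fields (like `name.middle`)
// that have no counterpart in the file schema, and drop groups left empty.
def intersect(file: Group, clipped: Group): Group = {
  val byName = file.fields.map(f => fieldName(f) -> f).toMap
  val kept = clipped.fields.flatMap { c =>
    (byName.get(fieldName(c)), c) match {
      case (Some(fg: Group), cg: Group) =>
        val g = intersect(fg, cg)
        if (g.fields.nonEmpty) Some(g) else None
      case (Some(_: Primitive), p: Primitive) => Some(p)
      case _ => None  // absent or kind mismatch: drop, Spark fills null later
    }
  }
  Group(clipped.name, kept)
}

// The example above: `name.middle` is absent from the file schema, so the
// whole `name` group drops out and only `address` remains.
val fileSchema = Group("spark_schema", List(
  Primitive("id"),
  Group("name", List(Primitive("first"), Primitive("last"))),
  Primitive("address")))
val clippedSchema = Group("spark_schema", List(
  Group("name", List(Primitive("middle"))),
  Primitive("address")))

println(intersect(fileSchema, clippedSchema))
// Group(spark_schema,List(Primitive(address)))
```

Under that reading, the requested schema handed to Parquet MR contains only
`address`, and `name` is reconstructed as `null` on the Spark side.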
If so, I think this is technically something the Parquet library should
support, since the Parquet library made a design decision to produce `null`
for non-existent fields.
If those are all correct, I am fine with going ahead with this as a workaround
for the Parquet-side issue (it would be good to file an issue on the Parquet
side as well).