Zouxxyy commented on code in PR #49234:
URL: https://github.com/apache/spark/pull/49234#discussion_r1897969339
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetUtils.scala:
##########
@@ -420,6 +420,22 @@ object ParquetUtils extends Logging {
statistics.getNumNulls;
}
+ // Replaces each VariantType in the schema with the corresponding type in the shredding schema.
+ // Used for testing, where we force a single shredding schema for all Variant fields.
+ // Does not touch Variant fields nested in arrays, maps, or UDTs.
+ private def replaceVariantTypes(schema: StructType, shreddingSchema: StructType): StructType = {
+ val newFields = schema.fields.zip(shreddingSchema.fields).map {
+ case (field, shreddingField) =>
+ field.dataType match {
+ case s: StructType =>
+ field.copy(dataType = replaceVariantTypes(s, shreddingSchema))
Review Comment:
@cashmand Thank you for your reply. We have also implemented shredding
internally in Apache Paimon, using `VariantType` together with an optional
`ShreddSchema`. On the write side, we compute the `ShreddSchema` from the
table's shredding properties and set it on the `VariantType`, which is then
passed to our writer (currently Parquet, but possibly ORC or other formats in
the future).
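
To make the write side concrete, here is a minimal sketch. All types and names
below (`Field`, `RowType`, `WriteSide`, `attachShredding`, `needsShredding`,
and the `shreddingSchema` attribute) are hypothetical and simplified for
illustration; they are not Paimon's or Spark's actual classes, and the sketch
only handles top-level variant columns.

```scala
// Hypothetical, simplified type model; not Paimon's or Spark's real classes.
sealed trait DataType
case object IntType extends DataType
case object StringType extends DataType
final case class Field(name: String, dataType: DataType)
final case class RowType(fields: Seq[Field]) extends DataType
// The variant type carries an optional shredding schema as an attribute.
final case class VariantType(shreddingSchema: Option[RowType] = None) extends DataType

object WriteSide {
  // Attach the configured shredding schema (assumed to be derived from table
  // properties) to each top-level variant column. Columns without an entry
  // in the map end up with no shredding schema.
  def attachShredding(schema: RowType, shreddingByColumn: Map[String, RowType]): RowType =
    RowType(schema.fields.map {
      case Field(name, VariantType(_)) =>
        Field(name, VariantType(shreddingByColumn.get(name)))
      case other => other
    })

  // The writer decides from the type alone whether shredding is required.
  def needsShredding(dataType: DataType): Boolean = dataType match {
    case VariantType(Some(_)) => true
    case _                    => false
  }
}
```

Because the decision is carried by the type, the same check would work for any
file format writer that receives the schema.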
On the read side, we compute the required schema from the scan filter and
projection, store it in the same optional attribute of `VariantType`, and pass
it to our reader. Combining it with the schema of the Parquet variant, we
generate the trimmed read `ShreddSchema`. I think of shredding as an extension
of the variant type: when our writer sees that a `ShreddSchema` is present, it
knows immediately that shredding needs to be performed.
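
The read-side trimming can be sketched roughly as follows. This is an
illustration only: `ShreddedField`, `ReadSide`, and `trimShredding` are
hypothetical names, and the sketch assumes a flat list of shredded fields,
whereas a real shredding schema can be nested.

```scala
// Hypothetical, flat model of one shredded field as found in a file.
final case class ShreddedField(name: String, typeName: String)

object ReadSide {
  // Keep only the shredded fields that the scan filter and projection need,
  // preserving the order in which they appear in the file.
  def trimShredding(
      fileShredding: Seq[ShreddedField],
      requiredFields: Set[String]): Seq[ShreddedField] =
    fileShredding.filter(f => requiredFields.contains(f.name))
}
```

Required fields that were not shredded in a given file do not appear in the
trimmed result, so in this sketch the reader would have to fall back to the
unshredded variant value for them.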
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]