cashmand commented on code in PR #49234:
URL: https://github.com/apache/spark/pull/49234#discussion_r1897960988
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetUtils.scala:
##########
@@ -420,6 +420,22 @@ object ParquetUtils extends Logging {
statistics.getNumNulls;
}
+ // Replaces each VariantType in the schema with the corresponding type in
the shredding schema.
+ // Used for testing, where we force a single shredding schema for all
Variant fields.
+ // Does not touch Variant fields nested in arrays, maps, or UDTs.
+ private def replaceVariantTypes(schema: StructType, shreddingSchema:
StructType): StructType = {
+ val newFields = schema.fields.zip(shreddingSchema.fields).map {
+ case (field, shreddingField) =>
+ field.dataType match {
+ case s: StructType =>
+ field.copy(dataType = replaceVariantTypes(s, shreddingSchema))
Review Comment:
Hi @Zouxxyy, that's actually what I tried initially. I could go back to that
approach if there's a consensus that it's better, but I felt that it had some
downsides. Specifically, the approach I took was to add an optional
`shreddingSchema` property to the VariantType class, but leave the `case object
VariantType` to represent a VariantType with no shredding (similar to what's
been done for StringType with default collation), to avoid having to modify all
of the existing code that pattern matches to `VariantType`.
I didn't think this approach added much value - it's not very clear what it
means for the type to have a shredding schema outside of the specific context
of the couple of writer classes that care about it, and it felt like it was
just pushing writer-specific code into the core of the type system. E.g. when
serializing the type to a string, should we include the shredding schema? I
think the answer is that we'd want to when sending it to ParquetWriteSupport,
but not when writing it to the parquet file (since it's not part of the Spark
type when reading it back from Parquet). It seemed better to resolve these
ambiguities explicitly in the relevant writer code, rather than making them
part of the data type definition.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]