Re: [PR] [SPARK-50614][SQL] Add Variant shredding support for Parquet [spark]

via GitHub Thu, 26 Dec 2024 06:43:47 -0800


cashmand commented on code in PR #49234:
URL: https://github.com/apache/spark/pull/49234#discussion_r1897960988



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetUtils.scala:
##########
@@ -420,6 +420,22 @@ object ParquetUtils extends Logging {
     statistics.getNumNulls;
   }
 
+  // Replaces each VariantType in the schema with the corresponding type in 
the shredding schema.
+  // Used for testing, where we force a single shredding schema for all 
Variant fields.
+  // Does not touch Variant fields nested in arrays, maps, or UDTs.
+  private def replaceVariantTypes(schema: StructType, shreddingSchema: 
StructType): StructType = {
+    val newFields = schema.fields.zip(shreddingSchema.fields).map {
+      case (field, shreddingField) =>
+        field.dataType match {
+          case s: StructType =>
+            field.copy(dataType = replaceVariantTypes(s, shreddingSchema))

Review Comment:
   Hi @Zouxxyy, that's actually what I tried initially. I could go back to that 
approach if there's a consensus that it's better, but I felt that it had some 
downsides. Specifically, the approach I took was to add an optional 
`shreddingSchema` property to the VariantType class, but leave the `case object 
VariantType` to represent a VariantType with no shredding (similar to what's 
been done for StringType with default collation), to avoid having to modify all 
of the existing code that pattern matches to `VariantType`.
   
   I didn't think this approach added much value - it's not very clear what it 
means for the type to have a shredding schema outside of the specific context 
of the couple of writer classes that care about it, and it felt like it was 
just pushing writer-specific code into the core of the type system. E.g. when 
serializing the type to a string, should we include the shredding schema? I 
think the answer is that we'd want to when sending it to ParquetWriteSupport, 
but not when writing it to the parquet file (since it's not part of the Spark 
type when reading it back from Parquet). It seemed better to resolve these 
ambiguities explicitly in the relevant writer code, rather than making them 
part of the data type definition.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-50614][SQL] Add Variant shredding support for Parquet [spark]

Reply via email to