voonhous commented on code in PR #18036:
URL: https://github.com/apache/hudi/pull/18036#discussion_r2938555684
##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowParquetWriteSupport.java:
##########
@@ -139,21 +148,153 @@ public HoodieRowParquetWriteSupport(Configuration conf, StructType structType, O
HoodieSchema parsedSchema = HoodieSchema.parse(schemaString);
return HoodieSchemaUtils.addMetadataFields(parsedSchema,
config.getBooleanOrDefault(ALLOW_OPERATION_METADATA_FIELD));
});
+ // Generate shredded schema if there are shredded Variant columns.
+ // Falls back to the provided schema if no shredded Variant columns are present.
+ this.shreddedSchema = generateShreddedSchema(structType, schema);
ParquetWriteSupport.setSchema(structType, hadoopConf);
Review Comment:
Yes, this is intentional.
`ParquetWriteSupport.setSchema(structType, hadoopConf)` sets the original
Spark schema (with `VariantType`) into the Hadoop config, which is part of
Spark's internal metadata.
We don't use Spark's `ParquetWriteSupport` for the actual writing; our
custom `write()` uses `structType` to read from the `InternalRow`, which still
has `VariantType` columns.
`convert(shreddedSchema, schema)` in `init()` builds the actual Parquet
`MessageType` from the shredded schema, which has `VariantType` replaced with
the shredded struct. This determines the file's physical layout.
So the divergence is by design: we read from the row using the original
Spark schema but write to Parquet using the shredded schema. It's done this way
because the internal metadata doesn't provide the shredded schema; it has no
awareness of the shredding structure.
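To illustrate the two-schema divergence described above, here is a minimal,
self-contained sketch. It does not use Hudi or Parquet classes; the class name
`ShreddedSchemaSketch`, the string-based schema representation, and the
`struct<metadata,value,typed_value>` layout are illustrative assumptions only.
The point it demonstrates: the logical schema (with variant columns) stays as
the read-side view of the row, while a derived "shredded" schema drives the
physical layout.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: derive a physical (shredded) schema from a logical one.
// Rows would still be read with the logical schema; only the writer's
// MessageType would be built from the shredded schema.
public class ShreddedSchemaSketch {

  // Replace every "variant" field with a shredded struct layout;
  // non-variant fields pass through unchanged.
  static Map<String, String> generateShreddedSchema(Map<String, String> logical) {
    Map<String, String> shredded = new LinkedHashMap<>();
    for (Map.Entry<String, String> field : logical.entrySet()) {
      if ("variant".equals(field.getValue())) {
        // Assumed shredded physical layout: metadata + value + typed_value.
        shredded.put(field.getKey(),
            "struct<metadata:binary,value:binary,typed_value:long>");
      } else {
        shredded.put(field.getKey(), field.getValue());
      }
    }
    return shredded;
  }

  public static void main(String[] args) {
    Map<String, String> logical = new LinkedHashMap<>();
    logical.put("id", "long");
    logical.put("payload", "variant");

    Map<String, String> physical = generateShreddedSchema(logical);
    // The row is read as (long, variant); the file is laid out as
    // (long, struct<metadata,value,typed_value>).
    System.out.println(physical.get("payload"));
    // prints struct<metadata:binary,value:binary,typed_value:long>
  }
}
```

In the real write support, the analogous split is that `write()` keeps using
`structType` while `init()` builds the `MessageType` from the shredded schema.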
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]