voonhous commented on code in PR #18036:
URL: https://github.com/apache/hudi/pull/18036#discussion_r2938555684
##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowParquetWriteSupport.java:
##########
@@ -139,21 +148,153 @@ public HoodieRowParquetWriteSupport(Configuration conf, StructType structType, O
HoodieSchema parsedSchema = HoodieSchema.parse(schemaString);
return HoodieSchemaUtils.addMetadataFields(parsedSchema,
config.getBooleanOrDefault(ALLOW_OPERATION_METADATA_FIELD));
});
+ // Generate shredded schema if there are shredded Variant columns.
+ // Falls back to the provided schema if no shredded Variant columns are present.
+ this.shreddedSchema = generateShreddedSchema(structType, schema);
ParquetWriteSupport.setSchema(structType, hadoopConf);
Review Comment:
Yes, this is intentional.
`ParquetWriteSupport.setSchema(structType, hadoopConf)` sets the original
Spark schema (with `VariantType`) into the Hadoop config, which is part of
Spark's internal metadata.
We don't use Spark's `ParquetWriteSupport` for the actual writing; our
custom `write()` uses `structType` to read from the `InternalRow`, which still
has `VariantType` columns.
`convert(shreddedSchema, schema)` in `init()` builds the actual Parquet
`MessageType` from the shredded schema, which has `VariantType` replaced with
the shredded struct. This determines the file's physical layout.
So the divergence is by design: we read from the row using the original
Spark schema but write to Parquet using the shredded schema. It's done this way
because the internal metadata doesn't provide the shredded schema; it has no
awareness of the shredding structure.
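To illustrate the two-schema divergence described above, here is a minimal,
self-contained sketch. It does not use Hudi or Parquet classes; the class name
`ShreddedSchemaSketch`, the string-based schema representation, and the
`struct<metadata,value,typed_value>` layout are illustrative assumptions only.
The point it demonstrates: the logical schema (with variant columns) stays as
the read-side view of the row, while a derived "shredded" schema drives the
physical layout.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: derive a physical (shredded) schema from a logical one.
// Rows would still be read with the logical schema; only the writer's
// MessageType would be built from the shredded schema.
public class ShreddedSchemaSketch {

  // Replace every "variant" field with a shredded struct layout;
  // non-variant fields pass through unchanged.
  static Map<String, String> generateShreddedSchema(Map<String, String> logical) {
    Map<String, String> shredded = new LinkedHashMap<>();
    for (Map.Entry<String, String> field : logical.entrySet()) {
      if ("variant".equals(field.getValue())) {
        // Assumed shredded physical layout: metadata + value + typed_value.
        shredded.put(field.getKey(),
            "struct<metadata:binary,value:binary,typed_value:long>");
      } else {
        shredded.put(field.getKey(), field.getValue());
      }
    }
    return shredded;
  }

  public static void main(String[] args) {
    Map<String, String> logical = new LinkedHashMap<>();
    logical.put("id", "long");
    logical.put("payload", "variant");

    Map<String, String> physical = generateShreddedSchema(logical);
    // The row is read as (long, variant); the file is laid out as
    // (long, struct<metadata,value,typed_value>).
    System.out.println(physical.get("payload"));
    // prints struct<metadata:binary,value:binary,typed_value:long>
  }
}
```

In the real write support, the analogous split is that `write()` keeps using
`structType` while `init()` builds the `MessageType` from the shredded schema.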
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]