voonhous opened a new pull request, #18062: URL: https://github.com/apache/hudi/pull/18062
### Describe the issue this Pull Request addresses Closes: [#18037](https://github.com/apache/hudi/issues/18037) This PR implements the configuration plumbing required to support **Shredded Variant** types in the **Spark Record (HoodieRow)** write path. While Hudi supports both Avro-based and Spark Row-based writing, proper support for Spark 4.0 `Variant` shredding requires configuring the underlying Parquet writer correctly. This PR ensures that user-facing Hudi configurations for shredding are correctly propagated to the Hadoop configuration used by `HoodieRowParquetWriteSupport` and that the schema is correctly marked for shredding during Spark Row writes. ### Summary and Changelog This PR introduces new configuration properties to `HoodieStorageConfig` and wires them into `HoodieRowParquetWriteSupport`. It ensures that when users enable shredding (or force a specific shredding schema for testing), these preferences are respected during the Spark Row-based write process. **Key Changes:** 1. **Configuration (`HoodieStorageConfig`)**: * Added `hoodie.parquet.variant.write.shredding.enabled` (default: `true`): Master switch to enable/disable writing shredded variants. * Added `hoodie.parquet.variant.force.shredding.schema.for.test`: Advanced config to force a specific DDL schema for shredding (overriding the natural schema), primarily for testing purposes. * Added `hoodie.parquet.variant.allow.reading.shredded` (default: `true`): Controls whether the reader is allowed to reconstruct shredded variants. 2. **Write Support (`HoodieRowParquetWriteSupport`)**: * **Config Propagation**: In the constructor, the Hudi configurations listed above are read and set into the `hadoopConf` using the corresponding internal Spark keys (e.g., `spark.sql.variant.writeShredding.enabled`). * **Schema Transformation**: Updated `generateShreddedSchema` to: * Respect the "write shredding enabled" flag (returns the original schema if disabled). * Handle the "forced shredding schema" logic: if set, it calls `generateVariantWriteShreddingSchema` on the Spark adapter to apply the forced schema to all variant fields. * Handle standard schema-driven shredding: Checks the `HoodieSchema` for `isShredded()` metadata and correctly maps the fields. ### Impact * **Feature Parity:** Brings the Spark Row write path in parity with the Avro write path regarding Variant shredding support. * **User Control:** Gives users explicit control over whether to write shredded variants and how to handle reading them via Hudi-native configurations. * **Performance:** Enables the performance benefits of shredded variants (column pruning, compression) for users employing the Bulk Insert / Row writing path in Spark. ### Risk Level **Low** * **Config Controlled:** The changes are guarded by feature flags. The default behavior (shredding enabled) aligns with Spark 4.0 defaults, but can be disabled via config. * **Isolation:** Changes are localized to the Parquet write support initialization and schema generation logic. ### Documentation Update * [ ] The new configurations in `HoodieStorageConfig` need to be documented: * `hoodie.parquet.variant.write.shredding.enabled` * `hoodie.parquet.variant.allow.reading.shredded` * `hoodie.parquet.variant.force.shredding.schema.for.test` (Advanced) ### Contributor's checklist - [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [x] Enough context is provided in the sections above - [x] Adequate tests were added if applicable -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
