voonhous opened a new pull request, #18062:
URL: https://github.com/apache/hudi/pull/18062

   ### Describe the issue this Pull Request addresses
   
   Closes: [#18037](https://github.com/apache/hudi/issues/18037)
   
   This PR implements the configuration plumbing required to support **Shredded 
Variant** types in the **Spark Record (HoodieRow)** write path.
   
   Hudi supports both Avro-based and Spark Row-based writing, but Spark 4.0 
`Variant` shredding requires the underlying Parquet writer to be configured 
for it. This PR propagates the user-facing Hudi shredding configurations to 
the Hadoop configuration used by `HoodieRowParquetWriteSupport` and marks the 
schema for shredding during Spark Row writes.
   
   ### Summary and Changelog
   
   This PR introduces new configuration properties to `HoodieStorageConfig` and 
wires them into `HoodieRowParquetWriteSupport`. It ensures that when users 
enable shredding (or force a specific shredding schema for testing), these 
preferences are respected during the Spark Row-based write process.
   
   **Key Changes:**
   
   1.  **Configuration (`HoodieStorageConfig`)**:
       * Added `hoodie.parquet.variant.write.shredding.enabled` (default: 
`true`): Master switch to enable/disable writing shredded variants.
       * Added `hoodie.parquet.variant.force.shredding.schema.for.test`: 
Advanced config to force a specific DDL schema for shredding (overriding the 
natural schema), primarily for testing purposes.
       * Added `hoodie.parquet.variant.allow.reading.shredded` (default: 
`true`): Controls whether the reader is allowed to reconstruct shredded 
variants.
   
   2.  **Write Support (`HoodieRowParquetWriteSupport`)**:
       * **Config Propagation**: In the constructor, the Hudi configurations 
listed above are read and set into the `hadoopConf` using the corresponding 
internal Spark keys (e.g., `spark.sql.variant.writeShredding.enabled`).
       * **Schema Transformation**: Updated `generateShreddedSchema` to:
           * Respect the "write shredding enabled" flag (returns the original 
schema if disabled).
           * Handle the "forced shredding schema" logic: if set, it calls 
`generateVariantWriteShreddingSchema` on the Spark adapter to apply the forced 
schema to all variant fields.
           * Handle standard schema-driven shredding: checks the `HoodieSchema` 
for `isShredded()` metadata and maps the shredded fields accordingly.
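
The propagation and schema decision described above can be modelled with a small, Spark-free sketch. It is illustrative only and is not the code in this PR: plain dicts stand in for the Hudi properties and `hadoopConf`, and `propagate` and `choose_shredding` are hypothetical helper names. Only `spark.sql.variant.writeShredding.enabled` is named in this description; the other two Spark keys, and the fallback for fields without shredding metadata, are assumptions.

```python
# Hudi keys as listed in this PR description.
WRITE_ENABLED = "hoodie.parquet.variant.write.shredding.enabled"
ALLOW_READING = "hoodie.parquet.variant.allow.reading.shredded"
FORCE_SCHEMA = "hoodie.parquet.variant.force.shredding.schema.for.test"

# Hudi key -> internal Spark key. Only the first mapping is named in the
# description; the other two Spark keys are assumptions.
HUDI_TO_SPARK = {
    WRITE_ENABLED: "spark.sql.variant.writeShredding.enabled",
    ALLOW_READING: "spark.sql.variant.allowReadingShredded",        # assumed
    FORCE_SCHEMA: "spark.sql.variant.forceShreddingSchemaForTest",  # assumed
}

# Defaults stated in the description; the forced schema is assumed unset.
DEFAULTS = {WRITE_ENABLED: "true", ALLOW_READING: "true"}


def propagate(hudi_props, hadoop_conf):
    """Copy the Hudi shredding settings into the (dict-based) Hadoop conf."""
    for hudi_key, spark_key in HUDI_TO_SPARK.items():
        value = hudi_props.get(hudi_key, DEFAULTS.get(hudi_key))
        if value is not None:
            hadoop_conf[spark_key] = value
    return hadoop_conf


def choose_shredding(hadoop_conf, field_is_shredded):
    """Return which branch of the schema logic would apply to a variant field."""
    enabled = hadoop_conf.get(HUDI_TO_SPARK[WRITE_ENABLED], "true")
    if enabled.lower() != "true":
        return "original"  # flag disabled: schema is left untouched
    if hadoop_conf.get(HUDI_TO_SPARK[FORCE_SCHEMA]):
        return "forced"  # forced DDL schema applies to all variant fields
    # Schema-driven: follow the field's shredding metadata. Falling back to
    # the original schema for unshredded fields is an assumption.
    return "schema-driven" if field_is_shredded else "original"
```

For example, `choose_shredding(propagate({}, {}), True)` takes the schema-driven branch, while setting the write flag to `"false"` returns the original schema even when a forced schema is present.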
   
   ### Impact
   
   * **Feature Parity:** Brings the Spark Row write path to parity with the 
Avro write path regarding Variant shredding support.
   * **User Control:** Gives users explicit control over whether to write 
shredded variants and how to handle reading them via Hudi-native configurations.
   * **Performance:** Enables the performance benefits of shredded variants 
(column pruning, compression) for users employing the Bulk Insert / Row writing 
path in Spark.
   
   ### Risk Level
   
   **Low**
   
   * **Config Controlled:** The changes are guarded by feature flags. The 
default behavior (shredding enabled) aligns with Spark 4.0 defaults, but can be 
disabled via config.
   * **Isolation:** Changes are localized to the Parquet write support 
initialization and schema generation logic.
   
   ### Documentation Update
   
   * [ ] The new configurations in `HoodieStorageConfig` need to be documented:
       * `hoodie.parquet.variant.write.shredding.enabled`
       * `hoodie.parquet.variant.allow.reading.shredded`
       * `hoodie.parquet.variant.force.shredding.schema.for.test` (Advanced)
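
As a starting point for that documentation, a usage sketch along these lines may help. It only builds an options dict; `shredding_write_options` is a hypothetical helper, and combining the new flag with the existing bulk-insert and row-writer options is an assumption about typical use, not something verified against this PR.

```python
def shredding_write_options(table_name, enable_shredding=True):
    """Build writer options for a row-based bulk insert (hypothetical helper)."""
    return {
        "hoodie.table.name": table_name,
        # Existing Hudi options that select the Spark Row write path.
        "hoodie.datasource.write.operation": "bulk_insert",
        "hoodie.datasource.write.row.writer.enable": "true",
        # New switch from this PR; the stated default is "true".
        "hoodie.parquet.variant.write.shredding.enabled":
            str(enable_shredding).lower(),
    }


opts = shredding_write_options("events")

# With a live SparkSession and a DataFrame `df` holding a Variant column,
# the options would be passed to the writer roughly like this:
#   df.write.format("hudi").options(**opts).mode("append").save(base_path)
```

Passing `enable_shredding=False` sets the flag to `"false"`, which per the description above leaves the schema unshredded.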
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Enough context is provided in the sections above
   - [x] Adequate tests were added if applicable

