prashantwason opened a new pull request, #17777:
URL: https://github.com/apache/hudi/pull/17777

   ### Describe the issue this Pull Request addresses
   
   This PR adds an option to ensure all columns in the schema are nullable when using HoodieStreamer with row-based sources like `SQLSource` or `SQLFileBasedSource`.
   
   Issue: [HUDI-6105](https://issues.apache.org/jira/browse/HUDI-6105)
   
   When new columns are added via SQL queries, the schema must be backwards compatible. New columns added to a table must be nullable because existing records don't have values for them. This change provides a configuration option to automatically make all columns nullable, ensuring smooth schema evolution.
   
   ### Summary and Changelog
   
   **What users gain:** Users can now set `hoodie.deltastreamer.transformed.row.nullable=true` to automatically make all columns in the incoming schema nullable, preventing schema compatibility issues during schema evolution.
   
   **Changes:**
   - Added new configuration constants in `HoodieStreamer.java`:
     - `ENSURE_ALL_COLUMNS_NULLABLE_KEY = "hoodie.deltastreamer.transformed.row.nullable"`
     - `ENSURE_ALL_COLUMNS_NULLABLE_DEFAULT = false`
   - Added `extractSchemaFromDataset()` method in `UtilHelpers.java` that optionally converts schema to nullable using Spark's `StructType.asNullable()`
   - Updated `RowSource.java` to use the new schema extraction method
   - Updated `StreamSync.java` to use the new schema extraction method for transformed datasets
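
To illustrate the intended effect of the opt-in conversion, here is a minimal, Spark-free sketch. The `Field` class and the `extractSchema` helper are hypothetical stand-ins written for this example; they are not the code in `UtilHelpers.java` and not Spark's `StructType`. The sketch assumes the conversion also applies to nested fields, which is my understanding of how `asNullable()` behaves.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative model only: these classes are hypothetical stand-ins for
 * Spark's StructType/StructField, not the Hudi or Spark implementation.
 */
public class NullableSchemaSketch {

    /** A column: name, nullability, and nested columns (empty for primitives). */
    static final class Field {
        final String name;
        final boolean nullable;
        final List<Field> children;

        Field(String name, boolean nullable, List<Field> children) {
            this.name = name;
            this.nullable = nullable;
            this.children = children;
        }
    }

    /** Returns a copy of the field with it and all nested fields marked nullable. */
    static Field asNullable(Field field) {
        List<Field> relaxed = new ArrayList<>();
        for (Field child : field.children) {
            relaxed.add(asNullable(child));
        }
        return new Field(field.name, true, relaxed);
    }

    /**
     * Mirrors the opt-in behavior described in this PR: the schema is
     * returned unchanged unless the flag is enabled.
     */
    static List<Field> extractSchema(List<Field> schema, boolean ensureAllColumnsNullable) {
        if (!ensureAllColumnsNullable) {
            return schema;
        }
        List<Field> result = new ArrayList<>();
        for (Field field : schema) {
            result.add(asNullable(field));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Field> schema = List.of(
            new Field("id", false, List.of()),
            new Field("address", false, List.of(new Field("zip", false, List.of()))));

        // With the flag enabled, every top-level and nested column is nullable.
        for (Field field : extractSchema(schema, true)) {
            System.out.println(field.name + " nullable=" + field.nullable);
        }
    }
}
```

The conversion builds new objects rather than mutating the input, so the original schema stays available for comparison.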
   
   ### Impact
   
   - New configuration option `hoodie.deltastreamer.transformed.row.nullable` (default: `false`)
   - No breaking changes - existing behavior is preserved when the config is not set
   - When enabled, all columns in the dataset schema are converted to nullable before being used
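
As a rough sketch of the default-off behavior listed above, the snippet below resolves the flag from a property set. It uses `java.util.Properties` as a stand-in for whatever property class the streamer actually uses, and `ensureAllColumnsNullable` is a hypothetical helper named for this example; only the key and the default value come from this PR.

```java
import java.util.Properties;

public class NullableConfigSketch {
    // Key and default as listed in this PR description.
    static final String ENSURE_ALL_COLUMNS_NULLABLE_KEY =
        "hoodie.deltastreamer.transformed.row.nullable";
    static final boolean ENSURE_ALL_COLUMNS_NULLABLE_DEFAULT = false;

    /** Hypothetical helper: resolves the flag, falling back to the default. */
    static boolean ensureAllColumnsNullable(Properties props) {
        String raw = props.getProperty(ENSURE_ALL_COLUMNS_NULLABLE_KEY);
        return raw == null
            ? ENSURE_ALL_COLUMNS_NULLABLE_DEFAULT
            : Boolean.parseBoolean(raw.trim());
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // Key absent: the default applies and existing behavior is preserved.
        System.out.println(ensureAllColumnsNullable(props));

        // Key set explicitly: the nullable conversion is requested.
        props.setProperty(ENSURE_ALL_COLUMNS_NULLABLE_KEY, "true");
        System.out.println(ensureAllColumnsNullable(props));
    }
}
```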
   
   ### Risk Level
   
   low - The feature is disabled by default and only affects schema handling when explicitly enabled. The implementation uses Spark's built-in `asNullable()` method, which is well tested.
   
   ### Documentation Update
   
   The new config `hoodie.deltastreamer.transformed.row.nullable` should be documented:
   - **Key:** `hoodie.deltastreamer.transformed.row.nullable`
   - **Default:** `false`
   - **Description:** When set to `true`, all columns in the incoming dataset schema are made nullable. This is useful for maintaining backwards compatibility when new columns are added via SQL queries.
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Enough context is provided in the sections above
   - [ ] Adequate tests were added if applicable

