prashantwason opened a new pull request, #17777: URL: https://github.com/apache/hudi/pull/17777
### Describe the issue this Pull Request addresses

This PR adds an option to ensure all columns in the schema are nullable when using HoodieStreamer with row-based sources like `SQLSource` or `SQLFileBasedSource`.

Issue: [HUDI-6105](https://issues.apache.org/jira/browse/HUDI-6105)

When new columns are added via SQL queries, the schema must be backwards compatible. New columns added to a table must be nullable because existing records don't have values for them. This change provides a configuration option to automatically make all columns nullable, ensuring smooth schema evolution.

### Summary and Changelog

**What users gain:** Users can now set `hoodie.deltastreamer.transformed.row.nullable=true` to automatically make all columns in the incoming schema nullable, preventing schema compatibility issues during schema evolution.

**Changes:**
- Added new configuration constants in `HoodieStreamer.java`:
  - `ENSURE_ALL_COLUMNS_NULLABLE_KEY = "hoodie.deltastreamer.transformed.row.nullable"`
  - `ENSURE_ALL_COLUMNS_NULLABLE_DEFAULT = false`
- Added `extractSchemaFromDataset()` method in `UtilHelpers.java` that optionally converts schema to nullable using Spark's `StructType.asNullable()`
- Updated `RowSource.java` to use the new schema extraction method
- Updated `StreamSync.java` to use the new schema extraction method for transformed datasets

### Impact

- New configuration option `hoodie.deltastreamer.transformed.row.nullable` (default: `false`)
- No breaking changes - existing behavior is preserved when config is not set
- When enabled, all columns in the dataset schema are converted to nullable before being used

### Risk Level

low - The feature is disabled by default and only affects schema handling when explicitly enabled. The implementation uses Spark's built-in `asNullable()` method which is well-tested.

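
For reviewers, here is a minimal sketch of the opt-in behavior described above. It deliberately avoids Spark and Hudi classes: `Field`, `asNullable`, `extractSchema`, `allNullable`, and `sampleSchema` are hypothetical stand-ins written for illustration, and the recursion into nested fields is an assumption about how the conversion behaves, not a copy of the code in this PR.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative model only; this is NOT the Hudi or Spark API.
final class NullableSchemaSketch {

    // Hypothetical stand-in for a schema column; struct columns carry children.
    static final class Field {
        final String name;
        final String type;
        final boolean nullable;
        final List<Field> children;

        Field(String name, String type, boolean nullable, List<Field> children) {
            this.name = name;
            this.type = type;
            this.nullable = nullable;
            this.children = children;
        }
    }

    // Returns a copy of the schema with every field, at every nesting level, marked nullable.
    static List<Field> asNullable(List<Field> schema) {
        return schema.stream()
            .map(f -> new Field(f.name, f.type, true, asNullable(f.children)))
            .collect(Collectors.toList());
    }

    // Opt-in switch: the schema is returned untouched unless the flag is set,
    // mirroring the "disabled by default" behavior described in the PR.
    static List<Field> extractSchema(List<Field> schema, boolean ensureAllColumnsNullable) {
        return ensureAllColumnsNullable ? asNullable(schema) : schema;
    }

    // True only if every field, including nested ones, is nullable.
    static boolean allNullable(List<Field> schema) {
        return schema.stream().allMatch(f -> f.nullable && allNullable(f.children));
    }

    // Made-up example schema with a required top-level column and a required nested column.
    static List<Field> sampleSchema() {
        Field city = new Field("city", "string", false, Collections.<Field>emptyList());
        return Arrays.asList(
            new Field("id", "long", false, Collections.<Field>emptyList()),
            new Field("name", "string", true, Collections.<Field>emptyList()),
            new Field("address", "struct", false, Arrays.asList(city)));
    }

    public static void main(String[] args) {
        List<Field> schema = sampleSchema();
        System.out.println(allNullable(extractSchema(schema, false))); // prints false
        System.out.println(allNullable(extractSchema(schema, true)));  // prints true
    }
}
```

The original schema object is never mutated in this sketch; with the flag off, the same instance is passed through, which is the property that keeps existing pipelines unaffected.
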
### Documentation Update

The new config `hoodie.deltastreamer.transformed.row.nullable` should be documented:

- **Key:** `hoodie.deltastreamer.transformed.row.nullable`
- **Default:** `false`
- **Description:** When set to `true`, all columns in the incoming dataset schema are made nullable. This is useful for maintaining backwards compatibility when new columns are added via SQL queries.

### Contributor's checklist

- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Enough context is provided in the sections above
- [ ] Adequate tests were added if applicable
