yihua opened a new pull request #1246: [HUDI-552] Fix the schema mismatch in Row-to-Avro conversion
URL: https://github.com/apache/incubator-hudi/pull/1246

## What is the purpose of the pull request

This PR addresses [HUDI-552](https://issues.apache.org/jira/browse/HUDI-552). When using the `FilebasedSchemaProvider` to provide the source/target schema in Avro while ingesting data with the same columns from a `RowSource`, the DeltaStreamer failed.

The root cause is that when writing parquet files in Spark, all fields are automatically converted to nullable for compatibility reasons. If the source Avro schema has non-null fields, `AvroConversionUtils.createRdd` still uses the schema from the Dataframe to convert each Row to an Avro record, so the resulting schema differs from the provided one (in nullability only).

To fix this issue, the Avro schema, if it exists, is passed to the conversion function to reconstruct the correct StructType for conversion.

## Brief change log

- Passed the Avro schema to `createRdd` to generate the correct StructType for conversion in `DeltaSync.readFromSource` and `AvroConversionUtils.createRdd`
- Added new tests to make sure the logic is correct (before this schema fix, some of the new tests failed)

## Verify this pull request

This change added tests and can be verified as follows:

- Added tests in `TestHoodieDeltaStreamer` to test the `HoodieDeltaStreamer` with `ParquetDFSSource` under different configurations

## Committer checklist

- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
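
For readers unfamiliar with the nullability mismatch described under "What is the purpose of the pull request", the following is a minimal, dependency-free sketch of the idea. It is not Hudi or Spark code: the field list, the helper names (`to_avro_schema`, `as_read_from_parquet`, `restore_nullability`), and the tuple-based schema model are all assumptions made for illustration. It only shows why a schema derived from the Dataframe differs from the provided Avro schema, and how reusing the provided schema's nullability removes the difference.

```python
# Illustrative model only: a field is (name, avro_type, nullable).
# None of these helpers exist in Hudi or Spark; they are hypothetical.

def avro_field_type(type_name, nullable):
    """Avro expresses a nullable field as a union that includes "null"."""
    return ["null", type_name] if nullable else type_name


def to_avro_schema(fields, name="record"):
    """Build an Avro-style record schema (as a dict) from the field model."""
    return {
        "type": "record",
        "name": name,
        "fields": [
            {"name": n, "type": avro_field_type(t, nullable)}
            for n, t, nullable in fields
        ],
    }


def as_read_from_parquet(fields):
    """Mimic the behavior described above: every field comes back nullable."""
    return [(n, t, True) for n, t, _ in fields]


def restore_nullability(df_fields, declared_fields):
    """Reapply the declared nullability to fields taken from the Dataframe."""
    declared = {n: nullable for n, _, nullable in declared_fields}
    return [(n, t, declared.get(n, nullable)) for n, t, nullable in df_fields]


# Hypothetical schema supplied by a schema provider: two non-null fields.
declared_fields = [
    ("id", "long", False),
    ("name", "string", False),
    ("note", "string", True),
]

dataframe_fields = as_read_from_parquet(declared_fields)

target_schema = to_avro_schema(declared_fields)
schema_from_dataframe = to_avro_schema(dataframe_fields)
schema_after_fix = to_avro_schema(
    restore_nullability(dataframe_fields, declared_fields)
)

# The Dataframe-derived schema differs only in nullability; the
# reconstructed one matches the provided schema again.
print(schema_from_dataframe == target_schema)
print(schema_after_fix == target_schema)
```

In this model the mismatch disappears once the declared schema, rather than the Dataframe's own schema, determines each field's nullability, which mirrors the approach the PR describes for `createRdd`.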
