yihua opened a new pull request #1246: [HUDI-552] Fix the schema mismatch in 
Row-to-Avro conversion
URL: https://github.com/apache/incubator-hudi/pull/1246
 
 
   ## What is the purpose of the pull request
   
   This PR addresses 
[HUDI-552](https://issues.apache.org/jira/browse/HUDI-552).  When 
`FilebasedSchemaProvider` supplies the source/target schema in Avro and data 
with the same columns is ingested from a `RowSource`, the DeltaStreamer 
fails.  The root cause is that Spark automatically converts all fields to 
nullable when writing parquet files, for compatibility reasons.  Even if the 
source Avro schema has non-null fields, `AvroConversionUtils.createRdd` still 
uses the schema from the DataFrame to convert each Row to an Avro record, so 
the records end up with a schema that differs from the provided one (in 
nullability only).
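
   The nullability mismatch described above can be illustrated with a small 
pure-Python sketch. It does not use Spark, Avro, or Hudi APIs; the schema 
contents and the helper names `relax_nullability` and `nullability_diff` are 
hypothetical and exist only for illustration.

```python
import copy

# Hypothetical Avro schema, similar in shape to what a schema file might declare.
SOURCE_SCHEMA = {
    "type": "record",
    "name": "example_record",
    "fields": [
        {"name": "id", "type": "long"},               # non-null field
        {"name": "name", "type": ["null", "string"]},  # already nullable
    ],
}


def is_nullable(avro_type):
    """A field is nullable when its type is a union that contains "null"."""
    return isinstance(avro_type, list) and "null" in avro_type


def relax_nullability(schema):
    """Return a copy of the schema in which every field is nullable.

    This mimics, in a simplified way, a schema derived from data whose
    fields were all made nullable on write.
    """
    relaxed = copy.deepcopy(schema)
    for field in relaxed["fields"]:
        if not is_nullable(field["type"]):
            field["type"] = ["null", field["type"]]
    return relaxed


def nullability_diff(expected, actual):
    """List the names of fields whose nullability differs between two schemas."""
    expected_flags = {f["name"]: is_nullable(f["type"]) for f in expected["fields"]}
    actual_flags = {f["name"]: is_nullable(f["type"]) for f in actual["fields"]}
    return sorted(
        name
        for name, flag in expected_flags.items()
        if name in actual_flags and actual_flags[name] != flag
    )


derived_schema = relax_nullability(SOURCE_SCHEMA)
mismatched_fields = nullability_diff(SOURCE_SCHEMA, derived_schema)
```

   Here only `id` changes, since `name` was nullable to begin with. The 
columns are identical, yet the two schemas no longer compare equal.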
   
   To fix this issue, the Avro schema, if it exists, is passed to the 
conversion function to reconstruct the correct StructType for conversion.
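
   The idea behind the fix can be sketched as follows. This is a simplified 
Python model rather than the actual Hudi or Spark code: `Field` stands in for 
a StructType field, and `struct_for_conversion` is a hypothetical helper that 
prefers the nullability declared in the provided Avro schema and falls back to 
the DataFrame's own fields when no schema is given.

```python
from typing import NamedTuple, Optional


class Field(NamedTuple):
    """Minimal stand-in for a struct field: name, type name, nullability."""
    name: str
    data_type: str
    nullable: bool


def avro_type_is_nullable(avro_type):
    """A field is nullable when its Avro type is a union containing "null"."""
    return isinstance(avro_type, list) and "null" in avro_type


def struct_for_conversion(df_fields, avro_schema: Optional[dict] = None):
    """Pick the field definitions to use for Row-to-Avro conversion.

    Without an Avro schema, the DataFrame fields are used unchanged.
    With one, each field takes its nullability from the Avro schema.
    """
    if avro_schema is None:
        return list(df_fields)
    nullable_by_name = {
        f["name"]: avro_type_is_nullable(f["type"]) for f in avro_schema["fields"]
    }
    return [
        field._replace(nullable=nullable_by_name.get(field.name, field.nullable))
        for field in df_fields
    ]


# Fields as read back from parquet: everything is nullable.
df_fields = [Field("id", "long", True), Field("name", "string", True)]

# Hypothetical provided Avro schema in which "id" is non-null.
provided_schema = {
    "type": "record",
    "name": "example_record",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": ["null", "string"]},
    ],
}

with_schema = struct_for_conversion(df_fields, provided_schema)
without_schema = struct_for_conversion(df_fields)
```

   With the provided schema, `id` is restored to non-null while `name` stays 
nullable; without it, the DataFrame fields pass through unchanged.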
   
   ## Brief change log
   
     - Passed the Avro schema to `createRdd` to generate the correct StructType 
for conversion in `DeltaSync.readFromSource` and `AvroConversionUtils.createRdd`
     - Added new tests to verify the logic (some of the new tests fail 
without this schema fix)
   
   ## Verify this pull request
   
   This change added tests and can be verified as follows:
   
     - Added tests in `TestHoodieDeltaStreamer` to test the 
`HoodieDeltaStreamer` with `ParquetDFSSource` under different configurations
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 


With regards,
Apache Git Services
