[ 
https://issues.apache.org/jira/browse/HUDI-552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-552:
--------------------------------
    Status: Open  (was: New)

> Fix the schema mismatch in Row-to-Avro conversion
> -------------------------------------------------
>
>                 Key: HUDI-552
>                 URL: https://issues.apache.org/jira/browse/HUDI-552
>             Project: Apache Hudi (incubating)
>          Issue Type: Sub-task
>          Components: Spark Integration
>            Reporter: Ethan Guo
>            Assignee: Ethan Guo
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.5.1
>
>         Attachments: Screen Shot 2020-01-18 at 12.12.58 AM.png, Screen Shot 
> 2020-01-18 at 12.13.08 AM.png, Screen Shot 2020-01-18 at 12.15.09 AM.png, 
> Screen Shot 2020-01-18 at 12.31.23 AM.png
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> When using the `FilebasedSchemaProvider` to provide the source schema in 
> Avro while ingesting data from `ParquetDFSSource` with the same schema, the 
> DeltaStreamer fails.  A new test case demonstrating the error is shown 
> below:
> !Screen Shot 2020-01-18 at 12.12.58 AM.png|width=543,height=392!
> !Screen Shot 2020-01-18 at 12.13.08 AM.png|width=546,height=165!
> Further investigation shows that the root cause is that when Spark writes 
> parquet files, all fields are automatically [converted to be 
> nullable|https://spark.apache.org/docs/latest/sql-data-sources-parquet.html] 
> for compatibility reasons.  Even if the source Avro schema has non-null 
> fields, `AvroConversionUtils.createRdd` still uses the `dataType` from the 
> DataFrame to convert each Row to an Avro record.  That `dataType` has 
> nullable fields per Spark's logic, even though the field names are identical 
> to those in the source Avro schema.  Thus the Avro records produced by the 
> conversion have a schema that differs from the source schema (in nullability 
> only).  Before the records are inserted, other operations that use the 
> source schema file fail to serialize/deserialize them because of this 
> schema mismatch.
>  
> The following screenshot shows the modified Avro schema in 
> `AvroConversionUtils.createRdd`.  The original source schema file is:
> !Screen Shot 2020-01-18 at 12.31.23 AM.png|width=844,height=349!
>  
> !Screen Shot 2020-01-18 at 12.15.09 AM.png|width=850,height=471!
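The nullability-only mismatch described above can be illustrated with a minimal, stdlib-only Python sketch (this is not Hudi or Spark code; the record and field names are made up for illustration). A non-null Avro field and its Spark-nullable counterpart, which Avro expresses as a union with "null", compare unequal even though every field name matches:

```python
import json

# Source Avro schema as supplied to FilebasedSchemaProvider: "name" is non-null.
source_schema = json.loads("""
{
  "type": "record",
  "name": "TestRecord",
  "fields": [
    {"name": "name", "type": "string"}
  ]
}
""")

# Schema as derived from the Spark DataFrame after a parquet round-trip:
# Spark marks every field nullable, which Avro encodes as a union with "null".
dataframe_schema = json.loads("""
{
  "type": "record",
  "name": "TestRecord",
  "fields": [
    {"name": "name", "type": ["null", "string"]}
  ]
}
""")

def field_types(schema):
    """Map each field name to its declared type."""
    return {f["name"]: f["type"] for f in schema["fields"]}

# Field names are identical, but the types differ only in nullability,
# so the schemas are not equal and records written with one schema fail
# to deserialize against the other downstream.
print(field_types(source_schema) == field_types(dataframe_schema))  # False
```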



--
This message was sent by Atlassian Jira
(v8.3.4#803005)