[
https://issues.apache.org/jira/browse/HUDI-552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vinoth Chandar updated HUDI-552:
--------------------------------
Status: Open (was: New)
> Fix the schema mismatch in Row-to-Avro conversion
> -------------------------------------------------
>
> Key: HUDI-552
> URL: https://issues.apache.org/jira/browse/HUDI-552
> Project: Apache Hudi (incubating)
> Issue Type: Sub-task
> Components: Spark Integration
> Reporter: Ethan Guo
> Assignee: Ethan Guo
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.5.1
>
> Attachments: Screen Shot 2020-01-18 at 12.12.58 AM.png, Screen Shot
> 2020-01-18 at 12.13.08 AM.png, Screen Shot 2020-01-18 at 12.15.09 AM.png,
> Screen Shot 2020-01-18 at 12.31.23 AM.png
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> DeltaStreamer fails when `FilebasedSchemaProvider` provides the source
> schema in Avro and the data is ingested from `ParquetDFSSource` with the
> same schema. A new test case, shown below, demonstrates the error:
> !Screen Shot 2020-01-18 at 12.12.58 AM.png|width=543,height=392!
> !Screen Shot 2020-01-18 at 12.13.08 AM.png|width=546,height=165!
> Further investigation shows that the root cause is that when Spark writes
> parquet files, all fields are automatically [converted to be
> nullable|https://spark.apache.org/docs/latest/sql-data-sources-parquet.html]
> for compatibility reasons. Even if the source Avro schema has non-null
> fields, `AvroConversionUtils.createRdd` still uses the `dataType` from the
> DataFrame to convert each Row to an Avro record. That `dataType` has
> nullable fields per Spark's logic, even though the field names are identical
> to those in the source Avro schema. The resulting Avro records therefore
> have a schema that differs from the source schema file, though only in
> nullability. Before the records are inserted, other operations use the
> source schema file, and serialization/deserialization fails because of
> this schema mismatch.
>
> The following screenshot shows the modified Avro schema in
> `AvroConversionUtils.createRdd`. The original source schema file is:
> !Screen Shot 2020-01-18 at 12.31.23 AM.png|width=844,height=349!
>
> !Screen Shot 2020-01-18 at 12.15.09 AM.png|width=850,height=471!
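
The nullability-only mismatch described in the issue can be illustrated with a small standalone sketch. This is not Hudi or Spark code: it uses plain Python dictionaries in the Avro JSON schema shape, and the schema contents and the helper names (`relax_nullability`, `nullability_only_diff`) are hypothetical, chosen only for illustration. It handles top-level fields only, assuming a flat record.

```python
import copy

# Hypothetical source schema with non-null fields, standing in for the
# schema file read by the schema provider.
SOURCE_SCHEMA = {
    "type": "record",
    "name": "trip",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "fare", "type": "double"},
    ],
}


def as_nullable(avro_type):
    """Wrap an Avro type in a union with "null" unless it already allows null."""
    if isinstance(avro_type, list):
        return avro_type if "null" in avro_type else ["null"] + avro_type
    return ["null", avro_type]


def relax_nullability(schema):
    """Mimic the effect described in the issue: every top-level field
    becomes nullable while names and base types stay the same."""
    relaxed = copy.deepcopy(schema)
    for field in relaxed["fields"]:
        field["type"] = as_nullable(field["type"])
    return relaxed


def strip_null(avro_type):
    """Drop "null" from a union so types can be compared ignoring nullability."""
    if isinstance(avro_type, list):
        rest = [t for t in avro_type if t != "null"]
        return rest[0] if len(rest) == 1 else rest
    return avro_type


def nullability_only_diff(expected, actual):
    """Return the names of fields whose types differ only in nullability.

    Raises ValueError if the schemas differ in any other way.
    """
    exp_fields, act_fields = expected["fields"], actual["fields"]
    if [f["name"] for f in exp_fields] != [f["name"] for f in act_fields]:
        raise ValueError("field names differ")
    changed = []
    for exp, act in zip(exp_fields, act_fields):
        if exp["type"] == act["type"]:
            continue
        if strip_null(exp["type"]) != strip_null(act["type"]):
            raise ValueError("field %s differs beyond nullability" % exp["name"])
        changed.append(exp["name"])
    return changed


converted = relax_nullability(SOURCE_SCHEMA)
print(converted == SOURCE_SCHEMA)                       # False
print(nullability_only_diff(SOURCE_SCHEMA, converted))  # ['id', 'fare']
```

The field names match in both schemas, yet a strict equality check fails, which is the kind of mismatch that can break serialization/deserialization against the source schema file.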