[jira] [Comment Edited] (HUDI-1343) Add standard schema postprocessor which would rewrite the schema using spark-avro conversion

sivabalan narayanan (Jira) Wed, 28 Apr 2021 04:06:06 -0700


    [ 
https://issues.apache.org/jira/browse/HUDI-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325964#comment-17325964
 ]


sivabalan narayanan edited comment on HUDI-1343 at 4/28/21, 11:05 AM:
----------------------------------------------------------------------

[~liujinhui] [~vbalaji] [~nishith29]: Do you folks think this is still 
required after this fix [https://github.com/apache/hudi/pull/2765]? It fixes 
AvroConversionUtils.convertStructTypeToAvroSchema() to ensure that null is the 
first entry in the union and that the default value is set to null when a field 
is nullable in the Spark StructType. 

I ask because we have enabled the schema post-processor by default, so I wanted 
to double-check whether it is still applicable. 
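
To make the null-first rule above concrete, here is a minimal sketch in plain Python. It is not the Hudi implementation and does not use the Avro library; the helper name null_first is hypothetical, and the schema is handled as an ordinary dict. It only illustrates the shape of the rewrite: every nullable union lists "null" first and gets a null default.

```python
import json


def null_first(schema):
    """Return a copy of a record schema (a plain dict) in which every
    nullable union lists "null" first and carries a null default.

    Hypothetical helper for illustration only; not part of Hudi or Avro.
    """
    fields = []
    for field in schema["fields"]:
        field = dict(field)  # shallow copy so the input is not mutated
        ftype = field["type"]
        if isinstance(ftype, list) and "null" in ftype:
            # Move "null" to the front, keep the other branches in order.
            field["type"] = ["null"] + [t for t in ftype if t != "null"]
            field["default"] = None
        fields.append(field)
    return {**schema, "fields": fields}


# Two fields taken from the spark-avro style schema in the description.
spark_style = {
    "type": "record",
    "name": "hoodie_source",
    "namespace": "hoodie.source",
    "fields": [
        {"name": "_ts_ms", "type": ["long", "null"]},
        {"name": "_op", "type": "string"},
    ],
}

print(json.dumps(null_first(spark_style), indent=2))
```

Non-nullable fields such as _op are left untouched in this sketch, since a null default only makes sense for a union whose first branch is "null".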


was (Author: shivnarayan):
[~liujinhui] [~vbalaji]: Do you folks think this is still required after 
this fix [https://github.com/apache/hudi/pull/2765]? It fixes 
AvroConversionUtils.convertStructTypeToAvroSchema() to ensure that null is the 
first entry in the union and that the default value is set to null when a field 
is nullable in the Spark StructType. 

I ask because we have enabled the schema post-processor by default, so I wanted 
to double-check whether it is still applicable. 

> Add standard schema postprocessor which would rewrite the schema using 
> spark-avro conversion
> --------------------------------------------------------------------------------------------
>
>                 Key: HUDI-1343
>                 URL: https://issues.apache.org/jira/browse/HUDI-1343
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: DeltaStreamer
>            Reporter: Balaji Varadarajan
>            Assignee: liujinhui
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.7.0
>
>
> When we use a Transformer, the final schema that we use to convert Avro 
> records to bytes is auto-generated by Spark. This could differ (due to the 
> way Avro treats it) from the target schema that is used for writing, since 
> the target schema could be coming from a Schema Registry. 
>  
> For example, the 
> schema generated by spark-avro when converting a Row to Avro
> {
>   "type" : "record",
>   "name" : "hoodie_source",
>   "namespace" : "hoodie.source",
>   "fields" : [ {
>     "name" : "_ts_ms",
>     "type" : [ "long", "null" ]
>   }, {
>     "name" : "_op",
>     "type" : "string"
>   }, {
>     "name" : "inc_id",
>     "type" : "int"
>   }, {
>     "name" : "year",
>     "type" : [ "int", "null" ]
>   }, {
>     "name" : "violation_desc",
>     "type" : [ "string", "null" ]
>   }, {
>     "name" : "violation_code",
>     "type" : [ "string", "null" ]
>   }, {
>     "name" : "case_individual_id",
>     "type" : [ "int", "null" ]
>   }, {
>     "name" : "flag",
>     "type" : [ "string", "null" ]
>   }, {
>     "name" : "last_modified_ts",
>     "type" : "long"
>   } ]
> }
>  
> is not compatible with the Avro Schema:
>  
> {
>   "type" : "record",
>   "name" : "formatted_debezium_payload",
>   "fields" : [ {
>     "name" : "_ts_ms",
>     "type" : [ "null", "long" ],
>     "default" : null
>   }, {
>     "name" : "_op",
>     "type" : "string",
>     "default" : null
>   }, {
>     "name" : "inc_id",
>     "type" : "int",
>     "default" : null
>   }, {
>     "name" : "year",
>     "type" : [ "null", "int" ],
>     "default" : null
>   }, {
>     "name" : "violation_desc",
>     "type" : [ "null", "string" ],
>     "default" : null
>   }, {
>     "name" : "violation_code",
>     "type" : [ "null", "string" ],
>     "default" : null
>   }, {
>     "name" : "case_individual_id",
>     "type" : [ "null", "int" ],
>     "default" : null
>   }, {
>     "name" : "flag",
>     "type" : [ "null", "string" ],
>     "default" : null
>   }, {
>     "name" : "last_modified_ts",
>     "type" : "long",
>     "default" : null
>   } ]
> }
>  
> Note that the type order differs for individual fields: 
> "type" : [ "null", "string" ] vs. "type" : [ "string", "null" ]
> Unexpectedly, Avro decoding fails when bytes written with the first schema 
> are read using the second schema.
>  
> One way to fix this is to use the configured target schema when generating 
> record bytes, but this is not easy without breaking the Record payload 
> constructor API used by DeltaStreamer. 
> The other option is to apply a post-processor to the target schema to make 
> it consistent with the Transformer-generated records.
>  
> This ticket takes the latter approach: create a standard schema 
> post-processor and add it by default when a Transformer is used.
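
The decoding failure described above can be reproduced without any Avro library. In Avro's binary encoding, a union value is written as the zero-based index of its branch (a zigzag varint long) followed by the value itself, so the order of the branches is part of the wire format. The sketch below hand-rolls just enough of that encoding to show the mismatch. The function names are hypothetical, only "long" and "null" branches are handled, and it assumes the reader decodes with the second schema alone, without resolving against the writer's schema.

```python
def encode_long(n):
    """Avro binary long: zigzag encoding, then base-128 varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def decode_long(buf, pos):
    """Inverse of encode_long; returns (value, next_position)."""
    shift = 0
    z = 0
    while True:
        b = buf[pos]
        pos += 1
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1), pos


def encode_union(branches, value):
    """Write the branch index, then the value (nothing for null)."""
    branch = "null" if value is None else "long"
    index = branches.index(branch)
    payload = b"" if value is None else encode_long(value)
    return encode_long(index) + payload


def decode_union(branches, buf, pos=0):
    """Read the branch index, then decode according to THIS branch list."""
    index, pos = decode_long(buf, pos)
    if branches[index] == "null":
        return None, pos
    return decode_long(buf, pos)


writer_union = ["long", "null"]   # order produced by spark-avro
reader_union = ["null", "long"]   # order in the target schema

written = encode_union(writer_union, 1500)

same_schema = decode_union(writer_union, written)
other_schema = decode_union(reader_union, written)

print(written.hex(), same_schema, other_schema)
```

With the writer's own branch order the value round-trips and all three bytes are consumed. With the reordered union, index 0 is taken to mean null, so the reader returns None after one byte and the two remaining bytes are misread as the start of the next field. Writing a null makes it worse: the single index byte is read as "long follows", and the decoder runs past the end of the buffer.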



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
