[
https://issues.apache.org/jira/browse/HUDI-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325964#comment-17325964
]
sivabalan narayanan edited comment on HUDI-1343 at 4/28/21, 11:05 AM:
----------------------------------------------------------------------
[~liujinhui] [~vbalaji] [~nishith29]: Do you folks think this is still
required after this fix [https://github.com/apache/hudi/pull/2765]? It fixes
AvroConversionUtils.convertStructTypeToAvroSchema() to ensure that null is the
first entry in the union and that the default value is set to null when a field
is nullable in the Spark StructType.
I mean, we have enabled the schema post-processor by default, so I wanted to
double-check whether it's still applicable.
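
To make the question concrete, here is a rough sketch of the kind of rewrite described above: moving "null" to the front of each nullable union and adding a null default. It is plain Python over a schema dict, not Hudi's actual implementation. The helper name null_first is hypothetical, and it only handles top-level fields.

```python
import copy


def null_first(schema):
    """Return a copy of an Avro record schema (as a dict) in which every
    nullable top-level union lists "null" first and has a null default."""
    out = copy.deepcopy(schema)
    for field in out.get("fields", []):
        ftype = field["type"]
        if isinstance(ftype, list) and "null" in ftype:
            # Move "null" to the front; keep the other branches in order.
            field["type"] = ["null"] + [t for t in ftype if t != "null"]
            # With "null" as the first branch, a null default is valid.
            field["default"] = None
    return out


# Two fields taken from the spark-avro style schema quoted in the issue.
spark_style = {
    "type": "record",
    "name": "hoodie_source",
    "namespace": "hoodie.source",
    "fields": [
        {"name": "_ts_ms", "type": ["long", "null"]},
        {"name": "_op", "type": "string"},
    ],
}

normalized = null_first(spark_style)
print(normalized["fields"][0])
```

Non-nullable fields such as _op are left untouched, and the input schema is not modified.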
> Add standard schema postprocessor which would rewrite the schema using
> spark-avro conversion
> --------------------------------------------------------------------------------------------
>
> Key: HUDI-1343
> URL: https://issues.apache.org/jira/browse/HUDI-1343
> Project: Apache Hudi
> Issue Type: Improvement
> Components: DeltaStreamer
> Reporter: Balaji Varadarajan
> Assignee: liujinhui
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.7.0
>
>
> When we use a Transformer, the final schema which we use to convert Avro records
> to bytes is auto-generated by Spark. This could be different (due to the way
> Avro treats it) from the target schema that is being used to write (as the
> target schema could be coming from a Schema Registry).
>
> For example, the schema generated by spark-avro when converting a Row to Avro
> {
>   "type" : "record",
>   "name" : "hoodie_source",
>   "namespace" : "hoodie.source",
>   "fields" : [ {
>     "name" : "_ts_ms",
>     "type" : [ "long", "null" ]
>   }, {
>     "name" : "_op",
>     "type" : "string"
>   }, {
>     "name" : "inc_id",
>     "type" : "int"
>   }, {
>     "name" : "year",
>     "type" : [ "int", "null" ]
>   }, {
>     "name" : "violation_desc",
>     "type" : [ "string", "null" ]
>   }, {
>     "name" : "violation_code",
>     "type" : [ "string", "null" ]
>   }, {
>     "name" : "case_individual_id",
>     "type" : [ "int", "null" ]
>   }, {
>     "name" : "flag",
>     "type" : [ "string", "null" ]
>   }, {
>     "name" : "last_modified_ts",
>     "type" : "long"
>   } ]
> }
>
> is not compatible with the Avro Schema:
>
> {
>   "type" : "record",
>   "name" : "formatted_debezium_payload",
>   "fields" : [ {
>     "name" : "_ts_ms",
>     "type" : [ "null", "long" ],
>     "default" : null
>   }, {
>     "name" : "_op",
>     "type" : "string",
>     "default" : null
>   }, {
>     "name" : "inc_id",
>     "type" : "int",
>     "default" : null
>   }, {
>     "name" : "year",
>     "type" : [ "null", "int" ],
>     "default" : null
>   }, {
>     "name" : "violation_desc",
>     "type" : [ "null", "string" ],
>     "default" : null
>   }, {
>     "name" : "violation_code",
>     "type" : [ "null", "string" ],
>     "default" : null
>   }, {
>     "name" : "case_individual_id",
>     "type" : [ "null", "int" ],
>     "default" : null
>   }, {
>     "name" : "flag",
>     "type" : [ "null", "string" ],
>     "default" : null
>   }, {
>     "name" : "last_modified_ts",
>     "type" : "long",
>     "default" : null
>   } ]
> }
>
> Note that the type order is different for individual fields:
> "type" : [ "null", "string" ] vs. "type" : [ "string", "null" ]
> Unexpectedly, Avro decoding fails when bytes written with the first schema are
> read using the second schema.
>
> One way to fix this is to use the configured target schema when generating
> record bytes, but this is not easy without breaking the record payload
> constructor API used by DeltaStreamer.
> The other option is to apply a post-processor on the target schema to make it
> consistent with the Transformer-generated records.
>
> This ticket is to use the latter approach of creating a standard schema
> post-processor and adding it by default when a Transformer is used.
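
The decoding failure described in the quoted issue follows from how Avro's binary encoding handles unions: the writer emits the zero-based index of the chosen branch as a long, followed by the value itself. The sketch below re-implements just enough of that encoding in plain Python to show the effect. It is not the Avro library, all helper names are hypothetical, and it assumes the reader decodes with its own schema only, without the writer schema available for resolution.

```python
def write_long(n):
    """Encode an integer as an Avro long (zigzag, then base-128 varint)."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def read_long(buf, pos):
    """Decode an Avro long starting at pos; return (value, new_pos)."""
    shift = 0
    result = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1), pos


def write_union(branches, branch, value):
    """Write the branch index, then the value (only null/string here)."""
    data = write_long(branches.index(branch))
    if branch == "string":
        raw = value.encode("utf-8")
        data += write_long(len(raw)) + raw
    return data


def read_union(branches, buf, pos=0):
    """Read the branch index, then decode according to the reader's union."""
    index, pos = read_long(buf, pos)
    if branches[index] == "null":
        return None, pos
    length, pos = read_long(buf, pos)
    return buf[pos:pos + length].decode("utf-8"), pos + length


writer_union = ["string", "null"]   # order in the spark-avro schema
reader_union = ["null", "string"]   # order in the target schema

payload = write_union(writer_union, "string", "speeding")

same_value, same_pos = read_union(writer_union, payload)
wrong_value, wrong_pos = read_union(reader_union, payload)

print(same_value, same_pos == len(payload))
print(wrong_value, wrong_pos, len(payload))
```

Read with the writer's own ordering, the value round-trips and the whole payload is consumed. Read with the reversed ordering, index 0 is taken to mean null, so the string bytes are left in the stream and would be misinterpreted as the start of the next field.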