bvaradar commented on issue #2162:
URL: https://github.com/apache/hudi/issues/2162#issuecomment-708261108


   I think this is due to the way Spark deduces the Avro schema from the 
Rows produced by the Transformer (similar to 
https://github.com/apache/hudi/issues/2149#issuecomment-707624922).
   
   You can try changing the SchemaProvider so that the target schema is 
recreated using spark-avro, as in the snippet below. This would make it 
consistent with the schema of the Transformer-generated DataFrame.
   
   Schema newSchema = AvroConversionUtils.convertStructTypeToAvroSchema(
       AvroConversionUtils.convertAvroSchemaToStructType(schema),
       RowBasedSchemaProvider.HOODIE_RECORD_STRUCT_NAME,
       RowBasedSchemaProvider.HOODIE_RECORD_NAMESPACE);
   
   
   If you are using the master branch, we have added support for plugging in 
a SchemaPostProcessor (using the config 
hoodie.deltastreamer.schemaprovider.schema_post_processor=org.apache.hudi.utilities.schema.SparkAvroPostProcessor),
   where you can implement the processSchema() method to do the above 
transformation.
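   
   As a rough sketch of how that setting might be supplied, assuming the 
DeltaStreamer job reads its configuration from a properties file passed with 
--props (the comment line is illustrative; only the key and class name come 
from the text above):
   
   ```properties
   # Run the provider's schema through the Spark-Avro round trip before use
   hoodie.deltastreamer.schemaprovider.schema_post_processor=org.apache.hudi.utilities.schema.SparkAvroPostProcessor
   ```
   
   The same key=value pair can alternatively be passed on the command line 
with --hoodie-conf.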
   
   
   ```
   package org.apache.hudi.utilities.schema;
   
   import org.apache.hudi.AvroConversionUtils;
   import org.apache.hudi.common.config.TypedProperties;
   
   import org.apache.avro.Schema;
   import org.apache.spark.api.java.JavaSparkContext;
   
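   /**
    * Rebuilds the schema through a Spark StructType round trip so that it
    * matches the Avro schema Spark derives for a Transformer-generated DataFrame.
    */
   public class SparkAvroPostProcessor extends SchemaPostProcessor {
   
     // Public so the class can be instantiated from the class name given in the config.
     public SparkAvroPostProcessor(TypedProperties props, JavaSparkContext jssc) {
       super(props, jssc);
     }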
   
     @Override
     public Schema processSchema(Schema schema) {
       return AvroConversionUtils.convertStructTypeToAvroSchema(
           AvroConversionUtils.convertAvroSchemaToStructType(schema),
           RowBasedSchemaProvider.HOODIE_RECORD_STRUCT_NAME,
           RowBasedSchemaProvider.HOODIE_RECORD_NAMESPACE);
     }
   }
   ```
   
   
   If this works, I will open a PR (Jira: 
https://issues.apache.org/jira/browse/HUDI-1343)
   
   



