[GitHub] [hudi] bvaradar commented on a change in pull request #2012: [HUDI-1129] Deltastreamer Add support for schema evolution

GitBox Tue, 03 Nov 2020 05:38:50 -0800


bvaradar commented on a change in pull request #2012:
URL: https://github.com/apache/hudi/pull/2012#discussion_r516082122




##########
File path: hudi-spark/src/main/scala/org/apache/hudi/AvroConversionHelper.scala
##########
@@ -364,4 +366,40 @@ object AvroConversionHelper {
         }
     }
   }
+
+  /**
+   * Remove namespace from fixed field.
+   * org.apache.spark.sql.avro.SchemaConverters.toAvroType method adds namespace to fixed avro field
+   * https://github.com/apache/spark/blob/master/external/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala#L177
+   * So, we need to remove that namespace so that reader schema without namespace do not throw error like this one
+   * org.apache.avro.AvroTypeException: Found hoodie.source.hoodie_source.height.fixed, expecting fixed
+   *
+   * @param schema Schema from which namespace needs to be removed for fixed fields
+   * @return input schema with namespace removed for fixed fields, if any
+   */
+  def removeNamespaceFromFixedFields(schema: Schema): Schema  ={

Review comment:
       @n3nash : This might require a holistic look at how schema evolution is 
handled.
   
   As a last option before I let @n3nash decide how best to take in this 
change, @sathyaprakashg : since this is not a backwards-compatible change in 
the true sense (the underlying type is the same), can you try adding an 
additional step where we do a variant of 
https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkMergeHelper.java#L73
 
   
   In HoodieAvroDataBlock: 
   1. Use a genericReader with only the old schema. This will avoid schema 
evolution handling.
   2. Create a genericWriter and write the record back to bytes, this time 
with the new (updated) schema.
   3. Then use a genericReader (like 1) to read those bytes, but with the 
updated schema.
   
   Can you see if this works around the issue? If it does, then this needs to 
be a configuration-controlled feature when reading records from log records.
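
   A rough sketch of steps 1 to 3 using plain Avro, to make the idea concrete. 
This is not the actual HoodieAvroDataBlock code: the class name 
`FixedNamespaceRoundTrip` and the method `rewriteWithNewSchema` are 
hypothetical, and it assumes the old and new schemas have the same fields in 
the same positions and differ only in the names/namespaces of named types 
such as fixed fields.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

/** Hypothetical helper illustrating the three-step round trip. */
public class FixedNamespaceRoundTrip {

  /**
   * Re-reads bytes written with oldSchema so that the resulting record
   * carries newSchema, without going through Avro schema resolution.
   * Assumption: both schemas have positionally identical fields.
   */
  public static GenericRecord rewriteWithNewSchema(
      byte[] oldBytes, Schema oldSchema, Schema newSchema) throws IOException {

    // Step 1: read with the old schema only (writer schema == reader schema),
    // so no name matching between two schemas takes place.
    GenericDatumReader<GenericRecord> oldReader = new GenericDatumReader<>(oldSchema);
    BinaryDecoder oldDecoder = DecoderFactory.get().binaryDecoder(oldBytes, null);
    GenericRecord oldRecord = oldReader.read(null, oldDecoder);

    // Step 2: write the record back to bytes with the new (updated) schema.
    // The generic writer walks the fields of newSchema by position.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(newSchema).write(oldRecord, encoder);
    encoder.flush();

    // Step 3: read again as in step 1, but with the updated schema only.
    GenericDatumReader<GenericRecord> newReader = new GenericDatumReader<>(newSchema);
    BinaryDecoder newDecoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    return newReader.read(null, newDecoder);
  }
}
```

   Since this costs an extra encode and decode per record, it is the kind of 
thing that would sit behind the configuration flag mentioned above.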

##########
File path: hudi-common/src/main/java/org/apache/hudi/common/model/BaseAvroPayload.java
##########
@@ -39,13 +40,19 @@
    */
   protected final Comparable orderingVal;
 
+  /**
+   * Schema used to convert avro to bytes.
+   */
+  protected final Schema writerSchema;

Review comment:
       You can introduce another base class, BaseAvroPayloadWithSchema, which 
extends BaseAvroPayload and stores the schema. This will be the base class 
for any new implementation which needs to store the schema as part of the 
payload.




