nbalajee commented on a change in pull request #2309:
URL: https://github.com/apache/hudi/pull/2309#discussion_r541180759
##########
File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
##########
@@ -308,17 +309,79 @@ public static GenericRecord rewriteRecordWithOnlyNewSchemaFields(GenericRecord r
return rewrite(record, new LinkedHashSet<>(newSchema.getFields()), newSchema);
}
+ private static void setDefaultVal(GenericRecord newRecord, Schema.Field f) {
+ if (f.defaultVal() instanceof JsonProperties.Null) {
+ newRecord.put(f.name(), null);
+ } else {
+ newRecord.put(f.name(), f.defaultVal());
+ }
+ }
+
+ /*
+ * OldRecord:                    NewRecord:
+ *   field1 : String               field1 : String
+ *   field2 : record               field2 : record
+ *     field_21 : string             field_21 : string
+ *     field_22 : Integer            field_22 : Integer
+ *   field3: Integer                 field_23 : String
+ *                                   field_24 : Integer
+ *                                 field3: Integer
+ *
+ * When a nested record has changed/evolved, newRecord.put(field2, oldRecord.get(field2)), is not sufficient.
+ * Requires a deep-copy/rewrite of the evolved field.
+ */
+ private static Object rewriteEvolvedFields(Object datum, Schema newSchema) {
+ switch (newSchema.getType()) {
+ case RECORD:
+ if (!(datum instanceof GenericRecord)) {
+ return datum;
+ }
+ GenericRecord record = (GenericRecord) datum;
+ // if schema of the record being rewritten does not match
+ // with the new schema, some nested records with schema change
+ // will require rewrite.
+ if (!record.getSchema().equals(newSchema)) {
+ GenericRecord newRecord = new GenericData.Record(newSchema);
+ for (Schema.Field f : newSchema.getFields()) {
+ if (record.get(f.name()) == null) {
+ setDefaultVal(newRecord, f);
+ } else {
+ newRecord.put(f.name(), rewriteEvolvedFields(record.get(f.name()), f.schema()));
+ }
+ }
+ return newRecord;
+ }
+ return datum;
+ case UNION:
+ Integer idx = (newSchema.getTypes().get(0).getType() == Schema.Type.NULL) ? 1 : 0;
+ return rewriteEvolvedFields(datum, newSchema.getTypes().get(idx));
Review comment:
Added two test cases.
UNION is predominantly used for the optional-record pattern, [null, {record}].
In the next step of the recursion, the RECORD case performs the schema
equivalence check, so I thought we would not need a separate equivalence
check in the UNION case. Please let me know if I missed something here.
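
To make the reasoning above concrete, here is a minimal sketch of the unwrap-then-recurse flow. It does not use Avro: `UnionRewriteSketch`, `Schema`, `Rec`, `with`, and `rewrite` are hypothetical stand-ins invented for illustration, and the sketch fills missing fields with null instead of applying schema defaults. It assumes, as the comment does, that a union is either [null, {record}] or has the record as its first branch.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-ins for Avro's Schema and GenericRecord; NOT the Avro API.
final class UnionRewriteSketch {
  enum Type { NULL, STRING, INT, RECORD, UNION }

  static final class Schema {
    final Type type;
    final String name;                // record name; null for other types
    final Map<String, Schema> fields; // RECORD only, in declaration order
    final List<Schema> branches;      // UNION only

    private Schema(Type type, String name, Map<String, Schema> fields, List<Schema> branches) {
      this.type = type;
      this.name = name;
      this.fields = fields;
      this.branches = branches;
    }

    static Schema of(Type type) {
      return new Schema(type, null, Collections.<String, Schema>emptyMap(), Collections.<Schema>emptyList());
    }

    static Schema record(String name) {
      return new Schema(Type.RECORD, name, new LinkedHashMap<String, Schema>(), Collections.<Schema>emptyList());
    }

    static Schema union(Schema... branches) {
      return new Schema(Type.UNION, null, Collections.<String, Schema>emptyMap(), Arrays.asList(branches));
    }

    // Returns a copy of this record schema with one more field appended.
    Schema with(String fieldName, Schema fieldSchema) {
      Map<String, Schema> copy = new LinkedHashMap<>(fields);
      copy.put(fieldName, fieldSchema);
      return new Schema(type, name, copy, branches);
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof Schema)) {
        return false;
      }
      Schema s = (Schema) o;
      return type == s.type && Objects.equals(name, s.name)
          && fields.equals(s.fields) && branches.equals(s.branches);
    }

    @Override
    public int hashCode() {
      return Objects.hash(type, name, fields, branches);
    }
  }

  static final class Rec {
    final Schema schema;
    final Map<String, Object> values = new LinkedHashMap<>();

    Rec(Schema schema) { this.schema = schema; }

    Rec put(String field, Object value) { values.put(field, value); return this; }

    Object get(String field) { return values.get(field); }
  }

  static Object rewrite(Object datum, Schema newSchema) {
    switch (newSchema.type) {
      case RECORD:
        if (!(datum instanceof Rec)) { return datum; }
        Rec old = (Rec) datum;
        // The schema equivalence check lives here, in the RECORD case.
        if (old.schema.equals(newSchema)) { return datum; }
        Rec out = new Rec(newSchema);
        for (Map.Entry<String, Schema> f : newSchema.fields.entrySet()) {
          Object v = old.get(f.getKey());
          // Simplification: missing fields become null; defaults are not modeled.
          out.put(f.getKey(), v == null ? null : rewrite(v, f.getValue()));
        }
        return out;
      case UNION:
        // No equivalence check here: pick the non-null branch and recurse.
        int idx = newSchema.branches.get(0).type == Type.NULL ? 1 : 0;
        return rewrite(datum, newSchema.branches.get(idx));
      default:
        return datum;
    }
  }

  public static void main(String[] args) {
    Schema oldInner = Schema.record("field2").with("field_21", Schema.of(Type.STRING));
    Schema newInner = oldInner.with("field_23", Schema.of(Type.STRING));
    Schema optional = Schema.union(Schema.of(Type.NULL), newInner);
    Rec rewritten = (Rec) rewrite(new Rec(oldInner).put("field_21", "a"), optional);
    System.out.println(rewritten.values);
  }
}
```

In this sketch a datum that already carries the new record schema comes back as the same object, because the RECORD case short-circuits on equality. A union with several non-null branches would not be handled, which is the case the reviewer's question appears to be probing.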