nsivabalan commented on code in PR #11381:
URL: https://github.com/apache/hudi/pull/11381#discussion_r1623751675
##########
hudi-common/src/main/java/org/apache/hudi/internal/schema/utils/AvroSchemaEvolutionUtils.java:
##########
@@ -113,6 +120,21 @@ public static InternalSchema reconcileSchema(Schema
incomingSchema, InternalSche
typeChange.updateColumnType(col, inComingInternalSchema.findType(col));
});
+ // mark columns missing from incoming schema as nullable
+ Set<String> visited = new HashSet<>();
+ diffFromOldSchema.stream()
+ // ignore meta fields
+ .filter(col -> !META_FIELD_NAMES.contains(col))
+ .sorted()
+ .forEach(col -> {
+ // if parent is marked as nullable, only update the parent and not
all the missing children field
+ String parent = TableChangesHelper.getParentName(col);
+ if (!visited.contains(parent)) {
+ typeChange.updateColumnNullability(col, true);
+ }
+ visited.add(col);
+ });
Review Comment:
I do see that we call this only when the config value is true within
HoodieSchemaUtils. but I am trying to be future proof and avoid mis-use of the
method.
##########
hudi-common/src/main/java/org/apache/hudi/internal/schema/utils/AvroSchemaEvolutionUtils.java:
##########
@@ -113,6 +120,21 @@ public static InternalSchema reconcileSchema(Schema
incomingSchema, InternalSche
typeChange.updateColumnType(col, inComingInternalSchema.findType(col));
});
+ // mark columns missing from incoming schema as nullable
+ Set<String> visited = new HashSet<>();
+ diffFromOldSchema.stream()
+ // ignore meta fields
+ .filter(col -> !META_FIELD_NAMES.contains(col))
+ .sorted()
+ .forEach(col -> {
+ // if parent is marked as nullable, only update the parent and not
all the missing children field
+ String parent = TableChangesHelper.getParentName(col);
+ if (!visited.contains(parent)) {
+ typeChange.updateColumnNullability(col, true);
+ }
+ visited.add(col);
+ });
Review Comment:
hey @the-other-tim-brown : where is the check for
"hoodie.write.set.null.for.missing.columns". I don't see it being used or
checked within AvroSchemaEvolutionUtils.reconcileSchema()
I am wondering how can we be sure that all callers will be calling this
method only when "hoodie.write.set.null.for.missing.columns" is set to true.
##########
hudi-common/src/main/java/org/apache/hudi/internal/schema/utils/AvroSchemaEvolutionUtils.java:
##########
@@ -113,6 +120,21 @@ public static InternalSchema reconcileSchema(Schema
incomingSchema, InternalSche
typeChange.updateColumnType(col, inComingInternalSchema.findType(col));
});
+ // mark columns missing from incoming schema as nullable
+ Set<String> visited = new HashSet<>();
Review Comment:
hey @jonvex : when we added support for
"hoodie.write.set.null.for.missing.columns", how did we miss this. the java
docs for this method does call out that nulls will be injected for missing
cols. but prior to this patch, I don't see that happening.
##########
hudi-common/src/main/java/org/apache/hudi/internal/schema/utils/AvroSchemaEvolutionUtils.java:
##########
@@ -113,6 +120,21 @@ public static InternalSchema reconcileSchema(Schema
incomingSchema, InternalSche
typeChange.updateColumnType(col, inComingInternalSchema.findType(col));
});
+ // mark columns missing from incoming schema as nullable
+ Set<String> visited = new HashSet<>();
Review Comment:
do you think we can rename 2 variable names to make it more comprehensible
or w/o ambiguity.
diffFromOldSchema -> missingFromIncoming
diffFromEvolutionColumns -> additionsToTableSchema or additionsToOldSchema
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]