yihua commented on code in PR #10772:
URL: https://github.com/apache/hudi/pull/10772#discussion_r1506449192


##########
hudi-utilities/src/main/java/org/apache/hudi/utilities/schema/KafkaOffsetPostProcessor.java:
##########
@@ -65,16 +63,29 @@ public KafkaOffsetPostProcessor(TypedProperties props, 
JavaSparkContext jssc) {
   @Override
   public Schema processSchema(Schema schema) {
     // this method adds kafka offset fields namely source offset, partition, 
timestamp and kafka message key to the schema of the batch.
+    List<Schema.Field> fieldList = schema.getFields();
+    Set<String> fieldNames = 
fieldList.stream().map(Schema.Field::name).collect(Collectors.toSet());
+    // if the source schema already contains the kafka offset fields, then 
return the schema as is.
+    if (fieldNames.containsAll(Arrays.asList(KAFKA_SOURCE_OFFSET_COLUMN, 
KAFKA_SOURCE_PARTITION_COLUMN, KAFKA_SOURCE_TIMESTAMP_COLUMN, 
KAFKA_SOURCE_KEY_COLUMN))) {
+      return schema;
+    }
     try {
-      List<Schema.Field> fieldList = schema.getFields();
       List<Schema.Field> newFieldList = fieldList.stream()
           .map(f -> new Schema.Field(f.name(), f.schema(), f.doc(), 
f.defaultVal())).collect(Collectors.toList());
-      newFieldList.add(new Schema.Field(KAFKA_SOURCE_OFFSET_COLUMN, 
Schema.create(Schema.Type.LONG), "offset column", 0));
-      newFieldList.add(new Schema.Field(KAFKA_SOURCE_PARTITION_COLUMN, 
Schema.create(Schema.Type.INT), "partition column", 0));
-      newFieldList.add(new Schema.Field(KAFKA_SOURCE_TIMESTAMP_COLUMN, 
Schema.create(Schema.Type.LONG), "timestamp column", 0));
-      newFieldList.add(new Schema.Field(KAFKA_SOURCE_KEY_COLUMN, 
createNullableSchema(Schema.Type.STRING), "kafka key column", 
JsonProperties.NULL_VALUE));
-      Schema newSchema = Schema.createRecord(schema.getName() + "_processed", 
schema.getDoc(), schema.getNamespace(), false, newFieldList);
-      return newSchema;
+      // handle case where source schema provider may have already set 1 or 
more of these fields
+      if (!fieldNames.contains(KAFKA_SOURCE_OFFSET_COLUMN)) {

Review Comment:
   Can you add a simple test around the changes?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to