ThinkerLei commented on code in PR #10727:
URL: https://github.com/apache/hudi/pull/10727#discussion_r1524571267


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/HoodieMergeHelper.java:
##########
@@ -202,7 +202,9 @@ private Option<Function<HoodieRecord, HoodieRecord>> composeSchemaEvolutionTrans
       Schema newWriterSchema = AvroInternalSchemaConverter.convert(mergedSchema, writerSchema.getFullName());
       Schema writeSchemaFromFile = AvroInternalSchemaConverter.convert(writeInternalSchema, newWriterSchema.getFullName());
       boolean needToReWriteRecord = sameCols.size() != colNamesFromWriteSchema.size()
-          || SchemaCompatibility.checkReaderWriterCompatibility(newWriterSchema, writeSchemaFromFile).getType() == org.apache.avro.SchemaCompatibility.SchemaCompatibilityType.COMPATIBLE;
+          && SchemaCompatibility.checkReaderWriterCompatibility(newWriterSchema, writeSchemaFromFile).getType()
+          == org.apache.avro.SchemaCompatibility.SchemaCompatibilityType.COMPATIBLE;
+

Review Comment:
   > @danny0405 @ThinkerLei I thought about it again. `sameCols.size() == colNamesFromWriteSchema.size()` only happens in the following scenario: the table has new columns, while the old columns have not been changed (renamed, type changed), e.g.:
   > 
   > ```
   > write schema: a string, b int, c long
   > read schema: a string, b int, c long, d int
   > ```
   > 
   > In this case `sameCols.size() == colNamesFromWriteSchema.size()`, and writeSchema is equivalent to a pruned readSchema.
   > 
   > However, some versions of Avro, such as Avro 1.8.x, may report errors when using pruned schemas to read Avro files (Avro 1.10.x has no such problem).
   > 
   > Therefore, even if `sameCols.size() == colNamesFromWriteSchema.size()`, we still need to check the compatibility of the read-write schema. If it is compatible, we can directly use this writeSchema to read the Avro data.
   > 
   > Maybe we can use the following logic to avoid an unnecessary rewrite:
   > 
   > ```
   > boolean needToReWriteRecord = sameCols.size() != colNamesFromWriteSchema.size()
   >     || SchemaCompatibility.checkReaderWriterCompatibility(newWriterSchema, writeSchemaFromFile).getType()
   >         != org.apache.avro.SchemaCompatibility.SchemaCompatibilityType.COMPATIBLE;
   > ```
   
   This was the logic of my initial fix. Do I still need to make changes based on this PR? cc @danny0405 @xiarixiaoyao 
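   The decision the thread converges on — rewrite records only when the column sets differ, or when they match but the pruned write schema is not read-compatible — can be sketched in plain Java. The nested enum below is a hypothetical stand-in for Avro's `SchemaCompatibility.SchemaCompatibilityType`; the Hudi and Avro classes themselves are not reproduced here.

   ```java
   public class RewriteDecision {

     // Hypothetical stand-in for org.apache.avro.SchemaCompatibility.SchemaCompatibilityType.
     enum CompatType { COMPATIBLE, INCOMPATIBLE }

     // Rewrite only when the column counts differ (rename / type change / drop),
     // OR when they match but the pruned write schema is not read-compatible
     // (e.g. Avro 1.8.x rejecting pruned schemas, as noted in the comment above).
     static boolean needToRewrite(int sameColsSize, int writeSchemaColsSize, CompatType compat) {
       return sameColsSize != writeSchemaColsSize || compat != CompatType.COMPATIBLE;
     }

     public static void main(String[] args) {
       // Columns match and schemas are compatible: no rewrite needed.
       System.out.println(needToRewrite(3, 3, CompatType.COMPATIBLE));   // false
       // Columns match but an older Avro rejects the pruned schema: rewrite.
       System.out.println(needToRewrite(3, 3, CompatType.INCOMPATIBLE)); // true
       // Column counts differ: rewrite.
       System.out.println(needToRewrite(2, 3, CompatType.COMPATIBLE));   // true
     }
   }
   ```

   This matches the negated form of the original buggy condition: the `||` short-circuits the (potentially expensive) compatibility check whenever the column counts already differ.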



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
