Re: [PR] [HUDI-7436] Fix the conditions for determining whether the records need to be rewritten [hudi]

via GitHub Thu, 14 Mar 2024 00:09:52 -0700


xiarixiaoyao commented on code in PR #10727:
URL: https://github.com/apache/hudi/pull/10727#discussion_r1524335926



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/HoodieMergeHelper.java:
##########
@@ -202,7 +202,9 @@ private Option<Function<HoodieRecord, HoodieRecord>> composeSchemaEvolutionTrans
       Schema newWriterSchema = AvroInternalSchemaConverter.convert(mergedSchema, writerSchema.getFullName());
       Schema writeSchemaFromFile = AvroInternalSchemaConverter.convert(writeInternalSchema, newWriterSchema.getFullName());
       boolean needToReWriteRecord = sameCols.size() != colNamesFromWriteSchema.size()
-          || SchemaCompatibility.checkReaderWriterCompatibility(newWriterSchema, writeSchemaFromFile).getType() == org.apache.avro.SchemaCompatibility.SchemaCompatibilityType.COMPATIBLE;
+          && SchemaCompatibility.checkReaderWriterCompatibility(newWriterSchema, writeSchemaFromFile).getType()
+          == org.apache.avro.SchemaCompatibility.SchemaCompatibilityType.COMPATIBLE;
+

Review Comment:
   @danny0405  @ThinkerLei
   I thought about this again.
   `sameCols.size() == colNamesFromWriteSchema.size()` only happens in the following scenario: the table has new columns, while the old columns have not been changed (no rename, no type change).
   e.g.:
   ```
   write schema: a string, b int, c long
   read schema: a string, b int, c long, d int
   ```
   In this case, `sameCols.size() == colNamesFromWriteSchema.size()`, and the write schema is equivalent to a pruned read schema.
   
   However, some versions of Avro, such as Avro 1.8.x, may report errors when using pruned schemas to read Avro files (Avro 1.10.x has no such problem).

   Therefore, even if `sameCols.size() == colNamesFromWriteSchema.size()`, we still need to check the compatibility of the read and write schemas. If they are compatible, we can directly use this write schema to read the Avro data.
   
   
   So maybe we can use the following logic to avoid unnecessary rewrites:
   ```
   boolean needToReWriteRecord = sameCols.size() != colNamesFromWriteSchema.size()
       || SchemaCompatibility.checkReaderWriterCompatibility(newWriterSchema, writeSchemaFromFile).getType()
       != org.apache.avro.SchemaCompatibility.SchemaCompatibilityType.COMPATIBLE;
   ```
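
The decision proposed above can be sketched as a small truth-table check. This is only an illustration, not the Hudi implementation: `RewriteDecisionSketch` and `needToRewriteRecord` are hypothetical names, columns are matched by name only (an assumption; no field IDs or types are compared), and the Avro compatibility result is passed in as a plain boolean instead of calling `SchemaCompatibility.checkReaderWriterCompatibility`.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model of the rewrite decision discussed above.
final class RewriteDecisionSketch {

    // Returns true when records read from the file must be rewritten.
    // colNamesFromWriteSchema: columns of the schema stored in the file.
    // colNamesFromReadSchema:  columns of the merged (read) schema.
    // readerWriterCompatible:  stand-in for the Avro compatibility check result.
    static boolean needToRewriteRecord(List<String> colNamesFromWriteSchema,
                                       List<String> colNamesFromReadSchema,
                                       boolean readerWriterCompatible) {
        // Assumption: a column counts as "same" when its name exists in both schemas.
        Set<String> sameCols = new HashSet<>(colNamesFromWriteSchema);
        sameCols.retainAll(colNamesFromReadSchema);

        // Rewrite if some file column changed, or if the schemas are not compatible.
        return sameCols.size() != colNamesFromWriteSchema.size() || !readerWriterCompatible;
    }

    public static void main(String[] args) {
        List<String> write = Arrays.asList("a", "b", "c");
        List<String> read = Arrays.asList("a", "b", "c", "d");

        // Only a new column was added and the schemas are compatible: no rewrite.
        System.out.println(needToRewriteRecord(write, read, true));   // false
        // Same columns, but the compatibility check fails: rewrite.
        System.out.println(needToRewriteRecord(write, read, false));  // true
        // A file column was renamed, so it is no longer in the read schema: rewrite.
        System.out.println(needToRewriteRecord(
            write, Arrays.asList("a", "b_renamed", "c", "d"), true)); // true
    }
}
```

With `||` and a negated compatibility check, the rewrite is skipped only when both conditions hold: no existing column changed and the two schemas are compatible.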
   
   



