jonvex commented on code in PR #12006:
URL: https://github.com/apache/hudi/pull/12006#discussion_r1778803886
##########
hudi-common/src/test/java/org/apache/hudi/common/table/read/TestHoodieFileGroupReaderBase.java:
##########
@@ -107,73 +104,6 @@ public void validateRecordsInFileGroup(String tablePath,
validateRecordsInFileGroup(tablePath, actualRecordList, schema, fileSlice,
false);
}
- public abstract Comparable getComparableUTF8String(String value);
-
- @Test
- public void testCompareToComparable() throws Exception {
Review Comment:
This test covered the jerry-rigged comparison method that Ethan
created as a stopgap. Now we use Comparable#compareTo() directly.
The problem we are correcting is mostly a schema evolution issue (the
other reason is explained at the bottom).
Right now schema evolution handling is disjointed, and we should
eventually try to move all of the logic into the file group reader (though
reading the file footer a single time in an engine-agnostic way will be a
challenge).
We evolve the base file in the Spark3XParquetReader classes using
Spark3ParquetSchemaEvolutionUtils to cast fields to the reader type and add
missing fields. For Avro log files, evolution is mostly handled for us, but
in HoodieAvroDataBlock we still need extra handling for type promotion to
string, which is why we call recordNeedsRewriteForExtendedAvroTypePromotion
and, if such promotions are needed, HoodieAvroUtils.rewriteRecordWithNewSchema.
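As a rough illustration only (this is not Hudi's actual code; the
simplified map-based record shape and field names here are made up), the
extra handling amounts to detecting a field the reader schema declares as
string but whose stored value is still numeric, and rewriting the value
before the record is handed onward:

```java
import java.util.HashMap;
import java.util.Map;

public class TypePromotionSketch {
  // Hypothetical simplified "record": field name -> value.
  // Simulates promoting a numeric field to string when the reader schema
  // declares the field as string but the writer stored a number.
  static Map<String, Object> rewriteForStringPromotion(
      Map<String, Object> record, Map<String, String> readerTypes) {
    Map<String, Object> rewritten = new HashMap<>(record);
    for (Map.Entry<String, Object> e : record.entrySet()) {
      String readerType = readerTypes.get(e.getKey());
      if ("string".equals(readerType) && !(e.getValue() instanceof String)) {
        // promote the numeric value to its string representation
        rewritten.put(e.getKey(), String.valueOf(e.getValue()));
      }
    }
    return rewritten;
  }

  public static void main(String[] args) {
    Map<String, Object> record = new HashMap<>();
    record.put("id", 42L);           // written as long
    Map<String, String> readerTypes = new HashMap<>();
    readerTypes.put("id", "string"); // reader schema evolved to string
    System.out.println(rewriteForStringPromotion(record, readerTypes).get("id"));
  }
}
```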
But for delete blocks, we currently do not apply any schema evolution at
all. We should consider preventing users from evolving the precombine
field; but since we currently do not, it is likely that users have been
doing it, and we need to support it.
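A minimal, self-contained illustration of why that matters (the values
here are made up): if the precombine field was promoted from long to
string, a delete block written before the evolution still carries the
old-typed ordering value, and a raw Comparable comparison across that type
boundary fails at runtime:

```java
public class PrecombineMismatchSketch {
  public static void main(String[] args) {
    Comparable deleteBlockOrderingVal = 100L; // written before evolution
    Comparable baseFileOrderingVal = "150";   // after promotion to string
    try {
      // raw comparison across the evolved type boundary
      int cmp = deleteBlockOrderingVal.compareTo(baseFileOrderingVal);
      System.out.println("compared: " + cmp);
    } catch (ClassCastException e) {
      // Long.compareTo rejects a String argument; the delete block's
      // ordering value must be cast to the evolved type before comparing
      System.out.println("ClassCastException");
    }
  }
}
```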
MERGE INTO (MIT) issue explained:
There is a MERGE INTO edge case that we probably also should fix in the
writer at some point, but it's pretty tricky. Here is a ticket about the
issue: https://issues.apache.org/jira/browse/HUDI-8257. Basically, the
expression payload runs Spark SQL evaluations (PAYLOAD_DELETE_CONDITION,
PAYLOAD_UPDATE_CONDITION_AND_ASSIGNMENTS,
PAYLOAD_INSERT_CONDITION_AND_ASSIGNMENTS), and they contain casts of the
input data, so we can't modify the schema of the input data or the
assignments won't work. The precombine is stored by itself in Hudi records,
so it is using the uncasted field value. The question is: do we cast the
input df and modify the conditions in the MERGE INTO command, or do we just
cast the precombine in the writer?
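To make the second option concrete (a sketch only; castToTableType and the
supported types are assumptions for illustration, not the actual writer
code), casting just the precombine in the writer would mean normalizing the
ordering value to the table's declared type before it is stored, leaving
the input rows and the MERGE INTO assignments untouched:

```java
public class PrecombineCastSketch {
  // Hypothetical helper: coerce the raw ordering value to the table's
  // declared precombine type before it is stored in the record.
  static Comparable<?> castToTableType(Object rawValue, String tableType) {
    switch (tableType) {
      case "string":
        return String.valueOf(rawValue);
      case "long":
        return Long.parseLong(String.valueOf(rawValue));
      default:
        throw new IllegalArgumentException("unsupported type: " + tableType);
    }
  }

  public static void main(String[] args) {
    // the input df still carries the pre-evolution long value
    Object raw = 100L;
    Comparable<?> stored = castToTableType(raw, "string");
    System.out.println(stored.getClass().getSimpleName() + ":" + stored);
  }
}
```

The alternative (casting the input df and rewriting the conditions in the
MERGE INTO command itself) keeps everything consistently typed but touches
the SQL evaluation path, which is why the ticket calls it tricky.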
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]