jonvex commented on code in PR #12006:
URL: https://github.com/apache/hudi/pull/12006#discussion_r1778803886
##########
hudi-common/src/test/java/org/apache/hudi/common/table/read/TestHoodieFileGroupReaderBase.java:
##########
@@ -107,73 +104,6 @@ public void validateRecordsInFileGroup(String tablePath,
validateRecordsInFileGroup(tablePath, actualRecordList, schema, fileSlice,
false);
}
- public abstract Comparable getComparableUTF8String(String value);
-
- @Test
- public void testCompareToComparable() throws Exception {
Review Comment:
This test covered the jerry-rigged comparison method that Ethan
created as a stopgap. Now we use Comparable#compareTo() directly.
The problem we are correcting is mostly a schema evolution issue (the
other reason is explained at the bottom).
Right now schema evolution handling is disjointed, and we should
eventually try to move all of the logic into the file group reader (though
reading the file footer a single time in an engine-agnostic way will be a
challenge).
We evolve the base file in the Spark3XParquetReader classes using
Spark3ParquetSchemaEvolutionUtils to cast fields to the reader type and add
missing fields. For Avro log files, evolution is mostly handled for us, but
in HoodieAvroDataBlock we still need extra handling for type promotion to
string, which is why we call recordNeedsRewriteForExtendedAvroTypePromotion
and, if such promotions are needed, HoodieAvroUtils.rewriteRecordWithNewSchema.
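As a rough illustration only (this is not Hudi's actual code; the
simplified map-based record shape and field names here are made up), the
extra handling amounts to detecting a field the reader schema declares as
string but whose stored value is still numeric, and rewriting the value
before the record is handed onward:

```java
import java.util.HashMap;
import java.util.Map;

public class TypePromotionSketch {
  // Hypothetical simplified "record": field name -> value.
  // Simulates promoting a numeric field to string when the reader schema
  // declares the field as string but the writer stored a number.
  static Map<String, Object> rewriteForStringPromotion(
      Map<String, Object> record, Map<String, String> readerTypes) {
    Map<String, Object> rewritten = new HashMap<>(record);
    for (Map.Entry<String, Object> e : record.entrySet()) {
      String readerType = readerTypes.get(e.getKey());
      if ("string".equals(readerType) && !(e.getValue() instanceof String)) {
        // promote the numeric value to its string representation
        rewritten.put(e.getKey(), String.valueOf(e.getValue()));
      }
    }
    return rewritten;
  }

  public static void main(String[] args) {
    Map<String, Object> record = new HashMap<>();
    record.put("id", 42L);           // written as long
    Map<String, String> readerTypes = new HashMap<>();
    readerTypes.put("id", "string"); // reader schema evolved to string
    System.out.println(rewriteForStringPromotion(record, readerTypes).get("id"));
  }
}
```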
But for delete blocks, we currently do not apply any schema evolution at
all. We should consider preventing users from evolving the precombine
field; but since we currently do not, it is likely that users have been
doing it, and we need to support it.
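A minimal, self-contained illustration of why that matters (the values
here are made up): if the precombine field was promoted from long to
string, a delete block written before the evolution still carries the
old-typed ordering value, and a raw Comparable comparison across that type
boundary fails at runtime:

```java
public class PrecombineMismatchSketch {
  public static void main(String[] args) {
    Comparable deleteBlockOrderingVal = 100L; // written before evolution
    Comparable baseFileOrderingVal = "150";   // after promotion to string
    try {
      // raw comparison across the evolved type boundary
      int cmp = deleteBlockOrderingVal.compareTo(baseFileOrderingVal);
      System.out.println("compared: " + cmp);
    } catch (ClassCastException e) {
      // Long.compareTo rejects a String argument; the delete block's
      // ordering value must be cast to the evolved type before comparing
      System.out.println("ClassCastException");
    }
  }
}
```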
MERGE INTO (MIT) issue explained:
There is a MERGE INTO edge case that we probably also should fix in the
writer at some point, but it's pretty tricky. Here is a ticket about the
issue: https://issues.apache.org/jira/browse/HUDI-8257. Basically, the
expression payload runs Spark SQL evaluations (PAYLOAD_DELETE_CONDITION,
PAYLOAD_UPDATE_CONDITION_AND_ASSIGNMENTS,
PAYLOAD_INSERT_CONDITION_AND_ASSIGNMENTS), and they contain casts of the
input data, so we can't modify the schema of the input data or the
assignments won't work. The precombine is stored by itself in Hudi records,
so it is using the uncasted field value. The question is: do we cast the
input df and modify the conditions in the MERGE INTO command, or do we just
cast the precombine in the writer?
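To make the second option concrete (a sketch only; castToTableType and the
supported types are assumptions for illustration, not the actual writer
code), casting just the precombine in the writer would mean normalizing the
ordering value to the table's declared type before it is stored, leaving
the input rows and the MERGE INTO assignments untouched:

```java
public class PrecombineCastSketch {
  // Hypothetical helper: coerce the raw ordering value to the table's
  // declared precombine type before it is stored in the record.
  static Comparable<?> castToTableType(Object rawValue, String tableType) {
    switch (tableType) {
      case "string":
        return String.valueOf(rawValue);
      case "long":
        return Long.parseLong(String.valueOf(rawValue));
      default:
        throw new IllegalArgumentException("unsupported type: " + tableType);
    }
  }

  public static void main(String[] args) {
    // the input df still carries the pre-evolution long value
    Object raw = 100L;
    Comparable<?> stored = castToTableType(raw, "string");
    System.out.println(stored.getClass().getSimpleName() + ":" + stored);
  }
}
```

The alternative (casting the input df and rewriting the conditions in the
MERGE INTO command itself) keeps everything consistently typed but touches
the SQL evaluation path, which is why the ticket calls it tricky.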
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]