[jira] [Commented] (HUDI-8521) Resolve issues w/ diff merge modes

Y Ethan Guo (Jira) Tue, 19 Nov 2024 16:38:25 -0800


    [ 
https://issues.apache.org/jira/browse/HUDI-8521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17899562#comment-17899562
 ]


Y Ethan Guo commented on HUDI-8521:
-----------------------------------

tl;dr: preCombine is supposed to be used only for deduping records within the 
same batch. It should not be used in log merging, which is intended to pick up 
updates across batches.

We found one issue with the OVERWRITE_WITH_LATEST merge mode where the wrong 
record is picked after compaction in a MOR table.

This is related to the "preCombine" behavior introduced in 
[Pull Request: HUDI-6368 Strength avro record 
merger|https://github.com/apache/hudi/pull/8953].
 
This PR introduces a way to differentiate "preCombine" and 
"combineAndGetUpdateValue" logic corresponding to the payload class 
implementation in the new record merger.
 
preCombine is supposed to be used for deduping records within the same batch 
only.
 
The API HoodieRecordUtils.mergerToPreCombineMode provides the merger to use for 
"preCombine" only. That is, calling asPreCombiningMode returns a separate 
precombine merger, HoodiePreCombineAvroRecordMerger.INSTANCE, which is used 
instead of the original merger.
 
Note that only a merger implementing OperationModeAwareness will leverage the 
logic above to differentiate "preCombine" and "combineAndGetUpdateValue" logic. 
A merger that does not implement OperationModeAwareness has the same merging 
behavior across "preCombine" and "combineAndGetUpdateValue".
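
The dispatch described above can be sketched in a few lines of Java. This is 
only an illustration under stated assumptions: SketchMerger, SketchModeAware, 
AwareMerger, PlainMerger and toPreCombineMode are hypothetical stand-ins 
invented for this example, not the actual Hudi interfaces or the real 
HoodieRecordUtils.mergerToPreCombineMode implementation.

```java
public class PreCombineDispatchSketch {
    // Hypothetical stand-in for a record merger; NOT a Hudi interface.
    interface SketchMerger {
        String merge(String older, String newer);
    }

    // Hypothetical stand-in for the "operation mode awareness" marker.
    interface SketchModeAware {
        SketchMerger asPreCombiningMode();
    }

    // A separate merger instance that is only meant for in-batch dedup.
    static final SketchMerger PRE_COMBINE_ONLY =
        (older, newer) -> "preCombine(" + older + "," + newer + ")";

    // Mode-aware merger: hands out a different merger for precombine.
    static final class AwareMerger implements SketchMerger, SketchModeAware {
        @Override
        public String merge(String older, String newer) {
            return "update(" + older + "," + newer + ")";
        }

        @Override
        public SketchMerger asPreCombiningMode() {
            return PRE_COMBINE_ONLY;
        }
    }

    // Merger without mode awareness: one behavior for both operations.
    static final class PlainMerger implements SketchMerger {
        @Override
        public String merge(String older, String newer) {
            return "plain(" + older + "," + newer + ")";
        }
    }

    // Assumed shape of the helper: swap mergers only when mode-aware.
    static SketchMerger toPreCombineMode(SketchMerger merger) {
        return merger instanceof SketchModeAware
            ? ((SketchModeAware) merger).asPreCombiningMode()
            : merger;
    }

    public static void main(String[] args) {
        SketchMerger aware = new AwareMerger();
        SketchMerger plain = new PlainMerger();
        System.out.println(toPreCombineMode(aware).merge("a", "b"));
        System.out.println(toPreCombineMode(plain).merge("a", "b"));
    }
}
```

In this sketch, only the mode-aware merger is swapped for a different instance; 
the plain merger is returned unchanged, so it behaves the same in both 
operations.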
 
We need such differentiation for the OVERWRITE_WITH_LATEST or 
COMMIT_TIME_ORDERING mode, because in this mode the dedup within a batch uses 
"preCombine", which takes the record with the latest ordering field value for 
the same record key, while updates across batches use 
"combineAndGetUpdateValue", which takes the record with the latest 
commit/processing time, without considering the ordering field value.
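
To make the difference concrete, here is a minimal Java sketch of the two 
behaviors described above. Rec, preCombine and combineAcrossBatches are 
hypothetical names made up for this example and do not correspond to Hudi 
classes or methods; tie-breaking is chosen arbitrarily for the sketch.

```java
public class MergeModeSketch {
    // Hypothetical record holder; NOT a Hudi class.
    static final class Rec {
        final String key;
        final long orderingValue; // value of the ordering (precombine) field
        final long commitTime;    // commit/processing time of the batch
        final String payload;

        Rec(String key, long orderingValue, long commitTime, String payload) {
            this.key = key;
            this.orderingValue = orderingValue;
            this.commitTime = commitTime;
            this.payload = payload;
        }
    }

    // In-batch dedup: keep the record with the larger ordering value.
    // Ties go to the second argument (an arbitrary choice in this sketch).
    static Rec preCombine(Rec a, Rec b) {
        return b.orderingValue >= a.orderingValue ? b : a;
    }

    // Cross-batch merge under commit time ordering: the record from the
    // later commit wins, and the ordering value is ignored.
    static Rec combineAcrossBatches(Rec older, Rec newer) {
        return newer.commitTime >= older.commitTime ? newer : older;
    }

    public static void main(String[] args) {
        // Same key; the later commit carries a SMALLER ordering value.
        Rec first = new Rec("k1", 10L, 1L, "v1");
        Rec second = new Rec("k1", 5L, 2L, "v2");
        System.out.println(preCombine(first, second).payload);
        System.out.println(combineAcrossBatches(first, second).payload);
    }
}
```

With these inputs the two functions disagree: the precombine-style function 
keeps "v1" (higher ordering value), while the cross-batch function keeps "v2" 
(later commit). Applying precombine-style logic during log merging would 
therefore pick the wrong record in this mode.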
 
Right now, only two merger implementation classes support such differentiation 
of precombine:
 * HoodieAvroRecordMerger
 ** This is used to be backwards compatible with and use the payload class 
implementation, which can differentiate "preCombine" and 
"combineAndGetUpdateValue"
 ** Returns HoodiePreCombineAvroRecordMerger.INSTANCE from #asPreCombiningMode
 * HoodieSparkValidateDuplicateKeyRecordMerger
 ** Not used right now
 ** Returns 
HoodieRecordUtils.loadRecordMerger(classOf[DefaultSparkRecordMerger].getName) 
from #asPreCombiningMode

> Resolve issues w/ diff merge modes 
> -----------------------------------
>
>                 Key: HUDI-8521
>                 URL: https://issues.apache.org/jira/browse/HUDI-8521
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: reader-core, writer-core
>            Reporter: sivabalan narayanan
>            Assignee: Y Ethan Guo
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.0.0
>
>
> we found some issues w/ merge mode feature in 1.x. 
> we need to triage them and fix 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
