[
https://issues.apache.org/jira/browse/HUDI-8521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17899562#comment-17899562
]
Y Ethan Guo commented on HUDI-8521:
-----------------------------------
tl;dr: Precombine is supposed to be used for deduping records within the same
batch only. It should not be used in log merging, which is intended to pick up
updates across batches.
We found one issue with the OVERWRITE_WITH_LATEST merge mode where the wrong
record is picked after compaction in a MOR table.
This is related to the "preCombine" behavior introduced in [Pull
Request [HUDI-6368] Strength avro record
merger|https://github.com/apache/hudi/pull/8953].
This PR introduces a way to differentiate "preCombine" and
"combineAndGetUpdateValue" logic corresponding to the payload class
implementation in the new record merger.
preCombine is supposed to be used for deduping records within the same batch
only.
The API HoodieRecordUtils.mergerToPreCombineMode provides the merger to use for
"preCombine" only, i.e., a different record merger,
HoodiePreCombineAvroRecordMerger.INSTANCE, is returned by asPreCombiningMode as
the precombine merger to use, separate from the original merger.
Note that only mergers that implement OperationModeAwareness leverage the
logic above to differentiate "preCombine" and "combineAndGetUpdateValue" logic.
A merger that does not implement OperationModeAwareness has the same merging
behavior across "preCombine" and "combineAndGetUpdateValue".
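The dispatch described above can be sketched as follows. This is a minimal illustration, not the Hudi source: RecordMerger, ModeAwareMerger, PlainMerger and the name() method are hypothetical stand-ins, and the real OperationModeAwareness and mergerToPreCombineMode signatures may differ.

```java
// Simplified, hypothetical stand-ins for the interfaces named above; the real
// Hudi signatures are richer and may differ.
final class PreCombineModeSketch {

  interface RecordMerger {
    String name();
  }

  /** Opt-in interface: a merger that can hand out a different merger for preCombine. */
  interface OperationModeAwareness {
    RecordMerger asPreCombiningMode();
  }

  /** Mode-aware merger: returns a separate instance for preCombine. */
  static final class ModeAwareMerger implements RecordMerger, OperationModeAwareness {
    static final RecordMerger PRE_COMBINE = () -> "pre-combine";

    public String name() {
      return "combine-and-get-update-value";
    }

    public RecordMerger asPreCombiningMode() {
      return PRE_COMBINE;
    }
  }

  /** Plain merger: same behavior in both modes. */
  static final class PlainMerger implements RecordMerger {
    public String name() {
      return "plain";
    }
  }

  /** Only mode-aware mergers are swapped; all others are returned unchanged. */
  static RecordMerger mergerToPreCombineMode(RecordMerger merger) {
    return merger instanceof OperationModeAwareness
        ? ((OperationModeAwareness) merger).asPreCombiningMode()
        : merger;
  }

  public static void main(String[] args) {
    System.out.println(mergerToPreCombineMode(new ModeAwareMerger()).name()); // prints "pre-combine"
    System.out.println(mergerToPreCombineMode(new PlainMerger()).name());     // prints "plain"
  }
}
```

In this sketch, a merger that does not opt in is passed through untouched, which is why its behavior stays the same in both modes.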
We need such differentiation for the OVERWRITE_WITH_LATEST or
COMMIT_TIME_ORDERING mode, because in this mode the dedup within a batch uses
"preCombine", which takes the record with the latest ordering field value for
the same record key, while updates across batches use
"combineAndGetUpdateValue", which takes the record with the latest
commit/processing time, without considering the ordering field value.
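The two paths can be illustrated with a small, self-contained sketch. The Rec type and the dedupWithinBatch / mergeAcrossBatches helpers are hypothetical and are not Hudi classes; they only mimic the semantics described above, and the tie-breaking rule within a batch is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in types; none of these are Hudi classes.
final class MergeModeSketch {

  /** A simplified record: key, ordering field value, and payload. */
  static final class Rec {
    final String key;
    final long ordering;
    final String value;

    Rec(String key, long ordering, String value) {
      this.key = key;
      this.ordering = ordering;
      this.value = value;
    }
  }

  /** "preCombine": within one batch, keep the record with the latest ordering value per key. */
  static Map<String, Rec> dedupWithinBatch(List<Rec> batch) {
    Map<String, Rec> out = new LinkedHashMap<>();
    for (Rec r : batch) {
      // On a tie, the later record in the batch wins (an assumption of this sketch).
      out.merge(r.key, r, (oldR, newR) -> newR.ordering >= oldR.ordering ? newR : oldR);
    }
    return out;
  }

  /** "combineAndGetUpdateValue": across batches, the later commit always wins. */
  static Map<String, Rec> mergeAcrossBatches(Map<String, Rec> older, Map<String, Rec> newer) {
    Map<String, Rec> out = new LinkedHashMap<>(older);
    out.putAll(newer); // the ordering field is deliberately ignored here
    return out;
  }

  public static void main(String[] args) {
    List<Rec> batch1 = new ArrayList<>();
    batch1.add(new Rec("k1", 10, "a"));
    batch1.add(new Rec("k1", 20, "b")); // latest ordering value in batch 1
    List<Rec> batch2 = new ArrayList<>();
    batch2.add(new Rec("k1", 5, "c")); // later commit, lower ordering value

    Map<String, Rec> merged =
        mergeAcrossBatches(dedupWithinBatch(batch1), dedupWithinBatch(batch2));
    System.out.println(merged.get("k1").value); // prints "c"
  }
}
```

If the ordering-based rule were also applied across batches, as in the issue described in this comment, the sketch would keep "b" for k1 instead of "c".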
Right now, only two merger implementation classes support such differentiation
of precombine:
* HoodieAvroRecordMerger
** This is used for backwards compatibility with the payload class
implementation, which can differentiate "preCombine" and
"combineAndGetUpdateValue"
** Returns HoodiePreCombineAvroRecordMerger.INSTANCE from #asPreCombiningMode
* HoodieSparkValidateDuplicateKeyRecordMerger
** Not used right now
** Returns
HoodieRecordUtils.loadRecordMerger(classOf[DefaultSparkRecordMerger].getName)
from #asPreCombiningMode
> Resolve issues w/ diff merge modes
> -----------------------------------
>
> Key: HUDI-8521
> URL: https://issues.apache.org/jira/browse/HUDI-8521
> Project: Apache Hudi
> Issue Type: Bug
> Components: reader-core, writer-core
> Reporter: sivabalan narayanan
> Assignee: Y Ethan Guo
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 1.0.0
>
>
> we found some issues w/ merge mode feature in 1.x.
> we need to triage them and fix
--
This message was sent by Atlassian Jira
(v8.20.10#820010)