alexeykudinkin commented on a change in pull request #4724:
URL: https://github.com/apache/hudi/pull/4724#discussion_r815187048
##########
File path:
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecordPayload.java
##########
@@ -58,6 +58,31 @@ default T preCombine(T oldValue, Properties properties) {
return preCombine(oldValue);
}
+  /**
+   * When more than one HoodieRecord have the same HoodieKey in the incoming batch, this function combines them before attempting to insert/upsert by taking in a property map.
+   *
+   * @param oldValue instance of the old {@link HoodieRecordPayload} to be combined with.
+   * @param properties Payload related properties. For example pass the ordering field(s) name to extract from value in storage.
+   * @param schema Schema used for record
+   * @return the combined value
+   */
+  @PublicAPIMethod(maturity = ApiMaturityLevel.STABLE)
+  default T preCombine(T oldValue, Properties properties, Schema schema) {
Review comment:
Let me try to clarify a few things:
`preCombine` has a _very specific_ semantic: it de-duplicates by picking the "most recent" among records in the batch. The expectation is always that, when handed 2 records, it **has to** return one of them; it cannot produce a new record. If we want to revisit this semantic, that is a far larger change which will surely require writing an RFC and a broader discussion regarding the merits of such a migration. Please also keep in mind that, as of RFC-46, there's an effort underway to abstract the whole "record combination/merging" semantic out of the `RecordPayload` hierarchy into a standalone Combination/Merge Engine API.
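To make that contract concrete, here is a minimal, self-contained sketch. `SketchPayload` is a hypothetical class (not Hudi's actual `HoodieRecordPayload` API); the point is that `preCombine` only ever returns one of its two inputs, chosen by an ordering value, and never synthesizes a new record:

```java
import java.util.Properties;

// Hypothetical simplified payload, illustrating the preCombine contract
// described above (not Hudi's real classes).
class SketchPayload {
    final String value;
    final long orderingVal; // e.g. an event-time or commit-time field

    SketchPayload(String value, long orderingVal) {
        this.value = value;
        this.orderingVal = orderingVal;
    }

    // Mirrors preCombine(T oldValue, Properties properties): the result
    // is always one of the two inputs -- the one with the greater
    // ordering value -- never a newly merged record.
    SketchPayload preCombine(SketchPayload oldValue, Properties props) {
        return this.orderingVal >= oldValue.orderingVal ? this : oldValue;
    }
}
```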
> First, from the description of the preCombine method, it is used for combining multiple records with the same HoodieKey before attempting to insert/upsert to disk. "Combining multiple records" might not mean only choosing one of them; we could also combine & merge them into a new one, depending on how the sub-class implements the preCombine logic (please correct me if my understanding is wrong :) ). Yeah, it might be a little confusing that we need a Schema if we are trying to merge them.
Please see my comment regarding the `preCombine` semantics above. I certainly agree with you that the name is confusing, and I've tried to clear up that confusion. Let me know if you have more questions about it.
> Second, I checked when we call the preCombine method: it is trying to de-duplicate records with the same HoodieKey before insert/update to disk, especially in the Flink write case. Even though the de-duplication logic is to choose the latest record, we need to ensure that one HoodieKey contains only one record before comparing against the existing record and writing to disk; otherwise, some records will be missed. For example, HoodieMergeHandle.init(fileId, newRecordsIter) converts the record iterator to a map keyed by recordKey. So we cannot drop the de-duping logic and merge against what is on disk unless we change the logic there. And if we implement another class/method to handle the merge logic and switch the existing de-duping logic from calling preCombine to that new class/method, we have to add a condition to control whether we should call preCombine or not, which I think is not a good way. Instead, we should handle it in the preCombine method via different payload implementations.
You're bringing up good points; let's dive into them one by one. Currently we have 2 mechanisms:
1. `preCombine`, which selects the "most recent" record among those having the same key w/in the batch
2. `combineAndGetUpdateValue`, which combines the previous or "historical" record (on disk) with the new incoming one (all partial-merging semantics are currently implemented in this method)
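To illustrate the second mechanism, here is a simplified partial-merge sketch. Plain `Map`s stand in for Avro records, and the method name merely echoes the real one (Hudi's actual `combineAndGetUpdateValue` operates on `IndexedRecord` and a `Schema`); the idea shown is that fields the incoming record leaves null keep their stored value:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of partial-update merging a la combineAndGetUpdateValue:
// merge the incoming record into the record already on disk, keeping the
// stored value for any field the incoming record leaves null.
class PartialMergeSketch {
    static Map<String, Object> combineAndGetUpdateValue(
            Map<String, Object> onDisk, Map<String, Object> incoming) {
        // Start from what is on disk, so unmentioned fields survive.
        Map<String, Object> merged = new HashMap<>(onDisk);
        // Only non-null incoming fields overwrite stored values.
        incoming.forEach((field, value) -> {
            if (value != null) {
                merged.put(field, value);
            }
        });
        return merged;
    }
}
```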
You rightfully mention that one of the current invariants is that the batch gets de-duped at a certain level (b/c we have to maintain PK uniqueness on disk), and so we might need to shift that to accommodate the case that you have. And that's exactly what my question was: if you can elaborate on the use-case you have at hand that you're trying to solve w/ this PR, I'd be able to better understand where you're coming from and what the best path forward is for us here.
The questions I'm looking for answers to are basically the following:
1. What's nature of your use-case? (domain, record types, frequency, size,
etc)
2. Where requirements for partial updates are coming from?
etc. I'm happy to set up some 30 min to talk in person about this, or to connect on Slack and discuss it there.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]