stayrascal commented on a change in pull request #4724:
URL: https://github.com/apache/hudi/pull/4724#discussion_r815265857
##########
File path:
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecordPayload.java
##########
@@ -58,6 +58,31 @@ default T preCombine(T oldValue, Properties properties) {
return preCombine(oldValue);
}
+ /**
+ *When more than one HoodieRecord have the same HoodieKey in the incoming batch, this function combines them before attempting to insert/upsert by taking in a property map.
+ *
+ * @param oldValue instance of the old {@link HoodieRecordPayload} to be combined with.
+ * @param properties Payload related properties. For example pass the ordering field(s) name to extract from value in storage.
+ * @param schema Schema used for record
+ * @return the combined value
+ */
+ @PublicAPIMethod(maturity = ApiMaturityLevel.STABLE)
+ default T preCombine(T oldValue, Properties properties, Schema schema) {
Review comment:
Hi @alexeykudinkin, thanks a lot for your detailed clarification.
1. Regarding the design of `preCombine`, I'm clear now. I'm sorry, I don't
know the details of RFC-46, and I couldn't find a link to RFC-46
[here](https://cwiki.apache.org/confluence/display/HUDI/RFC+Process). Could you
please share the link?
2. Regarding the requirement for partial updates/overwrites, I have seen
similar requirements from the community. In my case, we want to build a
customer profile with multiple attributes that may come from different
systems. One system might provide only some of the attributes in an
event/record, and two systems might send events/records with different
attributes. We should not simply choose the most recent one; we need to merge
them before writing to disk. Otherwise, we have to keep all change logs and
then start a new job to dedup and merge these attributes across the change
logs. For example, suppose we have 10 attributes a1-a10 (all optional), source
system A only has a1-a5, and source system B only has a6-a10. The result we
expect is a final record that contains a1-a10, not only a1-a5 or a6-a10. And
because we might receive two events/records at the same time, they might land
in the same batch, which is why we want to merge them before
`combineAndGetUpdateValue`.
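
To illustrate the merge semantics described in point 2, here is a minimal
sketch using plain Java maps rather than the Hudi payload API. The class
`PartialMergeSketch`, the method `mergePartial`, and the sample attribute
values are hypothetical names chosen for illustration; the sketch assumes that
a `null` value means "this source does not provide the attribute".

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only: not part of Hudi and not the PR's implementation.
class PartialMergeSketch {

    // Merges two partial records that share the same key.
    // Non-null attributes in `newer` overwrite those in `older`;
    // null attributes in `newer` are ignored, so values from `older` survive.
    static Map<String, Object> mergePartial(Map<String, Object> older,
                                            Map<String, Object> newer) {
        Map<String, Object> merged = new LinkedHashMap<>(older);
        for (Map.Entry<String, Object> entry : newer.entrySet()) {
            if (entry.getValue() != null) {
                merged.put(entry.getKey(), entry.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Record from source system A: only a1 is populated.
        Map<String, Object> fromA = new LinkedHashMap<>();
        fromA.put("a1", "name-from-A");
        fromA.put("a6", null);

        // Record from source system B: only a6 is populated.
        Map<String, Object> fromB = new LinkedHashMap<>();
        fromB.put("a1", null);
        fromB.put("a6", "email-from-B");

        // The merged record carries attributes from both sources.
        System.out.println(mergePartial(fromA, fromB));
    }
}
```

With "latest record wins" semantics, only the attributes of one source would
survive when both records arrive in the same batch; merging at the attribute
level keeps both.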