stayrascal commented on a change in pull request #4724:
URL: https://github.com/apache/hudi/pull/4724#discussion_r815265857



##########
File path: hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecordPayload.java
##########
@@ -58,6 +58,31 @@ default T preCombine(T oldValue, Properties properties) {
     return preCombine(oldValue);
   }
 
+  /**
+   *When more than one HoodieRecord have the same HoodieKey in the incoming batch, this function combines them before attempting to insert/upsert by taking in a property map.
+   *
+   * @param oldValue instance of the old {@link HoodieRecordPayload} to be combined with.
+   * @param properties Payload related properties. For example pass the ordering field(s) name to extract from value in storage.
+   * @param schema Schema used for record
+   * @return the combined value
+   */
+  @PublicAPIMethod(maturity = ApiMaturityLevel.STABLE)
+  default T preCombine(T oldValue, Properties properties, Schema schema) {

Review comment:
       Hi @alexeykudinkin, thanks a lot for your detailed clarification.
   1. Regarding the design of `preCombine`, I'm clear now. Sorry, I'm not 
familiar with the details of RFC-46, and I couldn't find a link to it 
[here](https://cwiki.apache.org/confluence/display/HUDI/RFC+Process). Could you 
please share the link?
   2. Regarding the requirement for partial updates/overwrites, I have seen 
similar requests from the community. In our case, we want to build a customer 
profile with multiple attributes, and these attributes may come from different 
systems. One system may provide only some of the attributes in an event/record, 
and two systems may send events/records with different attributes. We should 
not simply pick the most recent record; we need to merge them before writing to 
disk. Otherwise, we have to keep all the change logs and then start a new job 
to dedup and merge the attributes across them. For example, suppose we have 10 
attributes a1-a10 (all optional), where source system A only has a1-a5 and 
source system B only has a6-a10. The result we expect is a final record 
containing a1-a10, not only a1-a5 or only a6-a10. And because we may receive 
two events/records at the same time, they may land in the same batch, which is 
why we want to merge them before `combineAndGetUpdateValue`.
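
To make the a1-a10 case concrete, here is a minimal sketch of the merge semantics described above. It deliberately does not use any Hudi classes: records are modelled as plain maps, and `PartialMergeSketch` and `mergePartial` are hypothetical names chosen for illustration only. The assumed rule is that a non-null value in the newer record wins, while a null or missing value falls back to whatever the older record holds.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PartialMergeSketch {

    // Hypothetical helper, not part of Hudi: merges two partial records that share a key.
    // Assumption: a null value means "attribute not provided by this source".
    public static Map<String, Object> mergePartial(Map<String, Object> older,
                                                   Map<String, Object> newer) {
        // Start from the older record so its attributes are kept by default.
        Map<String, Object> merged = new LinkedHashMap<>(older);
        for (Map.Entry<String, Object> e : newer.entrySet()) {
            // Overwrite only when the newer record actually carries a value,
            // or when the attribute is not known at all yet.
            if (e.getValue() != null || !merged.containsKey(e.getKey())) {
                merged.put(e.getKey(), e.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Object> fromSystemA = new LinkedHashMap<>();
        Map<String, Object> fromSystemB = new LinkedHashMap<>();
        for (int i = 1; i <= 5; i++) {
            fromSystemA.put("a" + i, "A-value-" + i);   // system A only knows a1-a5
        }
        for (int i = 6; i <= 10; i++) {
            fromSystemB.put("a" + i, "B-value-" + i);   // system B only knows a6-a10
        }
        // Both records arrive in the same batch; the merged record carries a1-a10.
        System.out.println(mergePartial(fromSystemA, fromSystemB));
    }
}
```

In a real payload the same rule would be applied field by field to the Avro record, which is why the schema is needed inside `preCombine`.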





