bhasudha commented on a change in pull request #1704:
URL: https://github.com/apache/hudi/pull/1704#discussion_r439648391



##########
File path: 
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecordPayload.java
##########
@@ -50,8 +50,25 @@
    * @param schema Schema used for record
    * @return new combined/merged value to be written back to storage. EMPTY to skip writing this record.
    */
+  @Deprecated
   Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema) throws IOException;
 
+  /**
+   * This method lets you write custom merging/combining logic to produce a new value as a function of the current value on
+   * storage and what is contained in this object.
+   * <p>
+   * e.g. 1) You are updating counters; you may want to add counts to the currentValue and write back the updated counts. 2) You
+   * may be reading DB redo logs, and merging them with the current image of a database row on storage.
+   *
+   * @param currentValue Current value in storage, to merge/combine this payload with
+   * @param schema Schema used for record
+   * @param props Payload related properties. For example, pass the ordering field(s) name to extract from the value in storage.
+   * @return new combined/merged value to be written back to storage. EMPTY to skip writing this record.
+   */
+  default Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema, Map<String, String> props) throws IOException {

Review comment:
       Thanks for the clarification @n3nash. I am not able to open the link either; it could be because the initial reporter had different JIRA IDs, or something else. But the comments section has good context on what this ticket is about. Also, this has come up multiple times in Slack channels from different users. Based on that, I can summarize as follows.
   
    Users want records on disk to be taken into account when using the `OverwriteWithLatestAvro` payload class. In the `preCombine` step, we pick the latest value for every key from a batch of input data based on some ordering field; at this step we don't yet know what is in storage. When we are ready to write the records, we iterate over the records in storage and, for each record, determine whether there is a matching entry in the input batch. If so, we invoke `combineAndGetUpdateValue` and pass the record on disk as the `currentValue` param. In this step specifically, `OverwriteWithLatestAvro` can overwrite an already stored record with an older update, since we weren't comparing the ordering value of the new data with the data on disk. This whole PR is about providing that capability instead of asking users to write their own implementation. With respect to `preCombine` and `combineAndGetUpdateValue`, both will use the same ordering field(s) to determine the latest record.
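   To illustrate the ordering-aware merge described above, here is a deliberately simplified, self-contained sketch. It is not Hudi's actual implementation: plain `Map`s stand in for Avro `IndexedRecord`s, `java.util.Optional` stands in for Hudi's `Option`, and the `"ts"` ordering field name and `"ordering.field"` props key are hypothetical. Only the comparison logic mirrors what the PR adds.

```java
import java.util.Map;
import java.util.Optional;

// Simplified stand-in for an ordering-aware payload. Records are plain Maps
// and Optional replaces Hudi's Option; field/property names are hypothetical.
class OrderingAwarePayload {
    private final Map<String, Object> incoming;

    OrderingAwarePayload(Map<String, Object> incoming) {
        this.incoming = incoming;
    }

    // Mirrors combineAndGetUpdateValue(currentValue, schema, props): keep the
    // stored record if its ordering value is newer than the incoming one.
    Optional<Map<String, Object>> combineAndGetUpdateValue(
            Map<String, Object> currentValue, Map<String, String> props) {
        String orderingField = props.getOrDefault("ordering.field", "ts");
        long storedOrd = ((Number) currentValue.get(orderingField)).longValue();
        long incomingOrd = ((Number) incoming.get(orderingField)).longValue();
        // Without this comparison, a late-arriving older update would
        // blindly overwrite the newer record already on disk.
        return storedOrd > incomingOrd
                ? Optional.of(currentValue)
                : Optional.of(incoming);
    }
}
```

   With this check in place, an input batch carrying a stale update leaves the newer stored record untouched, which is exactly the gap users were hitting with `OverwriteWithLatestAvro`.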
   
   In order not to disrupt other payload classes, I deprecated the existing `combineAndGetUpdateValue`, extended it, and provided a default implementation that ignores the `Map` argument and internally falls back to the old behavior. `OverwriteWithLatestAvroPayload` alone will override this default to achieve the above purpose.
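   The back-compat pattern described here can be sketched as follows. This is a minimal stand-in, not the real `HoodieRecordPayload` interface: `Optional` replaces Hudi's `Option`, the checked `IOException` is omitted for brevity, and `LegacyPayload` is a hypothetical implementor that only knows the old method.

```java
import java.util.Map;
import java.util.Optional;

// Sketch of the compatibility pattern: the old overload is deprecated, and a
// new default overload ignores the props Map and delegates to it, so existing
// payload classes keep working without any code change.
interface RecordPayload<T> {
    @Deprecated
    Optional<T> combineAndGetUpdateValue(T currentValue);

    // Existing implementations inherit this default, which simply falls
    // back to the old two-argument behavior.
    default Optional<T> combineAndGetUpdateValue(T currentValue, Map<String, String> props) {
        return combineAndGetUpdateValue(currentValue);
    }
}

// A legacy payload that only implements the old method still compiles and
// transparently picks up the new entry point via the default method.
class LegacyPayload implements RecordPayload<String> {
    @Override
    public Optional<String> combineAndGetUpdateValue(String currentValue) {
        return Optional.of(currentValue);
    }
}
```

   Only a class that actually needs the ordering-aware behavior (like `OverwriteWithLatestAvroPayload` in this PR) would override the three-argument default.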
   
   Hope that helps!




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

