bhasudha commented on a change in pull request #1704:
URL: https://github.com/apache/hudi/pull/1704#discussion_r439648391
##########
File path:
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecordPayload.java
##########
@@ -50,8 +50,25 @@
* @param schema Schema used for record
* @return new combined/merged value to be written back to storage. EMPTY to
skip writing this record.
*/
+ @Deprecated
Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue,
Schema schema) throws IOException;
+ /**
+ * This methods lets you write custom merging/combining logic to produce new
values as a function of current value on
+ * storage and whats contained in this object.
+ * <p>
+ * eg: 1) You are updating counters, you may want to add counts to
currentValue and write back updated counts 2) You
+ * may be reading DB redo logs, and merge them with current image for a
database row on storage
+ *
+ * @param currentValue Current value in storage, to merge/combine this
payload with
+ * @param schema Schema used for record
+ * @param props Payload related properties. For example pass the ordering
field(s) name to extract from value in storage.
+ * @return new combined/merged value to be written back to storage. EMPTY to
skip writing this record.
+ */
+ default Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord
currentValue, Schema schema, Map<String, String> props) throws IOException {
Review comment:
Thanks for the clarification @n3nash . I am not able to open the link
either. It could be because the initial reporter had different jira ids or
something else. But the comments section has good context on what this ticket
is about. Also, this has come across multiple times in Slack channels from
different users. Based on that I can summarize as follows.
Users are looking to consider records in disk when using
`OverwriteWithLatestAvro` payload class. In `preCombine` step - we pick the
latest value for every key from a batch of input data based on some ordering
field. At this step we dont know what is in storage yet. When we are ready to
write the records, we iterate the records in storage, for each record determine
if there is a matching entry in the input batch. If so, we invoke
`combineAndGetUpdateValue` and pass in the record in disk as the `currentValue`
param. In this step specifically, `OverwriteWithLatestAvro` can possibly
overwrite an already stored record with an older update since we werent
comparing the ordering val of new data with data in disk. This whole PR is to
provide that capability instead of asking Users to write their own
implementation to do this. w.r.t `preCombine` and `combineAndGetUpdateValue`
both will be using the same ordering field(s) to determine the latest.
In order to not disrupt other payload classes, I deprecated the existing
`combineAndGetUpdateValue`, extended it and provided a default implementation
that would ignore the Map field and internally do the old way.
`OverwriteWithLatestAvroPayload` alone will override this default functionality
to achieve the above purpose.
Hope that helps!
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]