nsivabalan commented on code in PR #7942:
URL: https://github.com/apache/hudi/pull/7942#discussion_r1106035940
##########
website/docs/record_payload.md:
##########
@@ -0,0 +1,97 @@
+---
+title: Record Payload
+keywords: [hudi, merge, upsert, precombine]
+---
+
+## Record Payload
+
+One of the core features of Hudi is the ability to incrementally upsert data,
deduplicate and merge records on the fly.
+Additionally, users can implement their custom logic to merge the input
records with the record on storage. Record
+payload is an abstract representation of a Hudi record that allows the
aforementioned capability. As we shall see below,
+Hudi provides out-of-box support for different payloads for different use
cases, and a new record merger API for
+optimized payload handling. But, first let us understand how record payload is
used in the Hudi upsert path.
+
+<figure>
+ <img className="docimage"
src={require("/assets/images/upsert_path.png").default} alt="upsert_path.png" />
+</figure>
+
+Figure above shows the main stages that records go through while being written
to the Hudi table. In the precombining
+stage, Hudi performs any deduplication based on the payload implementation and
precombine key configured by the user.
+Further, on index lookup, Hudi identifies which records are being updated and
the record payload implementation tells
+Hudi how to merge the incoming record with the existing record on storage.
+
+### Existing Payloads
+
+#### OverwriteWithLatestAvroPayload
+
+This is the default record payload implementation. It picks the record with
the greatest value (determined by calling
+.compareTo() on the value of precombine key) to break ties and simply picks
the latest record while merging. This gives
Review Comment:
Actually, there is a little more to this. Let's land this doc for 0.13.0, but
address these comments as an immediate follow-up.
We have the `preCombine` and `combineAndGetUpdateValue` methods, which are used
on different occasions, so calling out just `preCombine` may not be right. When
merging with what is in storage, this payload (`OverwriteWithLatestAvroPayload`)
specifically ignores the preCombine value.
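
To make the distinction concrete, here is a minimal toy model of the two hooks
as described in this comment. It is a sketch only: the class and method names
are hypothetical stand-ins, not the Hudi API, and the tie-breaking rule in
`pre_combine` is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ToyOverwriteWithLatest:
    """Toy stand-in for a payload; not the actual Hudi class."""
    data: dict         # the record's field values
    ordering_val: Any  # value of the precombine field

    def pre_combine(self, other: "ToyOverwriteWithLatest") -> "ToyOverwriteWithLatest":
        # Deduplication within the incoming batch: the ordering value decides.
        # On a tie this record is kept (assumption of this sketch).
        return other if other.ordering_val > self.ordering_val else self

    def combine_and_get_update_value(self, stored: Optional[dict]) -> dict:
        # Merge against storage: the stored record and its ordering value
        # are ignored, so the incoming record simply overwrites.
        return self.data


# Two incoming records with the same key: the greater ordering value wins.
newer = ToyOverwriteWithLatest({"id": 1, "ts": 5, "v": "new"}, 5)
older = ToyOverwriteWithLatest({"id": 1, "ts": 3, "v": "old"}, 3)
winner = older.pre_combine(newer)

# A late-arriving record still replaces a stored record with a higher ts.
late = ToyOverwriteWithLatest({"id": 1, "ts": 1, "v": "late"}, 1)
stored = {"id": 1, "ts": 9, "v": "stored"}
merged = late.combine_and_get_update_value(stored)
```

In this model the ordering value only matters inside the batch, which is why
describing the payload purely in terms of the precombine key can mislead.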
##########
website/docs/record_payload.md:
##########
@@ -0,0 +1,97 @@
+#### OverwriteWithLatestAvroPayload
+
+This is the default record payload implementation. It picks the record with
the greatest value (determined by calling
+.compareTo() on the value of precombine key) to break ties and simply picks
the latest record while merging. This gives
+latest-write-wins style semantics.
+
+#### EventTimeAvroPayload
+
+Some use cases require merging records by event time and thus event time plays
the role of an ordering field. This
+payload is particularly useful in the case of late-arriving data. For such use
cases, users need to set
+the [payload event time field](/docs/configurations#RECORD_PAYLOAD)
configuration.
+
+#### ExpressionPayload
+
+This payload is very useful when you want to merge or delete records based on
some conditional expression, especially
Review Comment:
Should we remove this from the list? I thought it is meant to be used only
internally. Can anyone directly set `ExpressionPayload` for their table?
##########
website/docs/record_payload.md:
##########
@@ -0,0 +1,97 @@
+#### ExpressionPayload
+
+This payload is very useful when you want to merge or delete records based on
some conditional expression, especially
+when updating records using [`MERGE INTO`](/docs/quick-start-guide#mergeinto)
statement.
+
+#### Payload to support partial update
+
+Typically, once the merge step resolves which record to pick, then the record
on storage is fully replaced by the
+resolved record. But, in some cases, the requirement is to update only certain
fields and not replace the whole record.
+This is called partial update.
+`PartialUpdateAvroPayload` in Hudi provides out-box-support for such use
cases. To illustrate the point, let us look at
+a simple example:
+
+Let's say the order field is `ts` and schema is :
+
+```
+{
+ [
+ {"name":"id","type":"string"},
+ {"name":"ts","type":"long"},
+ {"name":"name","type":"string"},
+ {"name":"price","type":"string"}
+ ]
+}
+```
+
+Current record in storage:
+
+```
+ id ts name price
+ 1 2 name_1 null
+```
+
+Incoming record:
+
+```
+ id ts name price
+ 1 1 null price_1
+```
+
+Result data after merging using `PartialUpdateAvroPayload`:
+
+```
+ id ts name price
+ 1 2 name_1 price_1
Review Comment:
How is the value of `ts` 2? It is not intuitive to me. I thought that, for null
values in the new incoming record, we would choose the older value.
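
One reading that would produce the quoted result is a field-wise merge in which
the record with the greater `ts` is the base and its null fields are filled from
the other record. The sketch below models only that reading. `partial_merge` is
a hypothetical helper, and nothing here is confirmed to be how
`PartialUpdateAvroPayload` is implemented.

```python
from typing import Any, Dict

Record = Dict[str, Any]


def partial_merge(stored: Record, incoming: Record, ordering_field: str = "ts") -> Record:
    """Field-wise merge: the record with the greater ordering value is the base."""
    # Ties favour the incoming record (assumption of this sketch).
    if incoming[ordering_field] >= stored[ordering_field]:
        base, other = incoming, stored
    else:
        base, other = stored, incoming
    # Keep the base's non-null fields; fill its nulls from the other record.
    return {k: (v if v is not None else other.get(k)) for k, v in base.items()}


# The example from the doc under review.
stored = {"id": "1", "ts": 2, "name": "name_1", "price": None}
incoming = {"id": "1", "ts": 1, "name": None, "price": "price_1"}
result = partial_merge(stored, incoming)
```

Under this reading `ts` stays 2 because the stored record has the greater
ordering value and its `ts` is non-null. The doc should state which rule
actually applies.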
##########
website/docs/record_payload.md:
##########
@@ -0,0 +1,97 @@
+There are quite a few other implementations provided by Hudi. For example,
`MySqlDebeziumAvroPayload` and
+`PostgresDebeziumAvroPayload` provides support for seamlessly applying changes
captured via Debezium for MySQL and
+PostgresDB.
+`AWSDmsAvroPayload` provides support for applying changes captured via Amazon
Database Migration Service onto S3.
Review Comment:
Let's also mention `OverwriteNonDefaultsWithLatestAvroPayload` and
`DefaultHoodieRecordPayload` here.
##########
website/docs/record_payload.md:
##########
@@ -0,0 +1,97 @@
+#### EventTimeAvroPayload
+
+Some use cases require merging records by event time and thus event time plays
the role of an ordering field. This
Review Comment:
Curious to know: how is this different from using `DefaultHoodieRecordPayload`
with the event time as the payload ordering field?