This is an automated email from the ASF dual-hosted git repository.
sivabalan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 0ea1cfe616d [HUDI-5753] Add docs for record payload (#7942)
0ea1cfe616d is described below
commit 0ea1cfe616de72a0518a7d7a2b28f0a793caddc5
Author: Sagar Sumit <[email protected]>
AuthorDate: Tue Apr 11 01:05:25 2023 +0530
[HUDI-5753] Add docs for record payload (#7942)
---
website/docs/record_payload.md | 143 +++++++++++++++++++++++++++
website/sidebars.js | 1 +
website/static/assets/images/upsert_path.png | Bin 0 -> 32830 bytes
3 files changed, 144 insertions(+)
diff --git a/website/docs/record_payload.md b/website/docs/record_payload.md
new file mode 100644
index 00000000000..3fcc05d8b6d
--- /dev/null
+++ b/website/docs/record_payload.md
@@ -0,0 +1,143 @@
+---
+title: Record Payload
+keywords: [hudi, merge, upsert, precombine]
+---
+
+## Record Payload
+
+One of the core features of Hudi is the ability to incrementally upsert data, deduplicating and merging records on the fly.
+Additionally, users can implement their own logic to merge incoming records with the record on storage. Record
+payload is the abstract representation of a Hudi record that enables this capability. As we shall see below,
+Hudi provides out-of-the-box support for different payloads for different use cases. But first, let us understand how the
+record payload is used in the Hudi upsert path.
+
+<figure>
+ <img className="docimage" src={require("/assets/images/upsert_path.png").default} alt="upsert_path.png" />
+</figure>
+
+The figure above shows the main stages that records go through while being written to a Hudi table. In the precombining
+stage, Hudi deduplicates incoming records based on the payload implementation and the precombine key configured by the
+user. Further, on index lookup, Hudi identifies which records are being updated, and the record payload implementation
+tells Hudi how to merge an incoming record with the existing record on storage.
+
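+The payload implementation and the precombine key are selected via write configuration. A minimal sketch (option names
+per Hudi's write configurations; the payload class shown is just one possible choice):
+
+```
+# payload implementation used for precombining and merging
+hoodie.datasource.write.payload.class=org.apache.hudi.common.model.DefaultHoodieRecordPayload
+# field used as the precombine key to order records
+hoodie.datasource.write.precombine.field=ts
+```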
+### Existing Payloads
+
+#### OverwriteWithLatestAvroPayload
+
+This is the default record payload implementation. During precombining, it breaks ties by picking the record with the
+greatest value of the precombine key (determined by calling `.compareTo()` on it), and while merging it simply picks
+the latest incoming record. This gives latest-write-wins style semantics.
+
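The behavior can be summarized with a small sketch. This is illustrative Python, not Hudi's actual Java implementation; record shapes and the `ts` precombine key are assumptions for the example:

```python
# Illustrative sketch of OverwriteWithLatestAvroPayload semantics,
# assuming records are dicts and "ts" is the precombine key.

def precombine(existing, incoming, precombine_field="ts"):
    # Deduplication step: keep the record with the greater precombine value.
    return incoming if incoming[precombine_field] >= existing[precombine_field] else existing

def merge(on_storage, incoming):
    # Latest-write-wins: the incoming record fully replaces the stored one,
    # regardless of the ordering field.
    return incoming

stored = {"id": "1", "ts": 2, "name": "name_2", "price": "price_2"}
incoming = {"id": "1", "ts": 1, "name": "name_1", "price": "price_1"}
print(merge(stored, incoming))  # incoming wins even though its ts is smaller
```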
+#### DefaultHoodieRecordPayload
+While `OverwriteWithLatestAvroPayload` precombines based on an ordering field but simply picks the latest record while
+merging, `DefaultHoodieRecordPayload` honors the ordering field for both precombining and merging. Let's understand the
+difference with an example:
+
+Let's say the ordering field is `ts`, the record key is `id`, and the schema is:
+
+```
+{
+  "type": "record",
+  "name": "SampleRecord",
+  "fields": [
+    {"name": "id", "type": "string"},
+    {"name": "ts", "type": "long"},
+    {"name": "name", "type": "string"},
+    {"name": "price", "type": "string"}
+  ]
+}
+```
+
+Current record in storage:
+
+```
+ id ts name price
+ 1 2 name_2 price_2
+```
+
+Incoming record:
+
+```
+ id ts name price
+ 1 1 name_1 price_1
+```
+
+Result data after merging using `OverwriteWithLatestAvroPayload` (latest-write-wins):
+
+```
+ id ts name price
+ 1 1 name_1 price_1
+```
+
+Result data after merging using `DefaultHoodieRecordPayload` (always honors the ordering field):
+
+```
+ id ts name price
+ 1 2 name_2 price_2
+```
+
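The difference in merge behavior can be sketched as follows. This is illustrative Python, not Hudi's actual Java implementation; record shapes and the `ts` ordering field are assumptions for the example:

```python
# Illustrative sketch of DefaultHoodieRecordPayload merge semantics:
# the ordering field ("ts") is honored while merging, not just while precombining.

def merge_default_payload(on_storage, incoming, ordering_field="ts"):
    # Keep whichever record has the greater ordering value.
    return incoming if incoming[ordering_field] >= on_storage[ordering_field] else on_storage

stored = {"id": "1", "ts": 2, "name": "name_2", "price": "price_2"}
incoming = {"id": "1", "ts": 1, "name": "name_1", "price": "price_1"}

# The late-arriving incoming record (ts=1) loses to the stored record (ts=2).
print(merge_default_payload(stored, incoming))
```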
+#### EventTimeAvroPayload
+
+Some use cases require merging records by event time, and thus event time plays the role of an ordering field. This
+payload is particularly useful in the case of late-arriving data. For such use cases, users need to set
+the [payload event time field](/docs/configurations#RECORD_PAYLOAD) configuration.
+
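+A minimal sketch of the relevant write configs (option names per Hudi's payload configurations; `ts` is just an
+example field):
+
+```
+hoodie.datasource.write.payload.class=org.apache.hudi.common.model.EventTimeAvroPayload
+# event time field used to merge late-arriving records
+hoodie.payload.event.time.field=ts
+# ordering field honored while merging
+hoodie.payload.ordering.field=ts
+```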
+#### OverwriteNonDefaultsWithLatestAvroPayload
+
+This payload is quite similar to `OverwriteWithLatestAvroPayload`, with a slight difference while merging records. For
+precombining, just like `OverwriteWithLatestAvroPayload`, it picks the latest record for a key based on an ordering
+field. While merging, it overwrites the existing record on storage only for those fields of the incoming record that
+**do not hold the default value** for that field.
+
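A sketch of this merge behavior, in illustrative Python rather than Hudi's actual Java implementation, assuming `None` stands in for a field's default value:

```python
# Illustrative sketch of OverwriteNonDefaultsWithLatestAvroPayload merging:
# overwrite a stored field only when the incoming value is not the default.

def merge_non_defaults(on_storage, incoming, default=None):
    merged = dict(on_storage)
    for field, value in incoming.items():
        if value != default:  # fields still holding the default are skipped
            merged[field] = value
    return merged

stored = {"id": "1", "ts": 2, "name": "name_2", "price": "price_2"}
incoming = {"id": "1", "ts": 3, "name": None, "price": "price_3"}

# "name" keeps its stored value because the incoming value is the default.
print(merge_non_defaults(stored, incoming))
```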
+#### PartialUpdateAvroPayload
+
+This payload supports partial updates. Typically, once the merge step resolves which record to pick, the record on
+storage is fully replaced by the resolved record. But in some cases, the requirement is to update only certain fields
+and not replace the whole record. This is called a partial update. `PartialUpdateAvroPayload` provides out-of-the-box
+support for such use cases. To illustrate the point, let us look at a simple example:
+
+Let's say the ordering field is `ts`, the record key is `id`, and the schema is:
+
+```
+{
+  "type": "record",
+  "name": "SampleRecord",
+  "fields": [
+    {"name": "id", "type": "string"},
+    {"name": "ts", "type": "long"},
+    {"name": "name", "type": "string"},
+    {"name": "price", "type": "string"}
+  ]
+}
+```
+```
+
+Current record in storage:
+
+```
+ id ts name price
+ 1 2 name_1 null
+```
+
+Incoming record:
+
+```
+ id ts name price
+ 1 1 null price_1
+```
+
+Result data after merging using `PartialUpdateAvroPayload`:
+
+```
+ id ts name price
+ 1 2 name_1 price_1
+```
+
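The example above can be sketched as follows. This is illustrative Python, not Hudi's actual Java implementation; using `None` for missing field values is an assumption for the example:

```python
# Illustrative sketch of PartialUpdateAvroPayload merging: the record with the
# greater ordering value wins, and its missing (null) fields are filled in from
# the other record.

def merge_partial_update(on_storage, incoming, ordering_field="ts"):
    if incoming[ordering_field] >= on_storage[ordering_field]:
        base, other = incoming, on_storage
    else:
        base, other = on_storage, incoming
    # Fill nulls in the winning record from the other side.
    return {k: (v if v is not None else other.get(k)) for k, v in base.items()}

stored = {"id": "1", "ts": 2, "name": "name_1", "price": None}
incoming = {"id": "1", "ts": 1, "name": None, "price": "price_1"}

# Stored wins on ts, but its null "price" is filled from the incoming record.
print(merge_partial_update(stored, incoming))
```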
+### Summary
+
+In this document, we highlighted the role of the record payload in supporting fast incremental ETL with updates and
+deletes. We also talked about some payload implementations readily provided by Hudi. There are quite a few other
+implementations, and interested developers can look at the hierarchy of the `HoodieRecordPayload` interface. For
+example, `MySqlDebeziumAvroPayload` and `PostgresDebeziumAvroPayload` provide support for seamlessly applying changes
+captured via Debezium for MySQL and PostgreSQL, while `AWSDmsAvroPayload` provides support for applying changes
+captured via AWS Database Migration Service onto S3.
+
+Record payloads are tunable to suit many use cases. Please check out the configurations
+listed [here](/docs/configurations#RECORD_PAYLOAD). Moreover, if users want to implement their own custom merge logic,
+please check
+out [this FAQ](/docs/faq/#can-i-implement-my-own-logic-for-how-input-records-are-merged-with-record-on-storage). In a
+separate document, we will talk about a new record merger API for optimized payload handling.
diff --git a/website/sidebars.js b/website/sidebars.js
index 2097e94f731..464f8a171ce 100644
--- a/website/sidebars.js
+++ b/website/sidebars.js
@@ -31,6 +31,7 @@ module.exports = {
'schema_evolution',
'key_generation',
'concurrency_control',
+ 'record_payload'
],
},
{
diff --git a/website/static/assets/images/upsert_path.png b/website/static/assets/images/upsert_path.png
new file mode 100644
index 00000000000..3321f1f75b8
Binary files /dev/null and b/website/static/assets/images/upsert_path.png differ