This is an automated email from the ASF dual-hosted git repository. bhavanisudha pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push: new 586b4638e34 [DOCS] Update Record payload page (#9562) 586b4638e34 is described below commit 586b4638e3423b26f70156d27e5932bd8513d15f Author: Bhavani Sudha Saktheeswaran <2179254+bhasu...@users.noreply.github.com> AuthorDate: Tue Aug 29 05:44:39 2023 -0700 [DOCS] Update Record payload page (#9562) --- website/docs/record_payload.md | 57 ++++++++++++++++++++++++++++++-------- website/src/theme/DocPage/index.js | 2 +- 2 files changed, 46 insertions(+), 13 deletions(-) diff --git a/website/docs/record_payload.md b/website/docs/record_payload.md index 3fcc05d8b6d..1ed47b2ca96 100644 --- a/website/docs/record_payload.md +++ b/website/docs/record_payload.md @@ -1,9 +1,12 @@ --- title: Record Payload keywords: [hudi, merge, upsert, precombine] +toc: true +toc_min_heading_level: 2 +toc_max_heading_level: 4 --- -## Record Payload +### Background One of the core features of Hudi is the ability to incrementally upsert data, deduplicate and merge records on the fly. Additionally, users can implement their custom logic to merge the input records with the record on storage. Record @@ -20,15 +23,38 @@ stage, Hudi performs any deduplication based on the payload implementation and p Further, on index lookup, Hudi identifies which records are being updated and the record payload implementation tells Hudi how to merge the incoming record with the existing record on storage. + +### Configs + +Payload class can be specified using the below configs. For more advanced configs refer [here](https://hudi.apache.org/docs/configurations#RECORD_PAYLOAD) + +**Spark based configs;** + +| Config Name | Default | Description | +| ---------------------------------------| ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| hoodie.datasource.write.payload.class | org.apache.hudi.common.model.OverwriteWithLatestAvroPayload (Optional) | Payload class used. Override this, if you like to roll your own merge logic, when upserting/inserting. This will render any value set for PRECOMBINE_FIELD_OPT_VAL in-effective<br /><br />`Config Param: WRITE_PAYLOAD_CLASS_NAME` | + +**Flink based configs:** + +| Config Name | Default | Description | +| ---------------------------------------| ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| payload.class | org.apache.hudi.common.model.EventTimeAvroPayload (Optional) | Payload class used. Override this, if you like to roll your own merge logic, when upserting/inserting. This will render any value set for the option in-effective<br /><br /> `Config Param: PAYLOAD_CLASS_NAME` | + ### Existing Payloads #### OverwriteWithLatestAvroPayload +```scala +hoodie.datasource.write.payload.class=org.apache.hudi.common.model.OverwriteWithLatestAvroPayload +``` This is the default record payload implementation. It picks the record with the greatest value (determined by calling `.compareTo()` on the value of precombine key) to break ties and simply picks the latest record while merging. This gives latest-write-wins style semantics. #### DefaultHoodieRecordPayload +```scala +hoodie.datasource.write.payload.class=org.apache.hudi.common.model.DefaultHoodieRecordPayload +``` While `OverwriteWithLatestAvroPayload` precombines based on an ordering field and picks the latest record while merging, `DefaultHoodieRecordPayload` honors the ordering field for both precombinig and merging. Let's understand the difference with an example: @@ -74,20 +100,26 @@ Result data after merging using `DefaultHoodieRecordPayload` (always honors orde ``` #### EventTimeAvroPayload - -Some use cases require merging records by event time and thus event time plays the role of an ordering field. This -payload is particularly useful in the case of late-arriving data. For such use cases, users need to set -the [payload event time field](/docs/configurations#RECORD_PAYLOAD) configuration. +```scala +hoodie.datasource.write.payload.class=org.apache.hudi.common.model.EventTimeAvroPayload +``` +This is the default record payload for Flink based writing. Some use cases require merging records by event time and +thus event time plays the role of an ordering field. This payload is particularly useful in the case of late-arriving data. +For such use cases, users need to set the [payload event time field](/docs/configurations#RECORD_PAYLOAD) configuration. #### OverwriteNonDefaultsWithLatestAvroPayload - +```scala +hoodie.datasource.write.payload.class=org.apache.hudi.common.model.OverwriteNonDefaultsWithLatestAvroPayload +``` This payload is quite similar to `OverwriteWithLatestAvroPayload` with slight difference while merging records. For precombining, just like `OverwriteWithLatestAvroPayload`, it picks the latest record for a key, based on an ordering field. While merging, it overwrites the existing record on storage only for the specified **fields that don't equal default value** for that field. #### PartialUpdateAvroPayload - +```scala +hoodie.datasource.write.payload.class=org.apache.hudi.common.model.PartialUpdateAvroPayload +``` This payload supports partial update. Typically, once the merge step resolves which record to pick, then the record on storage is fully replaced by the resolved record. But, in some cases, the requirement is to update only certain fields and not replace the whole record. This is called partial update. `PartialUpdateAvroPayload` provides out-of-box support @@ -132,12 +164,13 @@ Result data after merging using `PartialUpdateAvroPayload`: In this document, we highlighted the role of record payload to support fast incremental ETL with updates and deletes. We also talked about some payload implementations readily provided by Hudi. There are quite a few other implementations and developers would be interested in looking at the hierarchy of `HoodieRecordPayload` interface. For -example, `MySqlDebeziumAvroPayload` and `PostgresDebeziumAvroPayload` provides support for seamlessly applying changes -captured via Debezium for MySQL and PostgresDB. `AWSDmsAvroPayload` provides support for applying changes captured via -Amazon Database Migration Service onto S3. +example, [`MySqlDebeziumAvroPayload`](https://github.com/apache/hudi/blob/e76dd102bcaf8aec5a932e7277ccdbfd73ce1a32/hudi-common/src/main/java/org/apache/hudi/common/model/debezium/MySqlDebeziumAvroPayload.java) +and [`PostgresDebeziumAvroPayload`](https://github.com/apache/hudi/blob/e76dd102bcaf8aec5a932e7277ccdbfd73ce1a32/hudi-common/src/main/java/org/apache/hudi/common/model/debezium/PostgresDebeziumAvroPayload.java) +provides support for seamlessly applying changes captured via Debezium for MySQL and PostgresDB. +[`AWSDmsAvroPayload`](https://github.com/apache/hudi/blob/e76dd102bcaf8aec5a932e7277ccdbfd73ce1a32/hudi-common/src/main/java/org/apache/hudi/common/model/AWSDmsAvroPayload.java) +provides support for applying changes captured via Amazon Database Migration Service onto S3. Record payloads are tunable to suit many use cases. Please check out the configurations listed [here](/docs/configurations#RECORD_PAYLOAD). Moreover, if users want to implement their own custom merge logic, -please check -out [this FAQ](/docs/faq/#can-i-implement-my-own-logic-for-how-input-records-are-merged-with-record-on-storage). In a +please check out [this FAQ](/docs/faq/#can-i-implement-my-own-logic-for-how-input-records-are-merged-with-record-on-storage). In a separate document, we will talk about a new record merger API for optimized payload handling. diff --git a/website/src/theme/DocPage/index.js b/website/src/theme/DocPage/index.js index 640f37b3915..978dfb6318f 100644 --- a/website/src/theme/DocPage/index.js +++ b/website/src/theme/DocPage/index.js @@ -128,7 +128,7 @@ function DocPageContent({ ); } -const arrayOfPages = (matchPath) => [`${matchPath}/configurations`, `${matchPath}/basic_configurations`, `${matchPath}/timeline`, `${matchPath}/table_types`, `${matchPath}/migration_guide`, `${matchPath}/compaction`, `${matchPath}/clustering`, `${matchPath}/indexing`, `${matchPath}/metadata`, `${matchPath}/metadata_indexing`]; +const arrayOfPages = (matchPath) => [`${matchPath}/configurations`, `${matchPath}/basic_configurations`, `${matchPath}/timeline`, `${matchPath}/table_types`, `${matchPath}/migration_guide`, `${matchPath}/compaction`, `${matchPath}/clustering`, `${matchPath}/indexing`, `${matchPath}/metadata`, `${matchPath}/metadata_indexing`, `${matchPath}/record_payload`]; const showCustomStylesForDocs = (matchPath, pathname) => arrayOfPages(matchPath).includes(pathname); function DocPage(props) { const {