This is an automated email from the ASF dual-hosted git repository.
bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 83b0744d6cf [DOCS] Update write operations page (#9619)
83b0744d6cf is described below
commit 83b0744d6cfdea9671e7ed2b1cf4fa285904f8af
Author: Bhavani Sudha Saktheeswaran <[email protected]>
AuthorDate: Tue Sep 12 09:57:28 2023 -0700
[DOCS] Update write operations page (#9619)
---
website/docs/write_operations.md | 49 ++++++++++++++++++++++++++++++++++++++
website/src/theme/DocPage/index.js | 2 +-
2 files changed, 50 insertions(+), 1 deletion(-)
diff --git a/website/docs/write_operations.md b/website/docs/write_operations.md
index fc0791bcf20..2942dabf1ef 100644
--- a/website/docs/write_operations.md
+++ b/website/docs/write_operations.md
@@ -34,6 +34,55 @@ Hudi supports implementing two types of deletes on data
stored in Hudi tables, b
- Using DataSource, set `PAYLOAD_CLASS_OPT_KEY` to
`"org.apache.hudi.EmptyHoodieRecordPayload"`. This will remove all the records
in the DataSet being submitted.
- Using DataSource or Hudi Streamer, add a column named `_hoodie_is_deleted`
to DataSet. The value of this column must be set to `true` for all the records
to be deleted and either `false` or left null for any records which are to be
upserted.
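+
+As a minimal sketch of the second approach above (the table location, field names and predicate below are placeholders), marking records for deletion with `_hoodie_is_deleted` could look like this with the Spark DataSource writer, assuming a Spark session with the Hudi bundle available:
+
+```scala
+import org.apache.spark.sql.SaveMode
+import org.apache.spark.sql.functions.lit
+
+val basePath = "s3://bucket/hudi/trips"            // hypothetical table location
+
+// Pick the records to remove and flag them via the _hoodie_is_deleted column.
+val toDelete = spark.read.format("hudi").load(basePath)
+  .filter("rider = 'rider-123'")                   // placeholder predicate
+  .withColumn("_hoodie_is_deleted", lit(true))
+
+toDelete.write.format("hudi")
+  .option("hoodie.datasource.write.operation", "upsert")
+  .option("hoodie.datasource.write.recordkey.field", "uuid")
+  .option("hoodie.datasource.write.partitionpath.field", "partitionpath")
+  .option("hoodie.datasource.write.precombine.field", "ts")
+  .option("hoodie.table.name", "trips")
+  .mode(SaveMode.Append)
+  .save(basePath)
+```
+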
+### BOOTSTRAP
+Hudi supports migrating your existing large tables into a Hudi table using the `bootstrap` operation. There are
a couple of ways to approach this. Please refer to the
+[bootstrapping page](https://hudi.apache.org/docs/migration_guide) for more details.
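+
+For illustration only (paths, table name and key fields below are assumptions; the bootstrapping page is the authoritative reference), a metadata-only bootstrap through the Spark DataSource writer looks roughly like this:
+
+```scala
+import org.apache.spark.sql.SaveMode
+
+// The DataFrame contents are not used for bootstrap; the options drive the operation.
+spark.emptyDataFrame.write.format("hudi")
+  .option("hoodie.datasource.write.operation", "bootstrap")
+  .option("hoodie.bootstrap.base.path", "s3://bucket/legacy_parquet_table")  // existing non-Hudi table
+  .option("hoodie.datasource.write.recordkey.field", "uuid")
+  .option("hoodie.datasource.write.partitionpath.field", "partitionpath")
+  .option("hoodie.table.name", "trips")
+  .mode(SaveMode.Overwrite)
+  .save("s3://bucket/hudi/trips")                                            // target Hudi table base path
+```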
+
+### INSERT_OVERWRITE
+This operation rewrites all of the partitions that are present in the input. It can be faster than `upsert`
+for batch ETL jobs that recompute entire target partitions at once (as opposed to incrementally updating the
+target tables), because it completely bypasses the indexing, precombining and other repartitioning steps in
+the upsert write path. This comes in handy for backfills and similar use cases, as shown below.
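+
+A minimal sketch of such a backfill, assuming a recomputed DataFrame `backfillDf` covering the partitions to be replaced (table, path and field names are placeholders):
+
+```scala
+import org.apache.spark.sql.SaveMode
+
+// Only the partitions present in backfillDf are rewritten; all other partitions are left untouched.
+backfillDf.write.format("hudi")
+  .option("hoodie.datasource.write.operation", "insert_overwrite")
+  .option("hoodie.datasource.write.recordkey.field", "uuid")
+  .option("hoodie.datasource.write.partitionpath.field", "partitionpath")
+  .option("hoodie.datasource.write.precombine.field", "ts")
+  .option("hoodie.table.name", "trips")
+  .mode(SaveMode.Append)
+  .save("s3://bucket/hudi/trips")
+```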
+
+### INSERT_OVERWRITE_TABLE
+This operation can be used to overwrite the entire table whenever needed. The Hudi cleaner will eventually
+clean up the previous table snapshot's file groups asynchronously, based on the configured cleaning policy.
+This operation is much faster than issuing explicit deletes.
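+
+The call shape mirrors the partition-level overwrite sketch above; only the operation value changes (`fullSnapshotDf` and the other names are placeholders):
+
+```scala
+import org.apache.spark.sql.SaveMode
+
+// Every existing file group in the table is replaced by the contents of fullSnapshotDf.
+fullSnapshotDf.write.format("hudi")
+  .option("hoodie.datasource.write.operation", "insert_overwrite_table")
+  .option("hoodie.datasource.write.recordkey.field", "uuid")
+  .option("hoodie.datasource.write.partitionpath.field", "partitionpath")
+  .option("hoodie.datasource.write.precombine.field", "ts")
+  .option("hoodie.table.name", "trips")
+  .mode(SaveMode.Append)
+  .save("s3://bucket/hudi/trips")
+```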
+
+### DELETE_PARTITION
+In addition to deleting individual records, Hudi supports deleting entire
partitions in bulk using this operation.
+Deletion of specific partitions can be done using the config
+[`hoodie.datasource.write.partitions.to.delete`](https://hudi.apache.org/docs/configurations#hoodiedatasourcewritepartitionstodelete).
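+
+As a sketch (the table location and partition values are assumptions), dropping two partitions could look like this:
+
+```scala
+import org.apache.spark.sql.SaveMode
+
+val basePath = "s3://bucket/hudi/trips"   // hypothetical table location
+
+// No new records are written; an empty slice of the table serves as the carrier DataFrame.
+spark.read.format("hudi").load(basePath).limit(0).write.format("hudi")
+  .option("hoodie.datasource.write.operation", "delete_partition")
+  .option("hoodie.datasource.write.partitions.to.delete", "americas/brazil/*,asia/india/chennai")
+  .option("hoodie.datasource.write.recordkey.field", "uuid")
+  .option("hoodie.datasource.write.partitionpath.field", "partitionpath")
+  .option("hoodie.table.name", "trips")
+  .mode(SaveMode.Append)
+  .save(basePath)
+```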
+
+
+## Configs
+Here are the basic configs relevant to the write operation types mentioned above. Please refer to
[Write Options](https://hudi.apache.org/docs/configurations#Write-Options) for more
Spark based configs and [Flink
options](https://hudi.apache.org/docs/next/configurations#Flink-Options) for
Flink based configs.
+
+**Spark based configs:**
+
+| Config Name | Default |
Description
[...]
+|------------------------------------------------|----------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[...]
+| hoodie.datasource.write.operation | upsert (Optional) |
Whether to do upsert, insert or bulk_insert for the write operation. Use
bulk_insert to load new data into a table, and thereafter use upsert/insert. Bulk
insert uses a disk-based write path to scale to large inputs without needing
to cache them.<br /><br />`Config Param: OPERATION`
[...]
+| hoodie.datasource.write.precombine.field | ts (Optional) |
Field used in preCombining before actual write. When two records have the same
key value, we will pick the one with the largest value for the precombine
field, determined by Object.compareTo(..)<br /><br />`Config Param:
PRECOMBINE_FIELD`
[...]
+| hoodie.combine.before.insert | false (Optional) | When
inserted records share the same key, controls whether they should be first combined
(i.e. de-duplicated) before writing to storage.<br /><br />`Config Param:
COMBINE_BEFORE_INSERT`
[...]
+| hoodie.datasource.write.insert.drop.duplicates | false (Optional) | If
set to true, records from the incoming dataframe will not overwrite existing
records with the same key during the write operation. This config is deprecated
as of 0.14.0. Please use hoodie.datasource.insert.dup.policy instead.<br /><br
/>`Config Param: INSERT_DROP_DUPS`
[...]
+| hoodie.bulkinsert.sort.mode | NONE (Optional) |
org.apache.hudi.execution.bulkinsert.BulkInsertSortMode: Modes for sorting
records during bulk insert. <ul><li>`NONE(default)`: No sorting. Fastest and
matches `spark.write.parquet()` in number of files and
overhead.</li><li>`GLOBAL_SORT`: This ensures best file sizes, with lowest
memory overhead at cost of sorting.</li><li>`PARTITION_SORT`: Strikes a balance
by only sorting within a Spark RDD partition, still keep [...]
+| hoodie.bootstrap.base.path | N/A **(Required)** |
**Applicable only when** operation type is `bootstrap`. Base path of the
dataset that needs to be bootstrapped as a Hudi table<br /><br />`Config Param:
BASE_PATH`<br />`Since Version: 0.6.0`
[...]
+| hoodie.bootstrap.mode.selector |
org.apache.hudi.client.bootstrap.selector.MetadataOnlyBootstrapModeSelector
(Optional) | Selects the mode in which each file/partition in the
bootstrapped dataset gets bootstrapped<br />Possible
values:<ul><li>`org.apache.hudi.client.bootstrap.selector.MetadataOnlyBootstrapModeSelector`:
In this mode, the full record data is not copied into Hudi therefore it avoids
full cost of rewriting the dataset. Instead, 'skeleton' files co [...]
+| hoodie.datasource.write.partitions.to.delete | N/A **(Required)** |
**Applicable only when** operation type is `delete_partition`. Comma separated
list of partitions to delete. Allows use of wildcard *<br /><br />`Config
Param: PARTITIONS_TO_DELETE`
[...]
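+
+To tie a few of these together, a hypothetical initial load via `bulk_insert` with global sorting might be configured as below (`initialLoadDf`, the table name and field names are placeholders):
+
+```scala
+import org.apache.spark.sql.SaveMode
+
+// One-time load of a large input; GLOBAL_SORT trades sorting cost for well-sized files.
+initialLoadDf.write.format("hudi")
+  .option("hoodie.datasource.write.operation", "bulk_insert")
+  .option("hoodie.bulkinsert.sort.mode", "GLOBAL_SORT")
+  .option("hoodie.datasource.write.recordkey.field", "uuid")
+  .option("hoodie.datasource.write.partitionpath.field", "partitionpath")
+  .option("hoodie.datasource.write.precombine.field", "ts")
+  .option("hoodie.table.name", "trips")
+  .mode(SaveMode.Overwrite)
+  .save("s3://bucket/hudi/trips")
+```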
+
+
+**Flink based configs:**
+
+| Config Name | Default |
Description
|
+|------------------------------------------------|----------------------|-------------------------------------------------------------------------------------|
+| write.operation | upsert (Optional) | The
write operation that this write should do<br /><br /> `Config Param:
OPERATION`|
+| precombine.field | ts (Optional) |
Field used in preCombining before actual write. When two records have the same
key value, we will pick the one with the largest value for the precombine
field, determined by Object.compareTo(..)<br /><br /> `Config Param:
PRECOMBINE_FIELD`|
+| write.precombine | false (Optional) | Flag
to indicate whether to drop duplicates before insert/upsert. By default these
cases will accept duplicates to gain extra performance: 1) insert operation;
2) upsert for a MOR table (the MOR table deduplicates on reading)<br /><br />
`Config Param: PRE_COMBINE`|
+| write.bulk_insert.sort_input | true (Optional) |
Whether to sort the inputs by specific fields for bulk insert tasks, default
true<br /><br /> `Config Param: WRITE_BULK_INSERT_SORT_INPUT`
|
+| write.bulk_insert.sort_input.by_record_key | false (Optional) |
Whether to sort the inputs by record keys for bulk insert tasks, default
false<br /><br /> `Config Param: WRITE_BULK_INSERT_SORT_INPUT_BY_RECORD_KEY`
|
+
+
## Writing path
The following is an inside look at the Hudi write path and the sequence of
events that occur during a write.
diff --git a/website/src/theme/DocPage/index.js
b/website/src/theme/DocPage/index.js
index 817f8474215..ef73ef448bc 100644
--- a/website/src/theme/DocPage/index.js
+++ b/website/src/theme/DocPage/index.js
@@ -128,7 +128,7 @@ function DocPageContent({
);
}
-const arrayOfPages = (matchPath) => [`${matchPath}/configurations`,
`${matchPath}/basic_configurations`, `${matchPath}/timeline`,
`${matchPath}/table_types`, `${matchPath}/migration_guide`,
`${matchPath}/compaction`, `${matchPath}/clustering`, `${matchPath}/indexing`,
`${matchPath}/metadata`, `${matchPath}/metadata_indexing`,
`${matchPath}/record_payload`, `${matchPath}/file_sizing`,
`${matchPath}/hoodie_cleaner`, `${matchPath}/concurrency_control`];
+const arrayOfPages = (matchPath) => [`${matchPath}/configurations`,
`${matchPath}/basic_configurations`, `${matchPath}/timeline`,
`${matchPath}/table_types`, `${matchPath}/migration_guide`,
`${matchPath}/compaction`, `${matchPath}/clustering`, `${matchPath}/indexing`,
`${matchPath}/metadata`, `${matchPath}/metadata_indexing`,
`${matchPath}/record_payload`, `${matchPath}/file_sizing`,
`${matchPath}/hoodie_cleaner`, `${matchPath}/concurrency_control`,
`${matchPath}/write_operations`];
const showCustomStylesForDocs = (matchPath, pathname) =>
arrayOfPages(matchPath).includes(pathname);
function DocPage(props) {
const {