This is an automated email from the ASF dual-hosted git repository.
bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 83b0744d6cf [DOCS] Update write operations page (#9619)
83b0744d6cf is described below
commit 83b0744d6cfdea9671e7ed2b1cf4fa285904f8af
Author: Bhavani Sudha Saktheeswaran <[email protected]>
AuthorDate: Tue Sep 12 09:57:28 2023 -0700
[DOCS] Update write operations page (#9619)
---
website/docs/write_operations.md | 49 ++++++++++++++++++++++++++++++++++++++
website/src/theme/DocPage/index.js | 2 +-
2 files changed, 50 insertions(+), 1 deletion(-)
diff --git a/website/docs/write_operations.md b/website/docs/write_operations.md
index fc0791bcf20..2942dabf1ef 100644
--- a/website/docs/write_operations.md
+++ b/website/docs/write_operations.md
@@ -34,6 +34,55 @@ Hudi supports implementing two types of deletes on data
stored in Hudi tables, b
- Using DataSource, set `PAYLOAD_CLASS_OPT_KEY` to
`"org.apache.hudi.EmptyHoodieRecordPayload"`. This will remove all the records
in the DataSet being submitted.
- Using DataSource or Hudi Streamer, add a column named `_hoodie_is_deleted`
to DataSet. The value of this column must be set to `true` for all the records
to be deleted and either `false` or left null for any records which are to be
upserted.
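+
+As a minimal sketch of the second approach above (the table location, field names and predicate below are placeholders), marking records for deletion with `_hoodie_is_deleted` could look like this with the Spark DataSource writer, assuming a Spark session with the Hudi bundle available:
+
+```scala
+import org.apache.spark.sql.SaveMode
+import org.apache.spark.sql.functions.lit
+
+val basePath = "s3://bucket/hudi/trips"            // hypothetical table location
+
+// Pick the records to remove and flag them via the _hoodie_is_deleted column.
+val toDelete = spark.read.format("hudi").load(basePath)
+  .filter("rider = 'rider-123'")                   // placeholder predicate
+  .withColumn("_hoodie_is_deleted", lit(true))
+
+toDelete.write.format("hudi")
+  .option("hoodie.datasource.write.operation", "upsert")
+  .option("hoodie.datasource.write.recordkey.field", "uuid")
+  .option("hoodie.datasource.write.partitionpath.field", "partitionpath")
+  .option("hoodie.datasource.write.precombine.field", "ts")
+  .option("hoodie.table.name", "trips")
+  .mode(SaveMode.Append)
+  .save(basePath)
+```
+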
+### BOOTSTRAP
+Hudi supports migrating your existing large tables into a Hudi table using the `bootstrap` operation. There are
a couple of ways to approach this. Please refer to the
+[bootstrapping page](https://hudi.apache.org/docs/migration_guide) for more details.
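+
+For illustration only (paths, table name and key fields below are assumptions; the bootstrapping page is the authoritative reference), a metadata-only bootstrap through the Spark DataSource writer looks roughly like this:
+
+```scala
+import org.apache.spark.sql.SaveMode
+
+// The DataFrame contents are not used for bootstrap; the options drive the operation.
+spark.emptyDataFrame.write.format("hudi")
+  .option("hoodie.datasource.write.operation", "bootstrap")
+  .option("hoodie.bootstrap.base.path", "s3://bucket/legacy_parquet_table")  // existing non-Hudi table
+  .option("hoodie.datasource.write.recordkey.field", "uuid")
+  .option("hoodie.datasource.write.partitionpath.field", "partitionpath")
+  .option("hoodie.table.name", "trips")
+  .mode(SaveMode.Overwrite)
+  .save("s3://bucket/hudi/trips")                                            // target Hudi table base path
+```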
+
+### INSERT_OVERWRITE
+This operation rewrites all of the partitions that are present in the input. It can be faster than `upsert`
+for batch ETL jobs that recompute entire target partitions at once (as opposed to incrementally updating the
+target tables), because it completely bypasses the indexing, precombining and other repartitioning steps in
+the upsert write path. This comes in handy for backfills and similar use cases, as shown below.
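+
+A minimal sketch of such a backfill, assuming a recomputed DataFrame `backfillDf` covering the partitions to be replaced (table, path and field names are placeholders):
+
+```scala
+import org.apache.spark.sql.SaveMode
+
+// Only the partitions present in backfillDf are rewritten; all other partitions are left untouched.
+backfillDf.write.format("hudi")
+  .option("hoodie.datasource.write.operation", "insert_overwrite")
+  .option("hoodie.datasource.write.recordkey.field", "uuid")
+  .option("hoodie.datasource.write.partitionpath.field", "partitionpath")
+  .option("hoodie.datasource.write.precombine.field", "ts")
+  .option("hoodie.table.name", "trips")
+  .mode(SaveMode.Append)
+  .save("s3://bucket/hudi/trips")
+```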
+
+### INSERT_OVERWRITE_TABLE
+This operation can be used to overwrite the entire table whenever needed. The Hudi cleaner will eventually
+clean up the previous table snapshot's file groups asynchronously, based on the configured cleaning policy.
+This operation is much faster than issuing explicit deletes.
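+
+The call shape mirrors the partition-level overwrite sketch above; only the operation value changes (`fullSnapshotDf` and the other names are placeholders):
+
+```scala
+import org.apache.spark.sql.SaveMode
+
+// Every existing file group in the table is replaced by the contents of fullSnapshotDf.
+fullSnapshotDf.write.format("hudi")
+  .option("hoodie.datasource.write.operation", "insert_overwrite_table")
+  .option("hoodie.datasource.write.recordkey.field", "uuid")
+  .option("hoodie.datasource.write.partitionpath.field", "partitionpath")
+  .option("hoodie.datasource.write.precombine.field", "ts")
+  .option("hoodie.table.name", "trips")
+  .mode(SaveMode.Append)
+  .save("s3://bucket/hudi/trips")
+```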
+
+### DELETE_PARTITION
+In addition to deleting individual records, Hudi supports deleting entire
partitions in bulk using this operation.
+Deletion of specific partitions can be done using the config
+[`hoodie.datasource.write.partitions.to.delete`](https://hudi.apache.org/docs/configurations#hoodiedatasourcewritepartitionstodelete).
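+
+As a sketch (the table location and partition values are assumptions), dropping two partitions could look like this:
+
+```scala
+import org.apache.spark.sql.SaveMode
+
+val basePath = "s3://bucket/hudi/trips"   // hypothetical table location
+
+// No new records are written; an empty slice of the table serves as the carrier DataFrame.
+spark.read.format("hudi").load(basePath).limit(0).write.format("hudi")
+  .option("hoodie.datasource.write.operation", "delete_partition")
+  .option("hoodie.datasource.write.partitions.to.delete", "americas/brazil/*,asia/india/chennai")
+  .option("hoodie.datasource.write.recordkey.field", "uuid")
+  .option("hoodie.datasource.write.partitionpath.field", "partitionpath")
+  .option("hoodie.table.name", "trips")
+  .mode(SaveMode.Append)
+  .save(basePath)
+```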
+
+
+## Configs
+Here are the basic configs relevant to the write operation types mentioned above. Please refer to
[Write Options](https://hudi.apache.org/docs/configurations#Write-Options) for more
Spark based configs and [Flink
options](https://hudi.apache.org/docs/next/configurations#Flink-Options) for
Flink based configs.
+
+**Spark based configs:**
+
+| Config Name | Default |
Description
[...]
+|------------------------------------------------|----------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[...]
+| hoodie.datasource.write.operation | upsert (Optional) |
Whether to do upsert, insert or bulk_insert for the write operation. Use
bulk_insert to load new data into a table, and thereafter use upsert/insert. Bulk
insert uses a disk-based write path to scale to large inputs without needing
to cache them.<br /><br />`Config Param: OPERATION`
[...]
+| hoodie.datasource.write.precombine.field | ts (Optional) |
Field used in preCombining before actual write. When two records have the same
key value, we will pick the one with the largest value for the precombine
field, determined by Object.compareTo(..)<br /><br />`Config Param:
PRECOMBINE_FIELD`
[...]
+| hoodie.combine.before.insert | false (Optional) | When
inserted records share the same key, controls whether they should be first combined
(i.e. de-duplicated) before writing to storage.<br /><br />`Config Param:
COMBINE_BEFORE_INSERT`
[...]
+| hoodie.datasource.write.insert.drop.duplicates | false (Optional) | If
set to true, records from the incoming dataframe will not overwrite existing
records with the same key during the write operation. This config is deprecated
as of 0.14.0. Please use hoodie.datasource.insert.dup.policy instead.<br /><br
/>`Config Param: INSERT_DROP_DUPS`
[...]
+| hoodie.bulkinsert.sort.mode | NONE (Optional) |
org.apache.hudi.execution.bulkinsert.BulkInsertSortMode: Modes for sorting
records during bulk insert. <ul><li>`NONE(default)`: No sorting. Fastest and
matches `spark.write.parquet()` in number of files and
overhead.</li><li>`GLOBAL_SORT`: This ensures best file sizes, with lowest
memory overhead at cost of sorting.</li><li>`PARTITION_SORT`: Strikes a balance
by only sorting within a Spark RDD partition, still keep [...]
+| hoodie.bootstrap.base.path | N/A **(Required)** |
**Applicable only when** operation type is `bootstrap`. Base path of the
dataset that needs to be bootstrapped as a Hudi table<br /><br />`Config Param:
BASE_PATH`<br />`Since Version: 0.6.0`
[...]
+| hoodie.bootstrap.mode.selector |
org.apache.hudi.client.bootstrap.selector.MetadataOnlyBootstrapModeSelector
(Optional) | Selects the mode in which each file/partition in the
bootstrapped dataset gets bootstrapped<br />Possible
values:<ul><li>`org.apache.hudi.client.bootstrap.selector.MetadataOnlyBootstrapModeSelector`:
In this mode, the full record data is not copied into Hudi therefore it avoids
full cost of rewriting the dataset. Instead, 'skeleton' files co [...]
+| hoodie.datasource.write.partitions.to.delete | N/A **(Required)** |
**Applicable only when** operation type is `delete_partition`. Comma separated
list of partitions to delete. Allows use of wildcard *<br /><br />`Config
Param: PARTITIONS_TO_DELETE`
[...]
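+
+To tie a few of these together, a hypothetical initial load via `bulk_insert` with global sorting might be configured as below (`initialLoadDf`, the table name and field names are placeholders):
+
+```scala
+import org.apache.spark.sql.SaveMode
+
+// One-time load of a large input; GLOBAL_SORT trades sorting cost for well-sized files.
+initialLoadDf.write.format("hudi")
+  .option("hoodie.datasource.write.operation", "bulk_insert")
+  .option("hoodie.bulkinsert.sort.mode", "GLOBAL_SORT")
+  .option("hoodie.datasource.write.recordkey.field", "uuid")
+  .option("hoodie.datasource.write.partitionpath.field", "partitionpath")
+  .option("hoodie.datasource.write.precombine.field", "ts")
+  .option("hoodie.table.name", "trips")
+  .mode(SaveMode.Overwrite)
+  .save("s3://bucket/hudi/trips")
+```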
+
+
+**Flink based configs:**
+
+| Config Name | Default |
Description
|
+|------------------------------------------------|----------------------|-------------------------------------------------------------------------------------|
+| write.operation | upsert (Optional) | The
write operation that this write should do<br /><br /> `Config Param:
OPERATION`|
+| precombine.field | ts (Optional) |
Field used in preCombining before actual write. When two records have the same
key value, we will pick the one with the largest value for the precombine
field, determined by Object.compareTo(..)<br /><br /> `Config Param:
PRECOMBINE_FIELD`|
+| write.precombine | false (Optional) | Flag
to indicate whether to drop duplicates before insert/upsert. By default these
cases will accept duplicates to gain extra performance: 1) insert operation;
2) upsert for a MOR table (the MOR table deduplicates on reading)<br /><br />
`Config Param: PRE_COMBINE`|
+| write.bulk_insert.sort_input | true (Optional) |
Whether to sort the inputs by specific fields for bulk insert tasks, default
true<br /><br /> `Config Param: WRITE_BULK_INSERT_SORT_INPUT`
|
+| write.bulk_insert.sort_input.by_record_key | false (Optional) |
Whether to sort the inputs by record keys for bulk insert tasks, default
false<br /><br /> `Config Param: WRITE_BULK_INSERT_SORT_INPUT_BY_RECORD_KEY`
|
+
+
## Writing path
The following is an inside look at the Hudi write path and the sequence of
events that occur during a write.
diff --git a/website/src/theme/DocPage/index.js
b/website/src/theme/DocPage/index.js
index 817f8474215..ef73ef448bc 100644
--- a/website/src/theme/DocPage/index.js
+++ b/website/src/theme/DocPage/index.js
@@ -128,7 +128,7 @@ function DocPageContent({
);
}
-const arrayOfPages = (matchPath) => [`${matchPath}/configurations`,
`${matchPath}/basic_configurations`, `${matchPath}/timeline`,
`${matchPath}/table_types`, `${matchPath}/migration_guide`,
`${matchPath}/compaction`, `${matchPath}/clustering`, `${matchPath}/indexing`,
`${matchPath}/metadata`, `${matchPath}/metadata_indexing`,
`${matchPath}/record_payload`, `${matchPath}/file_sizing`,
`${matchPath}/hoodie_cleaner`, `${matchPath}/concurrency_control`];
+const arrayOfPages = (matchPath) => [`${matchPath}/configurations`,
`${matchPath}/basic_configurations`, `${matchPath}/timeline`,
`${matchPath}/table_types`, `${matchPath}/migration_guide`,
`${matchPath}/compaction`, `${matchPath}/clustering`, `${matchPath}/indexing`,
`${matchPath}/metadata`, `${matchPath}/metadata_indexing`,
`${matchPath}/record_payload`, `${matchPath}/file_sizing`,
`${matchPath}/hoodie_cleaner`, `${matchPath}/concurrency_control`,
`${matchPath}/write_operations`];
const showCustomStylesForDocs = (matchPath, pathname) =>
arrayOfPages(matchPath).includes(pathname);
function DocPage(props) {
const {