This is an automated email from the ASF dual-hosted git repository.

bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new 8cd33fa3ad8 [DOCS] Update bootstrap page (#9338)
8cd33fa3ad8 is described below

commit 8cd33fa3ad842afd900b1aef00e93fdf1bbe6e7f
Author: Bhavani Sudha Saktheeswaran <[email protected]>
AuthorDate: Thu Aug 3 12:05:54 2023 -0700

    [DOCS] Update bootstrap page (#9338)
    
    * [HUDI-6112] Fix bugs in Doc Generation tool
    
    - Add Config Param in Description
    - Styling changes to fix table size and toc on side for better navigation
    - Bug fix in basic configs page to merge spark datasource related read and 
write configs
    
    * [DOCS] Update bootstrap page with configs
---
 website/docs/migration_guide.md    | 49 +++++++++++++++++++++++++++++++-------
 website/src/theme/DocPage/index.js |  2 +-
 2 files changed, 41 insertions(+), 10 deletions(-)

diff --git a/website/docs/migration_guide.md b/website/docs/migration_guide.md
index 31cbdbcb956..ce36f3a38f0 100644
--- a/website/docs/migration_guide.md
+++ b/website/docs/migration_guide.md
@@ -3,6 +3,9 @@ title: Bootstrapping
 keywords: [ hudi, migration, use case]
 summary: In this page, we will discuss some available tools for migrating your 
existing table into a Hudi table
 last_modified_at: 2019-12-30T15:59:57-04:00
+toc: true
+toc_min_heading_level: 2
+toc_max_heading_level: 4
 ---
 
 Hudi maintains metadata such as the commit timeline and indexes to manage a table.
The commit timeline helps to understand the actions happening on a table as
well as the current state of the table. Indexes are used by Hudi to maintain a
record key to file id mapping to efficiently locate a record. At the moment,
Hudi supports writing only the parquet columnar format.
@@ -35,12 +38,20 @@ Import your existing table into a Hudi managed table. Since 
all the data is Hudi
 
 There are a few options when choosing this approach.
 
-**Option 1**
-Use the HoodieStreamer tool. HoodieStreamer supports bootstrap with 
--run-bootstrap command line option. There are two types of bootstrap,
-METADATA_ONLY and FULL_RECORD. METADATA_ONLY will generate just skeleton base 
files with keys/footers, avoiding full cost of rewriting the dataset.
-FULL_RECORD will perform a full copy/rewrite of the data as a Hudi table.
+#### Using Hudi Streamer
+
+Use the [Hudi Streamer](/docs/hoodie_deltastreamer#hudi-streamer) tool. HoodieStreamer supports bootstrap with the
+--run-bootstrap command line option. There are two types of bootstrap, METADATA_ONLY and FULL_RECORD. METADATA_ONLY
+will generate just skeleton base files with keys/footers, avoiding the full cost of rewriting the dataset. FULL_RECORD
+will perform a full copy/rewrite of the data as a Hudi table. Additionally, one can choose selective partitions using
+regex patterns to apply one of the above bootstrap modes.
+
+Here is an example of running FULL_RECORD bootstrap on all partitions that match the regex pattern `.*` and keeping
+hive-style partitioning with HoodieStreamer. This example configures
+[hoodie.bootstrap.mode.selector](https://hudi.apache.org/docs/configurations#hoodiebootstrapmodeselector) to
+`org.apache.hudi.client.bootstrap.selector.BootstrapRegexModeSelector`, which allows applying `FULL_RECORD` bootstrap
+mode to selective partitions based on the regex pattern
+[hoodie.bootstrap.mode.selector.regex](https://hudi.apache.org/docs/configurations#hoodiebootstrapmodeselectorregex).
 
-Here is an example for running FULL_RECORD bootstrap and keeping hive style 
partition with HoodieStreamer.
 ```
 spark-submit --master local \
 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
@@ -54,13 +65,14 @@ spark-submit --master local \
 --hoodie-conf hoodie.datasource.write.partitionpath.field=${PARTITION_FIELD} \
 --hoodie-conf hoodie.datasource.write.precombine.field=${PRECOMBINE_FIELD} \
 --hoodie-conf 
hoodie.bootstrap.keygen.class=org.apache.hudi.keygen.SimpleKeyGenerator \
---hoodie-conf 
hoodie.bootstrap.full.input.provider=org.apache.hudi.bootstrap.SparkParquetBootstrapDataProvider
 \
 --hoodie-conf 
hoodie.bootstrap.mode.selector=org.apache.hudi.client.bootstrap.selector.BootstrapRegexModeSelector
 \
+--hoodie-conf hoodie.bootstrap.mode.selector.regex='.*' \
 --hoodie-conf hoodie.bootstrap.mode.selector.regex.mode=FULL_RECORD \
 --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true
 ``` 
 
-**Option 2**
+#### Using Spark Datasource Writer
+
 For huge tables, this could be as simple as:
 ```java
 for partition in [list of partitions in source table] {
@@ -69,7 +81,12 @@ for partition in [list of partitions in source table] {
 }
 ```  
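As a rough sketch of the per-partition loop above (the source format, field names, and paths are illustrative
placeholders, not values from this guide, and a running `spark` session is assumed):

```scala
import org.apache.spark.sql.SaveMode

// Placeholder partition list; in practice, list partitions from the source table.
val partitions = Seq("2023/08/01", "2023/08/02")

partitions.foreach { partition =>
  // Read one partition of the existing (non-Hudi) table.
  val inputDF = spark.read.format("parquet").load(s"s3://source_table/$partition")

  // Append it into the Hudi table; key/partition fields below are assumptions.
  inputDF.write.format("hudi")
    .option("hoodie.datasource.write.recordkey.field", "uuid")
    .option("hoodie.datasource.write.partitionpath.field", "partition")
    .mode(SaveMode.Append)
    .save("s3://hudi_table_base_path")
}
```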
 
-**Option 3**
+#### Using Spark SQL CALL Procedure
+
+Refer to the [Bootstrap procedure](https://hudi.apache.org/docs/next/procedures#bootstrap) for more details.
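For illustration, a call might look like the following sketch; the procedure and parameter names here are assumptions
and should be verified against the procedures page linked above:

```sql
CALL run_bootstrap(
  table => 'bootstrap_table',
  table_type => 'COPY_ON_WRITE',
  bootstrap_path => '/tmp/source_table',
  base_path => '/tmp/hoodie/bootstrap_table',
  rowKey_field => 'id',
  partition_path_field => 'dt'
);
```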
+
+#### Using Hudi CLI
+
 Write your own custom logic for loading an existing table into a Hudi managed one. Please read about the RDD API
[here](/docs/quick-start-guide). Alternatively, use the `bootstrap run` CLI command. Once hudi has been built via
`mvn clean install -DskipTests`, the shell can be fired via `cd hudi-cli && ./hudi-cli.sh`.
@@ -77,4 +94,18 @@ fired by via `cd hudi-cli && ./hudi-cli.sh`.
 ```java
 hudi->bootstrap run --srcPath /tmp/source_table --targetPath 
/tmp/hoodie/bootstrap_table --tableName bootstrap_table --tableType 
COPY_ON_WRITE --rowKeyField ${KEY_FIELD} --partitionPathField 
${PARTITION_FIELD} --sparkMaster local --hoodieConfigs 
hoodie.datasource.write.hive_style_partitioning=true --selectorClass 
org.apache.hudi.client.bootstrap.selector.FullRecordBootstrapModeSelector
 ```
-Unlike deltaStream, FULL_RECORD or METADATA_ONLY is set with --selectorClass, 
see detalis with help "bootstrap run".
+Unlike Hudi Streamer, FULL_RECORD or METADATA_ONLY is set with --selectorClass; see details with `help "bootstrap run"`.
+
+
+## Configs
+
+Here are the basic configs that control bootstrapping.
+
+| Config Name | Default | Description |
+| ------------------------------------------- | ------------------ | ----------- |
+| hoodie.bootstrap.base.path | N/A **(Required)** | Base path of the dataset that needs to be bootstrapped as a Hudi table<br /><br />`Config Param: BASE_PATH`<br />`Since Version: 0.6.0` [...]
+| hoodie.bootstrap.mode.selector | org.apache.hudi.client.bootstrap.selector.MetadataOnlyBootstrapModeSelector (Optional) | Selects the mode in which each file/partition in the bootstrapped dataset gets bootstrapped<br />Possible values:<ul><li>`org.apache.hudi.client.bootstrap.selector.MetadataOnlyBootstrapModeSelector`: In this mode, the full record data is not copied into Hudi therefore it avoids full cost of rewriting the dataset. Instead, 'skeleton' files c [...]
+| hoodie.bootstrap.mode.selector.regex | .* (Optional) | Matches each bootstrap dataset partition against this regex and applies the mode below to it. This is **applicable only when** `hoodie.bootstrap.mode.selector` equals `org.apache.hudi.client.bootstrap.selector.BootstrapRegexModeSelector`<br /><br />`Config Param: PARTITION_SELECTOR_REGEX_PATTERN`<br />`Since Version: 0.6.0` [...]
+| hoodie.bootstrap.mode.selector.regex.mode | METADATA_ONLY (Optional) | When specified, applies one of the possible <u>[Bootstrap Modes](https://github.com/apache/hudi/blob/bc583b4158684c23f35d787de5afda13c2865ad4/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/bootstrap/BootstrapMode.java)</u> to the partitions that match the regex provided as part of the `hoodie.bootstrap.mode.select [...]
+
+By default, with only `hoodie.bootstrap.base.path` provided, the METADATA_ONLY mode is selected. For other options,
+please refer to [bootstrap configs](https://hudi.apache.org/docs/next/configurations#Bootstrap-Configs) for more
+details.
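Putting the configs above together, a minimal sketch of a bootstrap configuration might look like the following
(the base path and regex values are placeholders, not values from this page):

```properties
# Source dataset to bootstrap as a Hudi table (required); path is a placeholder
hoodie.bootstrap.base.path=s3://bucket/source_table
# Select bootstrap mode per partition via regex matching
hoodie.bootstrap.mode.selector=org.apache.hudi.client.bootstrap.selector.BootstrapRegexModeSelector
# Apply FULL_RECORD bootstrap to partitions matching this (illustrative) pattern
hoodie.bootstrap.mode.selector.regex=2023/0[78]/.*
hoodie.bootstrap.mode.selector.regex.mode=FULL_RECORD
```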
diff --git a/website/src/theme/DocPage/index.js 
b/website/src/theme/DocPage/index.js
index fb117ec8024..3e4e22077c4 100644
--- a/website/src/theme/DocPage/index.js
+++ b/website/src/theme/DocPage/index.js
@@ -128,7 +128,7 @@ function DocPageContent({
   );
 }
 
-const arrayOfPages = (matchPath) => [`${matchPath}/configurations`, 
`${matchPath}/basic_configurations`, `${matchPath}/timeline`, 
`${matchPath}/table_types`];
+const arrayOfPages = (matchPath) => [`${matchPath}/configurations`, 
`${matchPath}/basic_configurations`, `${matchPath}/timeline`, 
`${matchPath}/table_types`, `${matchPath}/migration_guide`];
 const showCustomStylesForDocs = (matchPath, pathname) => 
arrayOfPages(matchPath).includes(pathname);
 function DocPage(props) {
   const {
