This is an automated email from the ASF dual-hosted git repository.
yihua pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 4fc0d427a0 [DOCS] Update migration_guide.md (#6275)
4fc0d427a0 is described below
commit 4fc0d427a00cd650057c0458e3a596dfb1d58e9d
Author: Manu <[email protected]>
AuthorDate: Tue Aug 30 13:00:31 2022 +0800
[DOCS] Update migration_guide.md (#6275)
Co-authored-by: Y Ethan Guo <[email protected]>
---
website/docs/migration_guide.md | 42 +++++++++++++---------
.../version-0.11.1/migration_guide.md | 42 +++++++++++++---------
.../version-0.12.0/migration_guide.md | 42 +++++++++++++---------
3 files changed, 78 insertions(+), 48 deletions(-)
diff --git a/website/docs/migration_guide.md b/website/docs/migration_guide.md
index e7dd5c29d7..449d65c376 100644
--- a/website/docs/migration_guide.md
+++ b/website/docs/migration_guide.md
@@ -36,8 +36,29 @@ Import your existing table into a Hudi managed table. Since all the data is Hudi
There are a few options when choosing this approach.
**Option 1**
-Use the HDFSParquetImporter tool. As the name suggests, this only works if your existing table is in parquet file format.
-This tool essentially starts a Spark Job to read the existing parquet table and converts it into a HUDI managed table by re-writing all the data.
+Use the HoodieDeltaStreamer tool. HoodieDeltaStreamer supports bootstrap with the --run-bootstrap command line option. There are two types of bootstrap,
+METADATA_ONLY and FULL_RECORD. METADATA_ONLY will generate just skeleton base files with keys/footers, avoiding the full cost of rewriting the dataset.
+FULL_RECORD will perform a full copy/rewrite of the data as a Hudi table.
+
+Here is an example of running a FULL_RECORD bootstrap with HoodieDeltaStreamer while keeping hive-style partitioning.
+```
+spark-submit --master local \
+--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
+--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` \
+--run-bootstrap \
+--target-base-path /tmp/hoodie/bootstrap_table \
+--target-table bootstrap_table \
+--table-type COPY_ON_WRITE \
+--hoodie-conf hoodie.bootstrap.base.path=/tmp/source_table \
+--hoodie-conf hoodie.datasource.write.recordkey.field=${KEY_FIELD} \
+--hoodie-conf hoodie.datasource.write.partitionpath.field=${PARTITION_FIELD} \
+--hoodie-conf hoodie.datasource.write.precombine.field=${PRECOMBINE_FIELD} \
+--hoodie-conf hoodie.bootstrap.keygen.class=org.apache.hudi.keygen.SimpleKeyGenerator \
+--hoodie-conf hoodie.bootstrap.full.input.provider=org.apache.hudi.bootstrap.SparkParquetBootstrapDataProvider \
+--hoodie-conf hoodie.bootstrap.mode.selector=org.apache.hudi.client.bootstrap.selector.BootstrapRegexModeSelector \
+--hoodie-conf hoodie.bootstrap.mode.selector.regex.mode=FULL_RECORD \
+--hoodie-conf hoodie.datasource.write.hive_style_partitioning=true
+```
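To run a METADATA_ONLY bootstrap of the same source table, only the mode selector value should need to change. The sketch below reuses the example paths and placeholder fields from the command above, relies on the regex selector's default partition regex so that every partition is bootstrapped in METADATA_ONLY mode, and omits the FULL_RECORD-specific input provider; treat it as an illustration rather than a verified command.
```
# METADATA_ONLY bootstrap: writes only skeleton base files that point back to the source data
spark-submit --master local \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` \
--run-bootstrap \
--target-base-path /tmp/hoodie/bootstrap_table \
--target-table bootstrap_table \
--table-type COPY_ON_WRITE \
--hoodie-conf hoodie.bootstrap.base.path=/tmp/source_table \
--hoodie-conf hoodie.datasource.write.recordkey.field=${KEY_FIELD} \
--hoodie-conf hoodie.datasource.write.partitionpath.field=${PARTITION_FIELD} \
--hoodie-conf hoodie.datasource.write.precombine.field=${PRECOMBINE_FIELD} \
--hoodie-conf hoodie.bootstrap.keygen.class=org.apache.hudi.keygen.SimpleKeyGenerator \
--hoodie-conf hoodie.bootstrap.mode.selector=org.apache.hudi.client.bootstrap.selector.BootstrapRegexModeSelector \
--hoodie-conf hoodie.bootstrap.mode.selector.regex.mode=METADATA_ONLY \
--hoodie-conf hoodie.datasource.write.hive_style_partitioning=true
```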
**Option 2**
For huge tables, this could be as simple as :
@@ -50,21 +71,10 @@ for partition in [list of partitions in source table] {
**Option 3**
Write your own custom logic of how to load an existing table into a Hudi managed one. Please read about the RDD API
- [here](/docs/quick-start-guide). Using the HDFSParquetImporter Tool. Once hudi has been built via `mvn clean install -DskipTests`, the shell can be
+[here](/docs/quick-start-guide). Using the bootstrap run CLI. Once hudi has been built via `mvn clean install -DskipTests`, the shell can be
fired via `cd hudi-cli && ./hudi-cli.sh`.
```java
-hudi->hdfsparquetimport
- --upsert false
- --srcPath /user/parquet/table/basepath
- --targetPath /user/hoodie/table/basepath
- --tableName hoodie_table
- --tableType COPY_ON_WRITE
- --rowKeyField _row_key
- --partitionPathField partitionStr
- --parallelism 1500
- --schemaFilePath /user/table/schema
- --format parquet
- --sparkMemory 6g
- --retry 2
+hudi->bootstrap run --srcPath /tmp/source_table --targetPath /tmp/hoodie/bootstrap_table --tableName bootstrap_table --tableType COPY_ON_WRITE --rowKeyField ${KEY_FIELD} --partitionPathField ${PARTITION_FIELD} --sparkMaster local --hoodieConfigs hoodie.datasource.write.hive_style_partitioning=true --selectorClass org.apache.hudi.client.bootstrap.selector.FullRecordBootstrapModeSelector
```
+Unlike HoodieDeltaStreamer, FULL_RECORD or METADATA_ONLY is set with --selectorClass; see details with help "bootstrap run".
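For example, a METADATA_ONLY bootstrap from the CLI would swap the selector class shown above for its metadata-only counterpart; this is a sketch reusing the same placeholder fields, with the class name assumed to follow the same org.apache.hudi.client.bootstrap.selector package.
```java
hudi->bootstrap run --srcPath /tmp/source_table --targetPath /tmp/hoodie/bootstrap_table --tableName bootstrap_table --tableType COPY_ON_WRITE --rowKeyField ${KEY_FIELD} --partitionPathField ${PARTITION_FIELD} --sparkMaster local --hoodieConfigs hoodie.datasource.write.hive_style_partitioning=true --selectorClass org.apache.hudi.client.bootstrap.selector.MetadataOnlyBootstrapModeSelector
```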
diff --git a/website/versioned_docs/version-0.11.1/migration_guide.md b/website/versioned_docs/version-0.11.1/migration_guide.md
index e7dd5c29d7..7f5ccf2d9c 100644
--- a/website/versioned_docs/version-0.11.1/migration_guide.md
+++ b/website/versioned_docs/version-0.11.1/migration_guide.md
@@ -36,8 +36,29 @@ Import your existing table into a Hudi managed table. Since all the data is Hudi
There are a few options when choosing this approach.
**Option 1**
-Use the HDFSParquetImporter tool. As the name suggests, this only works if your existing table is in parquet file format.
-This tool essentially starts a Spark Job to read the existing parquet table and converts it into a HUDI managed table by re-writing all the data.
+Use the HoodieDeltaStreamer tool. HoodieDeltaStreamer supports bootstrap with the --run-bootstrap command line option. There are two types of bootstrap,
+METADATA_ONLY and FULL_RECORD. METADATA_ONLY will generate just skeleton base files with keys/footers, avoiding the full cost of rewriting the dataset.
+FULL_RECORD will perform a full copy/rewrite of the data as a Hudi table.
+
+Here is an example of running a FULL_RECORD bootstrap with HoodieDeltaStreamer while keeping hive-style partitioning.
+```
+spark-submit --master local \
+--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
+--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` \
+--run-bootstrap \
+--target-base-path /tmp/hoodie/bootstrap_table \
+--target-table bootstrap_table \
+--table-type COPY_ON_WRITE \
+--hoodie-conf hoodie.bootstrap.base.path=/tmp/source_table \
+--hoodie-conf hoodie.datasource.write.recordkey.field=${KEY_FIELD} \
+--hoodie-conf hoodie.datasource.write.partitionpath.field=${PARTITION_FIELD} \
+--hoodie-conf hoodie.datasource.write.precombine.field=${PRECOMBINE_FIELD} \
+--hoodie-conf hoodie.bootstrap.keygen.class=org.apache.hudi.keygen.SimpleKeyGenerator \
+--hoodie-conf hoodie.bootstrap.full.input.provider=org.apache.hudi.bootstrap.SparkParquetBootstrapDataProvider \
+--hoodie-conf hoodie.bootstrap.mode.selector=org.apache.hudi.client.bootstrap.selector.BootstrapRegexModeSelector \
+--hoodie-conf hoodie.bootstrap.mode.selector.regex.mode=FULL_RECORD \
+--hoodie-conf hoodie.datasource.write.hive_style_partitioning=true
+```
**Option 2**
For huge tables, this could be as simple as :
@@ -50,21 +71,10 @@ for partition in [list of partitions in source table] {
**Option 3**
Write your own custom logic of how to load an existing table into a Hudi managed one. Please read about the RDD API
- [here](/docs/quick-start-guide). Using the HDFSParquetImporter Tool. Once hudi has been built via `mvn clean install -DskipTests`, the shell can be
+ [here](/docs/quick-start-guide). Using the bootstrap run CLI. Once hudi has been built via `mvn clean install -DskipTests`, the shell can be
fired via `cd hudi-cli && ./hudi-cli.sh`.
```java
-hudi->hdfsparquetimport
- --upsert false
- --srcPath /user/parquet/table/basepath
- --targetPath /user/hoodie/table/basepath
- --tableName hoodie_table
- --tableType COPY_ON_WRITE
- --rowKeyField _row_key
- --partitionPathField partitionStr
- --parallelism 1500
- --schemaFilePath /user/table/schema
- --format parquet
- --sparkMemory 6g
- --retry 2
+hudi->bootstrap run --srcPath /tmp/source_table --targetPath /tmp/hoodie/bootstrap_table --tableName bootstrap_table --tableType COPY_ON_WRITE --rowKeyField ${KEY_FIELD} --partitionPathField ${PARTITION_FIELD} --sparkMaster local --hoodieConfigs hoodie.datasource.write.hive_style_partitioning=true --selectorClass org.apache.hudi.client.bootstrap.selector.FullRecordBootstrapModeSelector
```
+Unlike HoodieDeltaStreamer, FULL_RECORD or METADATA_ONLY is set with --selectorClass; see details with help "bootstrap run".
diff --git a/website/versioned_docs/version-0.12.0/migration_guide.md b/website/versioned_docs/version-0.12.0/migration_guide.md
index e7dd5c29d7..fa5b663f56 100644
--- a/website/versioned_docs/version-0.12.0/migration_guide.md
+++ b/website/versioned_docs/version-0.12.0/migration_guide.md
@@ -36,8 +36,29 @@ Import your existing table into a Hudi managed table. Since all the data is Hudi
There are a few options when choosing this approach.
**Option 1**
-Use the HDFSParquetImporter tool. As the name suggests, this only works if your existing table is in parquet file format.
-This tool essentially starts a Spark Job to read the existing parquet table and converts it into a HUDI managed table by re-writing all the data.
+Use the HoodieDeltaStreamer tool. HoodieDeltaStreamer supports bootstrap with the --run-bootstrap command line option. There are two types of bootstrap,
+METADATA_ONLY and FULL_RECORD. METADATA_ONLY will generate just skeleton base files with keys/footers, avoiding the full cost of rewriting the dataset.
+FULL_RECORD will perform a full copy/rewrite of the data as a Hudi table.
+
+Here is an example of running a FULL_RECORD bootstrap with HoodieDeltaStreamer while keeping hive-style partitioning.
+```
+spark-submit --master local \
+--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
+--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` \
+--run-bootstrap \
+--target-base-path /tmp/hoodie/bootstrap_table \
+--target-table bootstrap_table \
+--table-type COPY_ON_WRITE \
+--hoodie-conf hoodie.bootstrap.base.path=/tmp/source_table \
+--hoodie-conf hoodie.datasource.write.recordkey.field=${KEY_FIELD} \
+--hoodie-conf hoodie.datasource.write.partitionpath.field=${PARTITION_FIELD} \
+--hoodie-conf hoodie.datasource.write.precombine.field=${PRECOMBINE_FIELD} \
+--hoodie-conf hoodie.bootstrap.keygen.class=org.apache.hudi.keygen.SimpleKeyGenerator \
+--hoodie-conf hoodie.bootstrap.full.input.provider=org.apache.hudi.bootstrap.SparkParquetBootstrapDataProvider \
+--hoodie-conf hoodie.bootstrap.mode.selector=org.apache.hudi.client.bootstrap.selector.BootstrapRegexModeSelector \
+--hoodie-conf hoodie.bootstrap.mode.selector.regex.mode=FULL_RECORD \
+--hoodie-conf hoodie.datasource.write.hive_style_partitioning=true
+```
**Option 2**
For huge tables, this could be as simple as :
@@ -50,21 +71,10 @@ for partition in [list of partitions in source table] {
**Option 3**
Write your own custom logic of how to load an existing table into a Hudi managed one. Please read about the RDD API
- [here](/docs/quick-start-guide). Using the HDFSParquetImporter Tool. Once hudi has been built via `mvn clean install -DskipTests`, the shell can be
+[here](/docs/quick-start-guide). Using the bootstrap run CLI. Once hudi has been built via `mvn clean install -DskipTests`, the shell can be
fired via `cd hudi-cli && ./hudi-cli.sh`.
```java
-hudi->hdfsparquetimport
- --upsert false
- --srcPath /user/parquet/table/basepath
- --targetPath /user/hoodie/table/basepath
- --tableName hoodie_table
- --tableType COPY_ON_WRITE
- --rowKeyField _row_key
- --partitionPathField partitionStr
- --parallelism 1500
- --schemaFilePath /user/table/schema
- --format parquet
- --sparkMemory 6g
- --retry 2
+hudi->bootstrap run --srcPath /tmp/source_table --targetPath /tmp/hoodie/bootstrap_table --tableName bootstrap_table --tableType COPY_ON_WRITE --rowKeyField ${KEY_FIELD} --partitionPathField ${PARTITION_FIELD} --sparkMaster local --hoodieConfigs hoodie.datasource.write.hive_style_partitioning=true --selectorClass org.apache.hudi.client.bootstrap.selector.FullRecordBootstrapModeSelector
```
+Unlike HoodieDeltaStreamer, FULL_RECORD or METADATA_ONLY is set with --selectorClass; see details with help "bootstrap run".
\ No newline at end of file