[GitHub] [hudi] kywe665 commented on a change in pull request #4049: [HUDI-2726] - Docs for HoodieSnapshotExporter

GitBox Tue, 23 Nov 2021 11:22:12 -0800


kywe665 commented on a change in pull request #4049:
URL: https://github.com/apache/hudi/pull/4049#discussion_r755438799




##########
File path: website/docs/snapshot_exporter.md
##########
@@ -0,0 +1,115 @@
+---
+title: Snapshot Exporter
+keywords: [hudi, snapshotexporter, export]
+toc: true
+---
+
+## Introduction
+HoodieSnapshotExporter allows you to copy data from one location to another for backups or other purposes.
+You can write data in Hudi, JSON, or Parquet format. In addition to copying data, you can also repartition data

Review comment:
       fixed

##########
File path: website/docs/snapshot_exporter.md
##########
@@ -0,0 +1,115 @@
+---
+title: Snapshot Exporter
+keywords: [hudi, snapshotexporter, export]
+toc: true
+---
+
+## Introduction
+HoodieSnapshotExporter allows you to copy data from one location to another for backups or other purposes.
+You can write data in Hudi, JSON, or Parquet format. In addition to copying data, you can also repartition data
+with a provided field or implement custom repartitioning by extending a class, as shown in detail below.
+
+## Arguments
+HoodieSnapshotExporter accepts a source path and a destination path. The utility issues a query against the source,
+performs any required repartitioning, and writes the data in Hudi, Parquet, or JSON format.
+
+|Argument|Description|Required|Note|
+|------------|--------|-----------|--|
+|--source-base-path|Base path of the source Hudi dataset to be snapshotted|required||
+|--target-output-path|Output path for storing a particular snapshot|required||
+|--output-format|Output format for the exported dataset; accepted values: json, parquet, hudi|required||
+|--output-partition-field|A field to be used by Spark repartitioning|optional|Ignored when the output format is "hudi" or when --output-partitioner is specified. The output dataset's default partition field is inherited from the source Hudi dataset.|
+|--output-partitioner|A class to facilitate custom repartitioning|optional|Ignored when the output format is "hudi"|
+
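As a rough sketch of how the rules in the table could be checked before calling `spark-submit`, the shell function below validates an output format and reports whether a partition field would be ignored. The function name `check_exporter_args` and its messages are hypothetical and not part of Hudi; only the flag names and accepted values are taken from the table above.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check; not part of Hudi.
# Usage: check_exporter_args <output-format> [output-partition-field]
check_exporter_args() {
  local format="$1"
  local partition_field="${2:-}"

  # Accepted values come from the --output-format row of the table.
  case "$format" in
    json|parquet|hudi) ;;
    *)
      echo "error: --output-format must be json, parquet, or hudi" >&2
      return 1
      ;;
  esac

  # Per the table, the partition field is ignored for hudi output.
  if [ "$format" = "hudi" ] && [ -n "$partition_field" ]; then
    echo "warning: --output-partition-field is ignored for hudi output"
    return 0
  fi
  echo "ok"
}
```

A wrapper script could call this function first and only invoke `spark-submit` when it succeeds.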
+## Examples
+
+### Copy a Hudi dataset
+
+The Exporter scans the source dataset and then copies it to the target output path.
+```bash
+spark-submit \
+  --jars "packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-0.6.0-SNAPSHOT.jar" \
+  --deploy-mode "client" \
+  --class "org.apache.hudi.utilities.HoodieSnapshotExporter" \
+      packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.6.0-SNAPSHOT.jar \
+  --source-base-path "/tmp/" \
+  --target-output-path "/tmp/exported/hudi/" \
+  --output-format "hudi"
+```
+
+### Export to json or parquet dataset
+The Exporter can also convert the source dataset into other formats. Currently, only "json" and "parquet" are supported.
+
+```bash
+spark-submit \
+  --jars "packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-0.6.0-SNAPSHOT.jar" \
+  --deploy-mode "client" \
+  --class "org.apache.hudi.utilities.HoodieSnapshotExporter" \
+      packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.6.0-SNAPSHOT.jar \
+  --source-base-path "/tmp/" \
+  --target-output-path "/tmp/exported/json/" \
+  --output-format "json"  # or "parquet"
+```
+
+### Re-partitioning
+When exporting to a different format, the Exporter takes the `--output-partition-field` parameter and re-partitions the output by that field.
+Note: All `_hoodie_*` metadata fields will be stripped during export, so make sure to use an existing non-metadata field as the output partition field.
+
+By default, if no partitioning parameters are given, the output dataset will not be partitioned.
+
+Example:
+```bash
+spark-submit \
+  --jars "packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-0.6.0-SNAPSHOT.jar" \
+  --deploy-mode "client" \
+  --class "org.apache.hudi.utilities.HoodieSnapshotExporter" \
+      packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.6.0-SNAPSHOT.jar \
+  --source-base-path "/tmp/" \
+  --target-output-path "/tmp/exported/json/" \
+  --output-format "json" \
+  --output-partition-field "symbol"  # assume the source dataset contains a field `symbol`
+```
+
+The output directory will look like this:
+
+```bash
+_SUCCESS symbol=AMRS symbol=AYX symbol=CDMO symbol=CRC symbol=DRNA ...
+```
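To check which partition values an export produced, one could list the `field=value` directories under the target path. The following is a minimal sketch that assumes the `symbol=VALUE` directory layout shown above; the function name `list_partition_values` is made up for illustration and is not part of Hudi.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the values of <field>=<value> subdirectories.
# Usage: list_partition_values <export-dir> <field>
list_partition_values() {
  local dir="$1" field="$2" path base
  for path in "$dir"/"$field"=*/; do
    [ -d "$path" ] || continue   # skip when the glob matches nothing
    path="${path%/}"             # drop the trailing slash
    base="${path##*/}"           # keep only the directory name
    echo "${base#"$field"=}"     # strip the "<field>=" prefix
  done
}
```

With the export from the example above, `list_partition_values /tmp/exported/json symbol` would print one symbol per line; marker files such as `_SUCCESS` are skipped because they do not match the `symbol=*` pattern.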
+
+### Custom Re-partitioning

Review comment:
       fixed




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
