This is an automated email from the ASF dual-hosted git repository.
danny0405 pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new e2d4a2b9a08 [HUDI-7403][DOCS] Support Filter/Transformer to Hudi Exporter Utility (#11549)
e2d4a2b9a08 is described below
commit e2d4a2b9a0822dec205cf6de407e40c646745f96
Author: Vova Kolmakov <[email protected]>
AuthorDate: Tue Jul 2 07:28:01 2024 +0700
[HUDI-7403][DOCS] Support Filter/Transformer to Hudi Exporter Utility (#11549)
Co-authored-by: Vova Kolmakov <[email protected]>
---
website/docs/snapshot_exporter.md | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
diff --git a/website/docs/snapshot_exporter.md b/website/docs/snapshot_exporter.md
index aee29e3c1cc..07986d0bb8b 100644
--- a/website/docs/snapshot_exporter.md
+++ b/website/docs/snapshot_exporter.md
@@ -20,6 +20,9 @@ query, perform any repartitioning if required and will write the data as Hudi, p
|--output-format|Output format for the exported dataset; accepts these values: json,parquet,hudi|required||
|--output-partition-field|A field to be used by Spark repartitioning|optional|Ignored when "Hudi" or when --output-partitioner is specified. The output dataset's default partition field will inherit from the source Hudi dataset.|
|--output-partitioner|A class to facilitate custom repartitioning|optional|Ignored when using output-format "Hudi"|
+|--transformer-class|A subclass of org.apache.hudi.utilities.transform.Transformer. Allows transforming raw source Dataset to a target Dataset (conforming to target schema) before writing.|optional|Ignored when using output-format "Hudi". Available transformers: org.apache.hudi.utilities.transform.SqlQueryBasedTransformer, org.apache.hudi.utilities.transform.SqlFileBasedTransformer, org.apache.hudi.utilities.transform.FlatteningTransformer, org.apache.hudi.utilities.transform.AWSDmsTrans [...]
+|--transformer-sql|SQL query template used to transform the source before writing. The query should reference the source as a table named "\<SRC\>".|optional|Required for the SqlQueryBasedTransformer transformer class; ignored in other cases|
+|--transformer-sql-file|File containing a SQL query to be executed during write. The query should reference the source as a table named "\<SRC\>".|optional|Required for the SqlFileBasedTransformer transformer class; ignored in other cases|
## Examples
@@ -51,6 +54,23 @@ spark-submit \
--output-format "json" # or "parquet"
```
+### Export to json or parquet dataset with transformation/filtering
+The Exporter supports custom transformation/filtering on records before writing to a json or parquet dataset. This is done by supplying an
+implementation of `org.apache.hudi.utilities.transform.Transformer` via the `--transformer-class` option.
+
+```bash
+spark-submit \
+  --jars "packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-0.15.0.jar" \
+  --deploy-mode "client" \
+  --class "org.apache.hudi.utilities.HoodieSnapshotExporter" \
+  packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.15.0.jar \
+  --source-base-path "/tmp/" \
+  --target-output-path "/tmp/exported/json/" \
+  --transformer-class "org.apache.hudi.utilities.transform.SqlQueryBasedTransformer" \
+  --transformer-sql "SELECT substr(rider,1,10) as rider, trip_type as tripType FROM <SRC> WHERE trip_type = 'BLACK' LIMIT 10" \
+  --output-format "json" # or "parquet"
+```
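+
+For the file-based variant, a sketch of the same export using `SqlFileBasedTransformer` (the SQL file path below is a placeholder; the file should contain a query referencing `<SRC>` as above):
+
+```bash
+spark-submit \
+  --jars "packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-0.15.0.jar" \
+  --deploy-mode "client" \
+  --class "org.apache.hudi.utilities.HoodieSnapshotExporter" \
+  packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.15.0.jar \
+  --source-base-path "/tmp/" \
+  --target-output-path "/tmp/exported/json/" \
+  --transformer-class "org.apache.hudi.utilities.transform.SqlFileBasedTransformer" \
+  --transformer-sql-file "/tmp/transformer.sql" \
+  --output-format "json" # or "parquet"
+```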
+
### Re-partitioning
When exporting to a different format, the Exporter takes the `--output-partition-field` parameter to do some custom re-partitioning.
Note: All `_hoodie_*` metadata fields will be stripped during export, so make sure to use an existing non-metadata field as the output partition field.