This is an automated email from the ASF dual-hosted git repository.

vinoyang pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new d5f3f65  [HUDI-769] Add blog for HoodieMultiTableDeltaStreamer (#2073)
d5f3f65 is described below

commit d5f3f65df714abd980f59167e4766b4fab69dd72
Author: Pratyaksh Sharma <[email protected]>
AuthorDate: Sat Sep 19 05:42:34 2020 +0530

    [HUDI-769] Add blog for HoodieMultiTableDeltaStreamer (#2073)
---
 docs/_data/authors.yml                             |   4 +
 docs/_docs/2_2_writing_data.md                     |   2 +
 ...2020-08-22-ingest-multiple-tables-using-hudi.md | 104 +++++++++++++++++++++
 3 files changed, 110 insertions(+)

diff --git a/docs/_data/authors.yml b/docs/_data/authors.yml
index ed4a29e..2e41201 100644
--- a/docs/_data/authors.yml
+++ b/docs/_data/authors.yml
@@ -27,3 +27,7 @@ vinoyang:
 vbalaji:
     name: Balaji Varadarajan
     web: https://cwiki.apache.org/confluence/display/~vbalaji
+
+pratyakshsharma:
+    name: Pratyaksh Sharma
+    web: https://cwiki.apache.org/confluence/display/~pratyakshsharma
diff --git a/docs/_docs/2_2_writing_data.md b/docs/_docs/2_2_writing_data.md
index 9b63d13..5c0a76b 100644
--- a/docs/_docs/2_2_writing_data.md
+++ b/docs/_docs/2_2_writing_data.md
@@ -210,6 +210,8 @@ Sample config files for table wise overridden properties can be found under `hud
   --op BULK_INSERT
 ```
 
+For detailed information on how to configure and use `HoodieMultiTableDeltaStreamer`, please refer to the [blog section](/blog/ingest-multiple-tables-using-hudi).
+
 ## Datasource Writer
 
 The `hudi-spark` module offers the DataSource API to write (and read) a Spark DataFrame into a Hudi table. There are a number of options available:
diff --git a/docs/_posts/2020-08-22-ingest-multiple-tables-using-hudi.md b/docs/_posts/2020-08-22-ingest-multiple-tables-using-hudi.md
new file mode 100644
index 0000000..f5a4aaf
--- /dev/null
+++ b/docs/_posts/2020-08-22-ingest-multiple-tables-using-hudi.md
@@ -0,0 +1,104 @@
+---
+title: "Ingest multiple tables using Hudi"
+excerpt: "Ingesting multiple tables using Hudi in a single run is now possible. This blog gives a detailed explanation of how to achieve the same using `HoodieMultiTableDeltaStreamer.java`"
+author: pratyakshsharma
+category: blog
+---
+
+When building a change data capture pipeline for existing or newly created relational databases, one of the most common problems one faces is simplifying the onboarding process for multiple tables. Ingesting multiple tables into Hudi datasets in a single run is now possible using the `HoodieMultiTableDeltaStreamer` class, a wrapper on top of the more popular `HoodieDeltaStreamer` class. Currently `HoodieMultiTableDeltaStreamer` supports **COPY_ON_WRITE** storage type only an [...]
+
+This blog will guide you through configuring and running `HoodieMultiTableDeltaStreamer`.
+
+### Configuration
+
+ - `HoodieMultiTableDeltaStreamer` expects users to maintain table-wise overridden properties in separate files in a dedicated config folder. Common properties can be configured via a common properties file as well.
+ - By default, Hudi datasets are created under the path `<base-path-prefix>/<database_name>/<name_of_table_to_be_ingested>`. You need to provide the names of the tables to be ingested via the property `hoodie.deltastreamer.ingestion.tablesToBeIngested` in the format `<database>.<table>`, for example:
+ 
+```java
+hoodie.deltastreamer.ingestion.tablesToBeIngested=db1.table1,db2.table2
+``` 
+ 
+ - If you do not provide a database name, the table is assumed to belong to the `default` database, and the Hudi dataset for the concerned table is created under the path `<base-path-prefix>/default/<name_of_table_to_be_ingested>`. There is also a provision to override the default path: you can place the Hudi dataset for a particular table elsewhere by setting the property `hoodie.deltastreamer.ingestion.targetBasePath` in its table-level config file.
+ - There are a number of properties that one might like to override per table, for example:
+ 
+```java
+hoodie.datasource.write.recordkey.field=_row_key
+hoodie.datasource.write.partitionpath.field=created_at
+hoodie.deltastreamer.source.kafka.topic=topic2
+hoodie.deltastreamer.keygen.timebased.timestamp.type=UNIX_TIMESTAMP
+hoodie.deltastreamer.keygen.timebased.input.dateformat=yyyy-MM-dd HH:mm:ss.S
+hoodie.datasource.hive_sync.table=short_trip_uber_hive_dummy_table
+hoodie.deltastreamer.ingestion.targetBasePath=s3:///temp/hudi/table1
+```  
+ 
+ - Properties like the above need to be set for every table to be ingested. As suggested at the beginning, users are expected to maintain a separate config file for every table, registered by setting the property below:
+ 
+```java
+hoodie.deltastreamer.ingestion.<db>.<table>.configFile=s3:///tmp/config/config1.properties
+``` 
+
+If you do not want to set the above property for every table, you can simply create config files for every table to be ingested under the config folder, with the name `<database>_<table>_config.properties`. For example, if you want to ingest table1 and table2 from the dummy database, with the config folder set to `s3:///tmp/config`, you need to create 2 config files at the paths `s3:///tmp/config/dummy_table1_config.properties` and `s3:///tmp/config/dummy_table2_config.properties`.
+
+ - Finally, you can specify all the common properties in a common properties file. The common properties file does not necessarily have to reside in the config folder, but it is advisable to keep it alongside the other config files. This file contains the properties below:
+ 
+```java
+hoodie.deltastreamer.ingestion.tablesToBeIngested=db1.table1,db2.table2
+hoodie.deltastreamer.ingestion.db1.table1.configFile=s3:///tmp/config_table1.properties
+hoodie.deltastreamer.ingestion.db2.table2.configFile=s3:///tmp/config_table2.properties
+``` 
+
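+Putting the path-resolution rules above together, here is a hypothetical sketch (database, table, and path names are illustrative, not defaults) of where the datasets land:
+
+```java
+# assume the job is launched with --base-path-prefix s3:///temp/hudi
+hoodie.deltastreamer.ingestion.tablesToBeIngested=db1.table1,table2
+# db1.table1 -> s3:///temp/hudi/db1/table1
+# table2 (no database given) -> s3:///temp/hudi/default/table2
+# a per-table override in that table's config file takes precedence:
+# hoodie.deltastreamer.ingestion.targetBasePath=s3:///temp/hudi/custom/table1
+```
+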
+### Run Command
+
+`HoodieMultiTableDeltaStreamer` is run in the same way as `HoodieDeltaStreamer`. Please refer to the example below for the command.
+
+
+### Example
+
+Suppose you want to ingest table1 and table2 from db1 under the path `s3:///temp/hudi`. You can ingest them using the command below:
+
+```java
+[hoodie]$ spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` \
+  --props s3:///temp/hudi-ingestion-config/kafka-source.properties \
+  --config-folder s3:///temp/hudi-ingestion-config \
+  --schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider \
+  --source-class org.apache.hudi.utilities.sources.AvroKafkaSource \
+  --source-ordering-field impressiontime \
+  --base-path-prefix s3:///temp/hudi \
+  --target-table dummy_table \
+  --op UPSERT
+```
+
+s3:///temp/hudi-ingestion-config/kafka-source.properties
+
+```java
+hoodie.deltastreamer.ingestion.tablesToBeIngested=db1.table1,db1.table2
+hoodie.deltastreamer.ingestion.db1.table1.configFile=s3:///temp/hudi-ingestion-config/config_table1.properties
+hoodie.deltastreamer.ingestion.db1.table2.configFile=s3:///temp/hudi-ingestion-config/config_table2.properties
+
+#Kafka props
+bootstrap.servers=localhost:9092
+auto.offset.reset=earliest
+schema.registry.url=http://localhost:8081
+
+hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
+```
+
+s3:///temp/hudi-ingestion-config/config_table1.properties
+
+```java
+hoodie.datasource.write.recordkey.field=_row_key1
+hoodie.datasource.write.partitionpath.field=created_at
+hoodie.deltastreamer.source.kafka.topic=topic1
+```
+
+s3:///temp/hudi-ingestion-config/config_table2.properties
+
+```java
+hoodie.datasource.write.recordkey.field=_row_key2
+hoodie.datasource.write.partitionpath.field=created_at
+hoodie.deltastreamer.source.kafka.topic=topic2
+```
+
+Contributions are welcome for extending multi-table ingestion support to the **MERGE_ON_READ** storage type and enabling `HoodieMultiTableDeltaStreamer` to ingest multiple tables in parallel.
+
+Happy ingesting! 
\ No newline at end of file
