This is an automated email from the ASF dual-hosted git repository.
vbalaji pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 41754bb [HUDI-403] Publish deployment guide for writing to Hudi using HoodieDeltaStreamer and Spark Data Source
41754bb is described below
commit 41754bb31bb8656d0570371ba2283c987f9a8c22
Author: Balaji Varadarajan <[email protected]>
AuthorDate: Tue Jan 21 15:44:53 2020 -0800
[HUDI-403] Publish deployment guide for writing to Hudi using HoodieDeltaStreamer and Spark Data Source
---
docs/_docs/2_6_deployment.md | 130 +++++++++++++++++++++++++++++++++++++++++--
1 file changed, 126 insertions(+), 4 deletions(-)
diff --git a/docs/_docs/2_6_deployment.md b/docs/_docs/2_6_deployment.md
index 295f8e8..6fdd680 100644
--- a/docs/_docs/2_6_deployment.md
+++ b/docs/_docs/2_6_deployment.md
@@ -11,9 +11,9 @@ This section provides all the help you need to deploy and operate Hudi tables at
Specifically, we will cover the following aspects.
- [Deployment Model](#deploying) : How various Hudi components are deployed and managed.
- [Upgrading Versions](#upgrading) : Picking up new releases of Hudi, guidelines and general best-practices
+ - [Upgrading Versions](#upgrading) : Picking up new releases of Hudi, guidelines and general best-practices.
- [Migrating to Hudi](#migrating) : How to migrate your existing tables to Apache Hudi.
- [Interacting via CLI](#cli) : Using the CLI to perform maintenance or deeper introspection
+ - [Interacting via CLI](#cli) : Using the CLI to perform maintenance or deeper introspection.
- [Monitoring](#monitoring) : Tracking metrics from your hudi tables using popular tools.
- [Troubleshooting](#troubleshooting) : Uncovering, triaging and resolving issues in production.
@@ -23,7 +23,129 @@ All in all, Hudi deploys with no long running servers or additional infrastructu
using existing infrastructure and it's heartening to see other systems adopting similar approaches as well. Hudi writing is done via Spark jobs (DeltaStreamer or custom Spark datasource jobs), deployed per standard Apache Spark [recommendations](https://spark.apache.org/docs/latest/cluster-overview.html). Querying Hudi tables happens via libraries installed into Apache Hive, Apache Spark or Presto and hence no additional infrastructure is necessary.
+A typical Hudi data ingestion can be achieved in 2 modes. In single-run mode, Hudi ingestion reads the next batch of data, ingests it into the Hudi table and exits. In continuous mode, Hudi ingestion runs as a long-running service, executing ingestion in a loop.
+With Merge_On_Read tables, Hudi ingestion also needs to take care of compacting delta files. Compaction can be performed either asynchronously, with compaction running concurrently with ingestion, or serially, with the two running one after another.
+
+### DeltaStreamer
+
+[DeltaStreamer](/docs/writing_data.html#deltastreamer) is the standalone utility to incrementally pull upstream changes from varied sources such as DFS, Kafka and DB changelogs and ingest them into Hudi tables. It runs as a Spark application in 2 modes.
+
+ - **Run Once Mode** : In this mode, DeltaStreamer performs one ingestion round, which includes incrementally pulling events from upstream sources and ingesting them into the Hudi table. Background operations like cleaning old file versions and archiving the hoodie timeline are automatically executed as part of the run. For Merge-On-Read tables, compaction is also run inline as part of ingestion unless disabled by passing the flag "--disable-compaction". By default, compaction is run inline for eve [...]
+
+Here is an example invocation for reading from a Kafka topic in single-run mode and writing to a Merge On Read table type on a YARN cluster.
+
+```java
+[hoodie]$ spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
+ --master yarn \
+ --deploy-mode cluster \
+ --num-executors 10 \
+ --executor-memory 3g \
+ --driver-memory 6g \
+ --conf spark.driver.extraJavaOptions="-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCTimeStamps -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/varadarb_ds_driver.hprof" \
+ --conf spark.executor.extraJavaOptions="-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCTimeStamps -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/varadarb_ds_executor.hprof" \
+ --queue hadoop-platform-queue \
+ --conf spark.scheduler.mode=FAIR \
+ --conf spark.yarn.executor.memoryOverhead=1072 \
+ --conf spark.yarn.driver.memoryOverhead=2048 \
+ --conf spark.task.cpus=1 \
+ --conf spark.executor.cores=1 \
+ --conf spark.task.maxFailures=10 \
+ --conf spark.memory.fraction=0.4 \
+ --conf spark.rdd.compress=true \
+ --conf spark.kryoserializer.buffer.max=200m \
+ --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
+ --conf spark.memory.storageFraction=0.1 \
+ --conf spark.shuffle.service.enabled=true \
+ --conf spark.sql.hive.convertMetastoreParquet=false \
+ --conf spark.ui.port=5555 \
+ --conf spark.driver.maxResultSize=3g \
+ --conf spark.executor.heartbeatInterval=120s \
+ --conf spark.network.timeout=600s \
+ --conf spark.eventLog.overwrite=true \
+ --conf spark.eventLog.enabled=true \
+ --conf spark.eventLog.dir=hdfs:///user/spark/applicationHistory \
+ --conf spark.yarn.max.executor.failures=10 \
+ --conf spark.sql.catalogImplementation=hive \
+ --conf spark.sql.shuffle.partitions=100 \
+ --driver-class-path $HADOOP_CONF_DIR \
+ --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
+ --table-type MERGE_ON_READ \
+ --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
+ --source-ordering-field ts \
+ --target-base-path /user/hive/warehouse/stock_ticks_mor \
+ --target-table stock_ticks_mor \
+ --props /var/demo/config/kafka-source.properties \
+ --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
+```
+
+ - **Continuous Mode** : Here, DeltaStreamer runs in an infinite loop, with each round performing one ingestion round as described in **Run Once Mode**. The frequency of data ingestion can be controlled by the configuration "--min-sync-interval-seconds". For Merge-On-Read tables, compaction is run asynchronously, concurrently with ingestion, unless disabled by passing the flag "--disable-compaction". Every ingestion run triggers a compaction request asynchronously and this frequency [...]
+
+Here is an example invocation for reading from a Kafka topic in continuous mode and writing to a Merge On Read table type on a YARN cluster.
+
+```java
+[hoodie]$ spark-submit --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
+ --master yarn \
+ --deploy-mode cluster \
+ --num-executors 10 \
+ --executor-memory 3g \
+ --driver-memory 6g \
+ --conf spark.driver.extraJavaOptions="-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCTimeStamps -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/varadarb_ds_driver.hprof" \
+ --conf spark.executor.extraJavaOptions="-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCTimeStamps -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/varadarb_ds_executor.hprof" \
+ --queue hadoop-platform-queue \
+ --conf spark.scheduler.mode=FAIR \
+ --conf spark.yarn.executor.memoryOverhead=1072 \
+ --conf spark.yarn.driver.memoryOverhead=2048 \
+ --conf spark.task.cpus=1 \
+ --conf spark.executor.cores=1 \
+ --conf spark.task.maxFailures=10 \
+ --conf spark.memory.fraction=0.4 \
+ --conf spark.rdd.compress=true \
+ --conf spark.kryoserializer.buffer.max=200m \
+ --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
+ --conf spark.memory.storageFraction=0.1 \
+ --conf spark.shuffle.service.enabled=true \
+ --conf spark.sql.hive.convertMetastoreParquet=false \
+ --conf spark.ui.port=5555 \
+ --conf spark.driver.maxResultSize=3g \
+ --conf spark.executor.heartbeatInterval=120s \
+ --conf spark.network.timeout=600s \
+ --conf spark.eventLog.overwrite=true \
+ --conf spark.eventLog.enabled=true \
+ --conf spark.eventLog.dir=hdfs:///user/spark/applicationHistory \
+ --conf spark.yarn.max.executor.failures=10 \
+ --conf spark.sql.catalogImplementation=hive \
+ --conf spark.sql.shuffle.partitions=100 \
+ --driver-class-path $HADOOP_CONF_DIR \
+ --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
+ --table-type MERGE_ON_READ \
+ --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
+ --source-ordering-field ts \
+ --target-base-path /user/hive/warehouse/stock_ticks_mor \
+ --target-table stock_ticks_mor \
+ --props /var/demo/config/kafka-source.properties \
+ --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
+ --continuous
+```
+
+### Spark Datasource Writer Jobs
+
+As described in [Writing Data](/docs/writing_data.html#datasource-writer), you can use the Spark datasource to ingest into a Hudi table. This mechanism allows you to ingest any Spark dataframe in Hudi format. The Hudi Spark DataSource also supports Spark streaming to ingest a streaming source into a Hudi table (see the streaming sketch after the batch example below). For Merge On Read table types, inline compaction is turned on by default and runs after every ingestion run. The compaction frequency can be changed by setting the property "hoodie.compact.inli [...]
+
+Here is an example invocation using the Spark datasource in batch mode.
+
+```java
+inputDF.write()
+ .format("org.apache.hudi")
+ .options(clientOpts) // any of the Hudi client opts can be passed in as well
+ .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
+ .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
+ .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
+ .option(HoodieWriteConfig.TABLE_NAME, tableName)
+ .mode(SaveMode.Append)
+ .save(basePath);
+```
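+
+The Hudi Spark DataSource can be used with Spark structured streaming in much the same way. The snippet below is only a minimal sketch, assuming a streaming dataframe `streamingInput` and a checkpoint directory `checkpointLocation` (both illustrative placeholder names); the Hudi write options simply mirror the batch example above.
+
+```java
+// Illustrative sketch of a streaming write; streamingInput, checkpointLocation,
+// tableName and basePath are placeholder names.
+streamingInput.writeStream()
+ .format("org.apache.hudi")
+ .options(clientOpts) // same Hudi client opts as in the batch example
+ .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
+ .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
+ .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
+ .option(HoodieWriteConfig.TABLE_NAME, tableName)
+ .option("checkpointLocation", checkpointLocation) // required for streaming queries
+ .outputMode("append")
+ .start(basePath);
+```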
+
## Upgrading
New Hudi releases are listed on the [releases page](/releases), with detailed notes that list all the changes and highlights in each release.
@@ -31,7 +153,7 @@ At the end of the day, Hudi is a storage system and with that comes a lot of res
As general guidelines,
- We strive to keep all changes backwards compatible (i.e new code can read old data/timeline files) and we cannot we will provide upgrade/downgrade tools via the CLI
+ - We strive to keep all changes backwards compatible (i.e new code can read old data/timeline files) and when we cannot, we will provide upgrade/downgrade tools via the CLI
- We cannot always guarantee forward compatibility (i.e old code being able to read data/timeline files written by a greater version). This is generally the norm, since no new features can be built otherwise. However, any such large changes will be turned off by default, for a smooth transition to the newer release. After a few releases and once enough users deem the feature stable in production, we will flip the defaults in a subsequent release.
- Always upgrade the query bundles (mr-bundle, presto-bundle, spark-bundle) first and then upgrade the writers (deltastreamer, spark jobs using datasource). This often provides the best experience and it's easy to fix
@@ -54,7 +176,7 @@ For more details, refer to the detailed [migration guide](/docs/migration_guide.
## CLI
Once hudi has been built, the shell can be fired up via `cd hudi-cli && ./hudi-cli.sh`. A hudi table resides on DFS, in a location referred to as the `basePath` and
-we would need this location in order to connect to a Hudi table. Hudi library effectively manages this table internally, using `.hoodie` subfolder to track all metadata
+we would need this location in order to connect to a Hudi table. Hudi library effectively manages this table internally, using `.hoodie` subfolder to track all metadata.
To initialize a hudi table, use the following command.