[hudi] branch asf-site updated: [HUDI-6852] Added spark streaming to streaming ingestion page (#9701)

sivabalan Fri, 15 Sep 2023 23:23:58 -0700

This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git



The following commit(s) were added to refs/heads/asf-site by this push:
     new 8c74eeac7e1 [HUDI-6852] Added spark streaming to streaming ingestion 
page (#9701)
8c74eeac7e1 is described below

commit 8c74eeac7e148810c2535d1fb84e7498e9d227f5
Author: Sivabalan Narayanan <[email protected]>
AuthorDate: Sat Sep 16 02:23:45 2023 -0400

    [HUDI-6852] Added spark streaming to streaming ingestion page (#9701)
---
 ...tastreamer.md => hoodie_streaming_ingestion.md} | 189 ++++++++++++++++++++-
 website/docs/overview.md                           |   2 +-
 website/sidebars.js                                |   2 +-
 website/versioned_docs/version-0.10.0/overview.md  |   2 +-
 website/versioned_docs/version-0.10.1/overview.md  |   2 +-
 website/versioned_docs/version-0.11.0/overview.md  |   2 +-
 website/versioned_docs/version-0.11.1/overview.md  |   2 +-
 website/versioned_docs/version-0.12.0/overview.md  |   2 +-
 website/versioned_docs/version-0.12.1/overview.md  |   2 +-
 website/versioned_docs/version-0.12.2/overview.md  |   2 +-
 website/versioned_docs/version-0.12.3/overview.md  |   2 +-
 website/versioned_docs/version-0.13.0/overview.md  |   2 +-
 website/versioned_docs/version-0.13.1/overview.md  |   2 +-
 13 files changed, 200 insertions(+), 13 deletions(-)

diff --git a/website/docs/hoodie_deltastreamer.md 
b/website/docs/hoodie_streaming_ingestion.md
similarity index 88%
rename from website/docs/hoodie_deltastreamer.md
rename to website/docs/hoodie_streaming_ingestion.md
index 3ed3b4aef5e..ac4e4cb73b4 100644
--- a/website/docs/hoodie_deltastreamer.md
+++ b/website/docs/hoodie_streaming_ingestion.md
@@ -1,7 +1,9 @@
 ---
 title: Streaming Ingestion
-keywords: [hudi, streamer, hoodiestreamer]
+keywords: [hudi, streamer, hoodiestreamer, spark_streaming]
 ---
+import Tabs from '@theme/Tabs';
+import TabItem from '@theme/TabItem';
 
 ## Hudi Streamer
 :::danger Important
@@ -380,6 +382,191 @@ jobs: `hoodie.write.meta.key.prefixes = 
'streamer.checkpoint.key'`
 Spark SQL should be configured using this hoodie config:
 hoodie.streamer.source.sql.sql.query = 'select * from source_table'
 
+
+## Structured Streaming
+
+Hudi supports Spark Structured Streaming reads and writes.
+
+### Streaming Write
+You can write Hudi tables using spark's structured streaming. 
+
+<Tabs
+groupId="programming-language"
+defaultValue="python"
+values={[
+{ label: 'Scala', value: 'scala', },
+{ label: 'Python', value: 'python', },
+]}
+>
+
+<TabItem value="scala">
+
+```scala
+// spark-shell
+// prepare to stream write to new table
+import org.apache.spark.sql.streaming.Trigger
+
+val streamingTableName = "hudi_trips_cow_streaming"
+val baseStreamingPath = "file:///tmp/hudi_trips_cow_streaming"
+val checkpointLocation = "file:///tmp/checkpoints/hudi_trips_cow_streaming"
+
+// create streaming df
+val df = spark.readStream.
+        format("hudi").
+        load(basePath)
+
+// write stream to new hudi table
+df.writeStream.format("hudi").
+  options(getQuickstartWriteConfigs).
+  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+  option(TABLE_NAME, streamingTableName).
+  outputMode("append").
+  option("path", baseStreamingPath).
+  option("checkpointLocation", checkpointLocation).
+  trigger(Trigger.Once()).
+  start()
+
+```
+
+</TabItem>
+<TabItem value="python">
+
+```python
+# pyspark
+# prepare to stream write to new table
+streamingTableName = "hudi_trips_cow_streaming"
+baseStreamingPath = "file:///tmp/hudi_trips_cow_streaming"
+checkpointLocation = "file:///tmp/checkpoints/hudi_trips_cow_streaming"
+
+hudi_streaming_options = {
+    'hoodie.table.name': streamingTableName,
+    'hoodie.datasource.write.recordkey.field': 'uuid',
+    'hoodie.datasource.write.partitionpath.field': 'partitionpath',
+    'hoodie.datasource.write.table.name': streamingTableName,
+    'hoodie.datasource.write.operation': 'upsert',
+    'hoodie.datasource.write.precombine.field': 'ts',
+    'hoodie.upsert.shuffle.parallelism': 2,
+    'hoodie.insert.shuffle.parallelism': 2
+}
+
+# create streaming df
+df = spark.readStream \
+    .format("hudi") \
+    .load(basePath)
+
+# write stream to new hudi table
+df.writeStream.format("hudi") \
+    .options(**hudi_streaming_options) \
+    .outputMode("append") \
+    .option("path", baseStreamingPath) \
+    .option("checkpointLocation", checkpointLocation) \
+    .trigger(once=True) \
+    .start()
+
+```
+
+</TabItem>
+
+</Tabs
+>
+
+### Streaming Read
+
+Structured Streaming reads are based on Hudi's Incremental Query feature, 
therefore streaming read can return data for which
+commits and base files were not yet removed by the cleaner. You can control 
commits retention time.
+
+<Tabs
+groupId="programming-language"
+defaultValue="python"
+values={[
+{ label: 'Scala', value: 'scala', },
+{ label: 'Python', value: 'python', },
+]}
+>
+
+<TabItem value="scala">
+
+```scala
+// spark-shell
+// reload data
+df.write.format("hudi").
+  options(getQuickstartWriteConfigs).
+  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+  option(TABLE_NAME, tableName).
+  mode(Overwrite).
+  save(basePath)
+
+// read stream and output results to console
+spark.readStream.
+  format("hudi").
+  load(basePath).
+  writeStream.
+  format("console").
+  start()
+
+// read stream to streaming df
+val df = spark.readStream.
+        format("hudi").
+        load(basePath)
+
+```
+</TabItem>
+
+<TabItem value="python">
+
+```python
+# pyspark
+# reload data
+inserts = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(
+    dataGen.generateInserts(10))
+df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
+
+hudi_options = {
+    'hoodie.table.name': tableName,
+    'hoodie.datasource.write.recordkey.field': 'uuid',
+    'hoodie.datasource.write.partitionpath.field': 'partitionpath',
+    'hoodie.datasource.write.table.name': tableName,
+    'hoodie.datasource.write.operation': 'upsert',
+    'hoodie.datasource.write.precombine.field': 'ts',
+    'hoodie.upsert.shuffle.parallelism': 2,
+    'hoodie.insert.shuffle.parallelism': 2
+}
+
+df.write.format("hudi"). \
+    options(**hudi_options). \
+    mode("overwrite"). \
+    save(basePath)
+
+# read stream to streaming df
+df = spark.readStream \
+    .format("hudi") \
+    .load(basePath)
+
+# ead stream and output results to console
+spark.readStream \
+    .format("hudi") \
+    .load(basePath) \
+    .writeStream \
+    .format("console") \
+    .start()
+
+```
+</TabItem>
+
+</Tabs
+>
+
+
+:::info
+Spark SQL can be used within ForeachBatch sink to do INSERT, UPDATE, DELETE 
and MERGE INTO.
+Target table must exist before write.
+:::
+
+
 ## Flink Ingestion
 
 ### CDC Ingestion
diff --git a/website/docs/overview.md b/website/docs/overview.md
index 5b1b9f12a4d..d9bb7f44a8c 100644
--- a/website/docs/overview.md
+++ b/website/docs/overview.md
@@ -13,7 +13,7 @@ how to learn more to get started.
 Apache Hudi (pronounced “hoodie”) is the next generation [streaming data lake 
platform](/blog/2021/07/21/streaming-data-lake-platform). 
 Apache Hudi brings core warehouse and database functionality directly to a 
data lake. Hudi provides [tables](/docs/next/table_management), 
 [transactions](/docs/next/timeline), [efficient 
upserts/deletes](/docs/next/write_operations), [advanced 
indexes](/docs/next/indexing), 
-[streaming ingestion services](/docs/next/hoodie_deltastreamer), data 
[clustering](/docs/next/clustering)/[compaction](/docs/next/compaction) 
optimizations, 
+[streaming ingestion services](/docs/next/hoodie_streaming_ingestion), data 
[clustering](/docs/next/clustering)/[compaction](/docs/next/compaction) 
optimizations, 
 and [concurrency](/docs/next/concurrency_control) all while keeping your data 
in open source file formats.
 
 Not only is Apache Hudi great for streaming workloads, but it also allows you 
to create efficient incremental batch pipelines. 
diff --git a/website/sidebars.js b/website/sidebars.js
index 753b23b57ff..eee5f1b0f5d 100644
--- a/website/sidebars.js
+++ b/website/sidebars.js
@@ -47,7 +47,7 @@ module.exports = {
                     ],
                 },
                 'writing_data',
-                'hoodie_deltastreamer',
+                'hoodie_streaming_ingestion',
                 'querying_data',
                 'flink_configuration',
                 {
diff --git a/website/versioned_docs/version-0.10.0/overview.md 
b/website/versioned_docs/version-0.10.0/overview.md
index 5b1b9f12a4d..d9bb7f44a8c 100644
--- a/website/versioned_docs/version-0.10.0/overview.md
+++ b/website/versioned_docs/version-0.10.0/overview.md
@@ -13,7 +13,7 @@ how to learn more to get started.
 Apache Hudi (pronounced “hoodie”) is the next generation [streaming data lake 
platform](/blog/2021/07/21/streaming-data-lake-platform). 
 Apache Hudi brings core warehouse and database functionality directly to a 
data lake. Hudi provides [tables](/docs/next/table_management), 
 [transactions](/docs/next/timeline), [efficient 
upserts/deletes](/docs/next/write_operations), [advanced 
indexes](/docs/next/indexing), 
-[streaming ingestion services](/docs/next/hoodie_deltastreamer), data 
[clustering](/docs/next/clustering)/[compaction](/docs/next/compaction) 
optimizations, 
+[streaming ingestion services](/docs/next/hoodie_streaming_ingestion), data 
[clustering](/docs/next/clustering)/[compaction](/docs/next/compaction) 
optimizations, 
 and [concurrency](/docs/next/concurrency_control) all while keeping your data 
in open source file formats.
 
 Not only is Apache Hudi great for streaming workloads, but it also allows you 
to create efficient incremental batch pipelines. 
diff --git a/website/versioned_docs/version-0.10.1/overview.md 
b/website/versioned_docs/version-0.10.1/overview.md
index 5b1b9f12a4d..d9bb7f44a8c 100644
--- a/website/versioned_docs/version-0.10.1/overview.md
+++ b/website/versioned_docs/version-0.10.1/overview.md
@@ -13,7 +13,7 @@ how to learn more to get started.
 Apache Hudi (pronounced “hoodie”) is the next generation [streaming data lake 
platform](/blog/2021/07/21/streaming-data-lake-platform). 
 Apache Hudi brings core warehouse and database functionality directly to a 
data lake. Hudi provides [tables](/docs/next/table_management), 
 [transactions](/docs/next/timeline), [efficient 
upserts/deletes](/docs/next/write_operations), [advanced 
indexes](/docs/next/indexing), 
-[streaming ingestion services](/docs/next/hoodie_deltastreamer), data 
[clustering](/docs/next/clustering)/[compaction](/docs/next/compaction) 
optimizations, 
+[streaming ingestion services](/docs/next/hoodie_streaming_ingestion), data 
[clustering](/docs/next/clustering)/[compaction](/docs/next/compaction) 
optimizations, 
 and [concurrency](/docs/next/concurrency_control) all while keeping your data 
in open source file formats.
 
 Not only is Apache Hudi great for streaming workloads, but it also allows you 
to create efficient incremental batch pipelines. 
diff --git a/website/versioned_docs/version-0.11.0/overview.md 
b/website/versioned_docs/version-0.11.0/overview.md
index 5b1b9f12a4d..d9bb7f44a8c 100644
--- a/website/versioned_docs/version-0.11.0/overview.md
+++ b/website/versioned_docs/version-0.11.0/overview.md
@@ -13,7 +13,7 @@ how to learn more to get started.
 Apache Hudi (pronounced “hoodie”) is the next generation [streaming data lake 
platform](/blog/2021/07/21/streaming-data-lake-platform). 
 Apache Hudi brings core warehouse and database functionality directly to a 
data lake. Hudi provides [tables](/docs/next/table_management), 
 [transactions](/docs/next/timeline), [efficient 
upserts/deletes](/docs/next/write_operations), [advanced 
indexes](/docs/next/indexing), 
-[streaming ingestion services](/docs/next/hoodie_deltastreamer), data 
[clustering](/docs/next/clustering)/[compaction](/docs/next/compaction) 
optimizations, 
+[streaming ingestion services](/docs/next/hoodie_streaming_ingestion), data 
[clustering](/docs/next/clustering)/[compaction](/docs/next/compaction) 
optimizations, 
 and [concurrency](/docs/next/concurrency_control) all while keeping your data 
in open source file formats.
 
 Not only is Apache Hudi great for streaming workloads, but it also allows you 
to create efficient incremental batch pipelines. 
diff --git a/website/versioned_docs/version-0.11.1/overview.md 
b/website/versioned_docs/version-0.11.1/overview.md
index 5b1b9f12a4d..d9bb7f44a8c 100644
--- a/website/versioned_docs/version-0.11.1/overview.md
+++ b/website/versioned_docs/version-0.11.1/overview.md
@@ -13,7 +13,7 @@ how to learn more to get started.
 Apache Hudi (pronounced “hoodie”) is the next generation [streaming data lake 
platform](/blog/2021/07/21/streaming-data-lake-platform). 
 Apache Hudi brings core warehouse and database functionality directly to a 
data lake. Hudi provides [tables](/docs/next/table_management), 
 [transactions](/docs/next/timeline), [efficient 
upserts/deletes](/docs/next/write_operations), [advanced 
indexes](/docs/next/indexing), 
-[streaming ingestion services](/docs/next/hoodie_deltastreamer), data 
[clustering](/docs/next/clustering)/[compaction](/docs/next/compaction) 
optimizations, 
+[streaming ingestion services](/docs/next/hoodie_streaming_ingestion), data 
[clustering](/docs/next/clustering)/[compaction](/docs/next/compaction) 
optimizations, 
 and [concurrency](/docs/next/concurrency_control) all while keeping your data 
in open source file formats.
 
 Not only is Apache Hudi great for streaming workloads, but it also allows you 
to create efficient incremental batch pipelines. 
diff --git a/website/versioned_docs/version-0.12.0/overview.md 
b/website/versioned_docs/version-0.12.0/overview.md
index 5b1b9f12a4d..d9bb7f44a8c 100644
--- a/website/versioned_docs/version-0.12.0/overview.md
+++ b/website/versioned_docs/version-0.12.0/overview.md
@@ -13,7 +13,7 @@ how to learn more to get started.
 Apache Hudi (pronounced “hoodie”) is the next generation [streaming data lake 
platform](/blog/2021/07/21/streaming-data-lake-platform). 
 Apache Hudi brings core warehouse and database functionality directly to a 
data lake. Hudi provides [tables](/docs/next/table_management), 
 [transactions](/docs/next/timeline), [efficient 
upserts/deletes](/docs/next/write_operations), [advanced 
indexes](/docs/next/indexing), 
-[streaming ingestion services](/docs/next/hoodie_deltastreamer), data 
[clustering](/docs/next/clustering)/[compaction](/docs/next/compaction) 
optimizations, 
+[streaming ingestion services](/docs/next/hoodie_streaming_ingestion), data 
[clustering](/docs/next/clustering)/[compaction](/docs/next/compaction) 
optimizations, 
 and [concurrency](/docs/next/concurrency_control) all while keeping your data 
in open source file formats.
 
 Not only is Apache Hudi great for streaming workloads, but it also allows you 
to create efficient incremental batch pipelines. 
diff --git a/website/versioned_docs/version-0.12.1/overview.md 
b/website/versioned_docs/version-0.12.1/overview.md
index 5b1b9f12a4d..d9bb7f44a8c 100644
--- a/website/versioned_docs/version-0.12.1/overview.md
+++ b/website/versioned_docs/version-0.12.1/overview.md
@@ -13,7 +13,7 @@ how to learn more to get started.
 Apache Hudi (pronounced “hoodie”) is the next generation [streaming data lake 
platform](/blog/2021/07/21/streaming-data-lake-platform). 
 Apache Hudi brings core warehouse and database functionality directly to a 
data lake. Hudi provides [tables](/docs/next/table_management), 
 [transactions](/docs/next/timeline), [efficient 
upserts/deletes](/docs/next/write_operations), [advanced 
indexes](/docs/next/indexing), 
-[streaming ingestion services](/docs/next/hoodie_deltastreamer), data 
[clustering](/docs/next/clustering)/[compaction](/docs/next/compaction) 
optimizations, 
+[streaming ingestion services](/docs/next/hoodie_streaming_ingestion), data 
[clustering](/docs/next/clustering)/[compaction](/docs/next/compaction) 
optimizations, 
 and [concurrency](/docs/next/concurrency_control) all while keeping your data 
in open source file formats.
 
 Not only is Apache Hudi great for streaming workloads, but it also allows you 
to create efficient incremental batch pipelines. 
diff --git a/website/versioned_docs/version-0.12.2/overview.md 
b/website/versioned_docs/version-0.12.2/overview.md
index 5b1b9f12a4d..d9bb7f44a8c 100644
--- a/website/versioned_docs/version-0.12.2/overview.md
+++ b/website/versioned_docs/version-0.12.2/overview.md
@@ -13,7 +13,7 @@ how to learn more to get started.
 Apache Hudi (pronounced “hoodie”) is the next generation [streaming data lake 
platform](/blog/2021/07/21/streaming-data-lake-platform). 
 Apache Hudi brings core warehouse and database functionality directly to a 
data lake. Hudi provides [tables](/docs/next/table_management), 
 [transactions](/docs/next/timeline), [efficient 
upserts/deletes](/docs/next/write_operations), [advanced 
indexes](/docs/next/indexing), 
-[streaming ingestion services](/docs/next/hoodie_deltastreamer), data 
[clustering](/docs/next/clustering)/[compaction](/docs/next/compaction) 
optimizations, 
+[streaming ingestion services](/docs/next/hoodie_streaming_ingestion), data 
[clustering](/docs/next/clustering)/[compaction](/docs/next/compaction) 
optimizations, 
 and [concurrency](/docs/next/concurrency_control) all while keeping your data 
in open source file formats.
 
 Not only is Apache Hudi great for streaming workloads, but it also allows you 
to create efficient incremental batch pipelines. 
diff --git a/website/versioned_docs/version-0.12.3/overview.md 
b/website/versioned_docs/version-0.12.3/overview.md
index 5b1b9f12a4d..d9bb7f44a8c 100644
--- a/website/versioned_docs/version-0.12.3/overview.md
+++ b/website/versioned_docs/version-0.12.3/overview.md
@@ -13,7 +13,7 @@ how to learn more to get started.
 Apache Hudi (pronounced “hoodie”) is the next generation [streaming data lake 
platform](/blog/2021/07/21/streaming-data-lake-platform). 
 Apache Hudi brings core warehouse and database functionality directly to a 
data lake. Hudi provides [tables](/docs/next/table_management), 
 [transactions](/docs/next/timeline), [efficient 
upserts/deletes](/docs/next/write_operations), [advanced 
indexes](/docs/next/indexing), 
-[streaming ingestion services](/docs/next/hoodie_deltastreamer), data 
[clustering](/docs/next/clustering)/[compaction](/docs/next/compaction) 
optimizations, 
+[streaming ingestion services](/docs/next/hoodie_streaming_ingestion), data 
[clustering](/docs/next/clustering)/[compaction](/docs/next/compaction) 
optimizations, 
 and [concurrency](/docs/next/concurrency_control) all while keeping your data 
in open source file formats.
 
 Not only is Apache Hudi great for streaming workloads, but it also allows you 
to create efficient incremental batch pipelines. 
diff --git a/website/versioned_docs/version-0.13.0/overview.md 
b/website/versioned_docs/version-0.13.0/overview.md
index 5b1b9f12a4d..d9bb7f44a8c 100644
--- a/website/versioned_docs/version-0.13.0/overview.md
+++ b/website/versioned_docs/version-0.13.0/overview.md
@@ -13,7 +13,7 @@ how to learn more to get started.
 Apache Hudi (pronounced “hoodie”) is the next generation [streaming data lake 
platform](/blog/2021/07/21/streaming-data-lake-platform). 
 Apache Hudi brings core warehouse and database functionality directly to a 
data lake. Hudi provides [tables](/docs/next/table_management), 
 [transactions](/docs/next/timeline), [efficient 
upserts/deletes](/docs/next/write_operations), [advanced 
indexes](/docs/next/indexing), 
-[streaming ingestion services](/docs/next/hoodie_deltastreamer), data 
[clustering](/docs/next/clustering)/[compaction](/docs/next/compaction) 
optimizations, 
+[streaming ingestion services](/docs/next/hoodie_streaming_ingestion), data 
[clustering](/docs/next/clustering)/[compaction](/docs/next/compaction) 
optimizations, 
 and [concurrency](/docs/next/concurrency_control) all while keeping your data 
in open source file formats.
 
 Not only is Apache Hudi great for streaming workloads, but it also allows you 
to create efficient incremental batch pipelines. 
diff --git a/website/versioned_docs/version-0.13.1/overview.md 
b/website/versioned_docs/version-0.13.1/overview.md
index 5b1b9f12a4d..d9bb7f44a8c 100644
--- a/website/versioned_docs/version-0.13.1/overview.md
+++ b/website/versioned_docs/version-0.13.1/overview.md
@@ -13,7 +13,7 @@ how to learn more to get started.
 Apache Hudi (pronounced “hoodie”) is the next generation [streaming data lake 
platform](/blog/2021/07/21/streaming-data-lake-platform). 
 Apache Hudi brings core warehouse and database functionality directly to a 
data lake. Hudi provides [tables](/docs/next/table_management), 
 [transactions](/docs/next/timeline), [efficient 
upserts/deletes](/docs/next/write_operations), [advanced 
indexes](/docs/next/indexing), 
-[streaming ingestion services](/docs/next/hoodie_deltastreamer), data 
[clustering](/docs/next/clustering)/[compaction](/docs/next/compaction) 
optimizations, 
+[streaming ingestion services](/docs/next/hoodie_streaming_ingestion), data 
[clustering](/docs/next/clustering)/[compaction](/docs/next/compaction) 
optimizations, 
 and [concurrency](/docs/next/concurrency_control) all while keeping your data 
in open source file formats.
 
 Not only is Apache Hudi great for streaming workloads, but it also allows you 
to create efficient incremental batch pipelines.

[hudi] branch asf-site updated: [HUDI-6852] Added spark streaming to streaming ingestion page (#9701)

Reply via email to