[GitHub] [iceberg] rdblue commented on a change in pull request #1261: Spark: [DOC] guide about structured streaming sink for Iceberg

GitBox Wed, 29 Jul 2020 02:32:25 -0700


rdblue commented on a change in pull request #1261:
URL: https://github.com/apache/iceberg/pull/1261#discussion_r461904586




##########
File path: site/docs/spark.md
##########
@@ -520,6 +520,28 @@ data.writeTo("prod.db.table")
     .createOrReplace()
 ```
 
+### Writing from streaming query (Structured Streaming)
+
+To write values from streaming query to Iceberg table, use `writeStream`:
+
+```scala
+data.writeStream
+    .format("iceberg")
+    .outputMode("append")
+    .option("path", pathToTable)
+    .option("checkpointLocation", checkpointPath)
+    .start()

Review comment:
       This looks specific to 2.4. Should we have a 3.0 example and a separate 
2.4 example like the other sections?
   
   An alternative is to create a new page for Spark Streaming and add the docs 
there. Then we could have a table like the one at the top of the Spark page 
that explains what is supported in different versions.

##########
File path: site/docs/spark.md
##########
@@ -520,6 +520,28 @@ data.writeTo("prod.db.table")
     .createOrReplace()
 ```
 
+### Writing from streaming query (Structured Streaming)
+
+To write values from streaming query to Iceberg table, use `writeStream`:
+
+```scala
+data.writeStream
+    .format("iceberg")
+    .outputMode("append")
+    .option("path", pathToTable)
+    .option("checkpointLocation", checkpointPath)
+    .start()
+```
+
+`append` and `complete` modes are supported. The table should be created in prior to start the streaming query.
+ 
+!!! Note
+    To avoid metadata growing too huge, there're several guides you may want to follow: 

Review comment:
       I think this is worth a section, not just a note.
   
   > Streaming queries can create new table versions quickly, which creates 
lots of table metadata to track those versions. Maintaining metadata by tuning 
the rate of commits, expiring old snapshots, and automatically cleaning up 
metadata files is highly recommended.
   
   Then you could give an overview of those options and links to further docs, 
like the table property docs for delete-after-commit.
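
As a rough illustration of what that section could cover, here is a minimal sketch of the three options, assuming the Spark 2.4 path-based write from the diff above and a `Table` handle that has already been loaded. The object name, method names, the one-minute trigger interval, and the retention values are hypothetical placeholders, not recommendations. The two table properties are the ones I understand control metadata file cleanup; please check them against the table property docs before copying.

```scala
import java.util.concurrent.TimeUnit

import org.apache.iceberg.Table
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.{StreamingQuery, Trigger}

// Hypothetical helper object; every name in here is for illustration only.
object StreamingMaintenanceSketch {

  // 1. Tune the rate of commits: a longer trigger interval means fewer
  //    micro-batches, and so fewer table versions. One minute is an assumed value.
  def startAppend(data: DataFrame, pathToTable: String, checkpointPath: String): StreamingQuery =
    data.writeStream
      .format("iceberg")
      .outputMode("append")
      .trigger(Trigger.ProcessingTime(1, TimeUnit.MINUTES))
      .option("path", pathToTable)
      .option("checkpointLocation", checkpointPath)
      .start()

  // Timestamp (epoch millis) before which snapshots are considered old.
  def snapshotCutoffMillis(nowMillis: Long, retentionDays: Long): Long =
    nowMillis - TimeUnit.DAYS.toMillis(retentionDays)

  // 2. Expire old snapshots through the Iceberg table API.
  def expireOldSnapshots(table: Table, retentionDays: Long): Unit =
    table.expireSnapshots()
      .expireOlderThan(snapshotCutoffMillis(System.currentTimeMillis(), retentionDays))
      .commit()

  // Table properties that ask Iceberg to delete old metadata files after each commit.
  def metadataCleanupProperties(previousVersionsMax: Int): Map[String, String] = Map(
    "write.metadata.delete-after-commit.enabled" -> "true",
    "write.metadata.previous-versions-max" -> previousVersionsMax.toString
  )

  // 3. Automatically clean up metadata files by setting those properties.
  def enableMetadataCleanup(table: Table, previousVersionsMax: Int): Unit = {
    val update = table.updateProperties()
    metadataCleanupProperties(previousVersionsMax).foreach { case (k, v) => update.set(k, v) }
    update.commit()
  }
}
```

Snapshot expiration would typically run as a separate scheduled job rather than inside the streaming query, while the property change only needs to be applied once per table.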




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



