This is an automated email from the ASF dual-hosted git repository.
boyuanz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/beam.git
The following commit(s) were added to refs/heads/master by this push:
new 4c51569 Add splittable dofn as the recommended way of building connectors.
new 3de140f Merge pull request #13227 from [BEAM-10480] Add splittable dofn as the recommended way of building connectors.
4c51569 is described below
commit 4c51569d3b972e2271efcc520a48ccb1bd20c9be
Author: Boyuan Zhang <[email protected]>
AuthorDate: Thu Oct 29 15:13:35 2020 -0700
Add splittable dofn as the recommended way of building connectors.
---
.../en/documentation/io/developing-io-java.md | 3 +
.../en/documentation/io/developing-io-overview.md | 80 +++++++++++++---------
.../en/documentation/io/developing-io-python.md | 3 +
3 files changed, 53 insertions(+), 33 deletions(-)
diff --git a/website/www/site/content/en/documentation/io/developing-io-java.md b/website/www/site/content/en/documentation/io/developing-io-java.md
index 7de2024..7836a3c 100644
--- a/website/www/site/content/en/documentation/io/developing-io-java.md
+++ b/website/www/site/content/en/documentation/io/developing-io-java.md
@@ -17,6 +17,9 @@ limitations under the License.
-->
# Developing I/O connectors for Java
+**IMPORTANT:** Use ``Splittable DoFn`` to develop your new I/O. For more
+details, read the
+[new I/O connector overview](/documentation/io/developing-io-overview/).
+
To connect to a data store that isn’t supported by Beam’s existing I/O
connectors, you must create a custom I/O connector that usually consist of a
source and a sink. All Beam sources and sinks are composite transforms; however,
diff --git a/website/www/site/content/en/documentation/io/developing-io-overview.md b/website/www/site/content/en/documentation/io/developing-io-overview.md
index 0ea507f..c8e0482 100644
--- a/website/www/site/content/en/documentation/io/developing-io-overview.md
+++ b/website/www/site/content/en/documentation/io/developing-io-overview.md
@@ -46,33 +46,32 @@ are the recommended steps to get started:
For **bounded (batch) sources**, there are currently two options for creating a
Beam source:
+1. Use `Splittable DoFn`.
+
1. Use `ParDo` and `GroupByKey`.
-1. Use the `Source` interface and extend the `BoundedSource` abstract subclass.
-`ParDo` is the recommended option, as implementing a `Source` can be tricky. See
-[When to use the Source interface](#when-to-use-source) for a list of some use
-cases where you might want to use a `Source` (such as
-[dynamic work rebalancing](/blog/2016/05/18/splitAtFraction-method.html)).
+`Splittable DoFn` is the recommended option, as it is the most recent source
+framework for both bounded and unbounded sources, and is meant to replace the
+`Source` APIs
+([BoundedSource](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/BoundedSource.html)
+and
+[UnboundedSource](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/UnboundedSource.html)).
+Read the
+[Splittable DoFn Programming Guide](/documentation/programming-guide/#splittable-dofns)
+for how to write a Splittable DoFn. For more information, see the
+[roadmap for multi-SDK connector efforts](/roadmap/connectors-multi-sdk/).
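As background for readers of this change, the claim-based model that `Splittable DoFn` is built around can be sketched in plain Python. This is an illustrative stand-in, not the SDK's actual API; the real primitives live in `apache_beam.io.restriction_trackers`, and the class and function bodies below are assumptions made for the sketch:

```python
class OffsetRange:
    """A half-open range [start, stop) of record offsets (illustrative)."""
    def __init__(self, start, stop):
        self.start, self.stop = start, stop

class OffsetRestrictionTracker:
    """Hands out offsets one claim at a time (illustrative stand-in)."""
    def __init__(self, restriction):
        self.restriction = restriction
        self.position = restriction.start

    def try_claim(self, offset):
        # A claim succeeds only while the offset is inside the restriction;
        # if a runner splits the restriction, `stop` shrinks and the
        # processing loop below exits cleanly at the split point.
        if offset < self.restriction.stop:
            self.position = offset + 1
            return True
        return False

def process(records, tracker):
    """Emit only the records whose offsets this tracker still owns."""
    out = []
    for offset in range(tracker.restriction.start, len(records)):
        if not tracker.try_claim(offset):
            break
        out.append(records[offset])
    return out
```

Because every record is claimed before it is emitted, the runner can safely shrink or split the restriction while processing is in flight, which is what makes the same model work for both bounded and unbounded reads.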
-(Java only) For **unbounded (streaming) sources**, you must use the `Source`
-interface and extend the `UnboundedSource` abstract subclass. `UnboundedSource`
-supports features that are useful for streaming pipelines, such as
-checkpointing.
+For Java and Python **unbounded (streaming) sources**, you must use a
+`Splittable DoFn`, which supports features that are useful for streaming
+pipelines, including checkpointing, controlling the watermark, and tracking
+the backlog.
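Checkpointing, one of the streaming features mentioned above, amounts to splitting the current restriction at the position reached so far, so the unprocessed remainder can be deferred and resumed later. A minimal sketch, with hypothetical names (not the SDK's API):

```python
class OffsetRange:
    """Half-open range [start, stop) of offsets (illustrative)."""
    def __init__(self, start, stop):
        self.start, self.stop = start, stop

def checkpoint(current_position, restriction):
    """Split the restriction at the current position: the primary part covers
    work already done, and the residual is handed back to the runner to be
    resumed later (hypothetical helper, for illustration only)."""
    primary = OffsetRange(restriction.start, current_position)
    residual = OffsetRange(current_position, restriction.stop)
    return primary, residual

# After processing offsets 0..39 of [0, 100), a checkpoint defers [40, 100).
primary, residual = checkpoint(40, OffsetRange(0, 100))
```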
-Splittable DoFn is a new sources framework that is under development and will
-replace the other options for developing bounded and unbounded sources. For more
-information, see the
-[roadmap for multi-SDK connector efforts](/roadmap/connectors-multi-sdk/).
-### When to use the Source interface {#when-to-use-source}
+### When to use the Splittable DoFn interface {#when-to-use-splittable-dofn}
-If you are not sure whether to use `Source`, feel free to email the [Beam dev
-mailing list](/get-started/support) and we can discuss the
-specific pros and cons of your case.
+If you are not sure whether to use `Splittable DoFn`, feel free to email the
+[Beam dev mailing list](/get-started/support) and we can discuss the specific
+pros and cons of your case.
-In some cases, implementing a `Source` might be necessary or result in better
-performance:
+In some cases, implementing a `Splittable DoFn` might be necessary or result
+in better performance:
* **Unbounded sources:** `ParDo` does not work for reading from unbounded
sources. `ParDo` does not support checkpointing or mechanisms like de-duping
@@ -90,22 +89,40 @@ performance:
jobs. Depending on your data source, dynamic work rebalancing might not be
possible.
-* **Splitting into parts of particular size recommended by the runner:** `ParDo`
- does not receive `desired_bundle_size` as a hint from runners when performing
- initial splitting.
+* **Splitting initially to increase parallelism:** `ParDo`
+ does not have the ability to perform initial splitting.
For example, if you'd like to read from a new file format that contains many
records per file, or if you'd like to read from a key-value store that supports
read operations in sorted key order.
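The initial splitting that `ParDo` lacks is simple to picture: before any processing starts, the framework divides the initial restriction into sub-ranges so multiple workers can read in parallel. A hedged sketch of that arithmetic (in the Python SDK the real hook is `RestrictionProvider.split`; `split_evenly` here is a made-up helper):

```python
def split_evenly(start, stop, desired_parts):
    """Split the half-open range [start, stop) into roughly equal sub-ranges
    so each can be read by a separate worker (hypothetical helper)."""
    size = stop - start
    parts = min(desired_parts, size) or 1
    step, extra = divmod(size, parts)
    splits, pos = [], start
    for i in range(parts):
        # Distribute the remainder across the first `extra` sub-ranges.
        end = pos + step + (1 if i < extra else 0)
        splits.append((pos, end))
        pos = end
    return splits

# 10 offsets over 4 workers: the first two sub-ranges get the extra records.
print(split_evenly(0, 10, 4))  # [(0, 3), (3, 6), (6, 8), (8, 10)]
```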
-### Source lifecycle {#source}
-Here is a sequence diagram that shows the lifecycle of the Source during
- the execution of the Read transform of an IO. The comments give useful
- information to IO developers such as the constraints that
- apply to the objects or particular cases such as streaming mode.
-
- <!-- The source for the sequence diagram can be found in the the SVG resource. -->
-
+### I/O examples using SDFs
+**Java Examples**
+
+* [Kafka](https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/ReadFromKafkaDoFn.java#L118):
+An I/O connector for [Apache Kafka](https://kafka.apache.org/)
+(an open-source distributed event streaming platform).
+* [Watch](https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Watch.java#L787):
+Uses a polling function that produces a growing set of outputs for each input
+until a per-input termination condition is met.
+* [Parquet](https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/java/io/parquet/src/main/java/org/apache/beam/sdk/io/parquet/ParquetIO.java#L365):
+An I/O connector for [Apache Parquet](https://parquet.apache.org/)
+(an open-source columnar storage format).
+* [HL7v2](https://github.com/apache/beam/blob/6fdde4f4eab72b49b10a8bb1cb3be263c5c416b5/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/healthcare/HL7v2IO.java#L493):
+An I/O connector for HL7v2 messages (a clinical messaging format that provides
+data about events that occur inside an organization), part of
+[Google’s Cloud Healthcare API](https://cloud.google.com/healthcare).
+* [BoundedSource wrapper](https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L248):
+A wrapper that converts an existing
+[BoundedSource](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/BoundedSource.html)
+implementation to a splittable DoFn.
+* [UnboundedSource wrapper](https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L432):
+A wrapper that converts an existing
+[UnboundedSource](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/UnboundedSource.html)
+implementation to a splittable DoFn.
+
+**Python Examples**
+
+* [BoundedSourceWrapper](https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/python/apache_beam/io/iobase.py#L1375):
+A wrapper that converts an existing
+[BoundedSource](https://beam.apache.org/releases/pydoc/current/apache_beam.io.iobase.html#apache_beam.io.iobase.BoundedSource)
+implementation to a splittable DoFn.
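The wrapper approach used by the `BoundedSource` and `UnboundedSource` wrappers above can be pictured in plain Python: a legacy "read everything" source becomes splittable once its records are addressed by offset and each offset must be claimed before it is emitted. Everything below is an illustrative sketch with made-up names, not the SDK's wrapper code:

```python
class OffsetRange:
    """Half-open range [start, stop) of record offsets (illustrative)."""
    def __init__(self, start, stop):
        self.start, self.stop = start, stop

class OffsetTracker:
    """Minimal claim-only tracker (illustrative stand-in)."""
    def __init__(self, restriction):
        self.restriction = restriction
    def try_claim(self, offset):
        return offset < self.restriction.stop

def wrap_bounded_read(read_all):
    """Adapt a 'read everything' callable into a claim-based read, mirroring
    how the wrappers expose legacy sources as splittable DoFns (sketch only)."""
    def read_split(source, tracker):
        records = read_all(source)
        for offset in range(tracker.restriction.start, len(records)):
            if not tracker.try_claim(offset):
                return
            yield records[offset]
    return read_split

read = wrap_bounded_read(lambda src: list(src))
print(list(read("abcd", OffsetTracker(OffsetRange(1, 3)))))  # ['b', 'c']
```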
### Using ParDo and GroupByKey
@@ -157,7 +174,6 @@ example:
cannot be parallelized. In this case, the `ParDo` would open the file and
read in sequence, producing a `PCollection` of records from the file.
-
## Sinks
To create a Beam sink, we recommend that you use a `ParDo` that writes the
@@ -169,8 +185,6 @@ For **file-based sinks**, you can use the `FileBasedSink` abstraction that is
provided by both the Java and Python SDKs. See our language specific
implementation guides for more details:
-* [Developing I/O connectors for Java](/documentation/io/developing-io-java/)
-* [Developing I/O connectors for Python](/documentation/io/developing-io-python/)
diff --git a/website/www/site/content/en/documentation/io/developing-io-python.md b/website/www/site/content/en/documentation/io/developing-io-python.md
index 039b633..7c7705b 100644
--- a/website/www/site/content/en/documentation/io/developing-io-python.md
+++ b/website/www/site/content/en/documentation/io/developing-io-python.md
@@ -19,6 +19,9 @@ limitations under the License.
-->
# Developing I/O connectors for Python
+**IMPORTANT:** Use ``Splittable DoFn`` to develop your new I/O. For more
+details, read the
+[new I/O connector overview](/documentation/io/developing-io-overview/).
+
To connect to a data store that isn’t supported by Beam’s existing I/O
connectors, you must create a custom I/O connector that usually consist of a
source and a sink. All Beam sources and sinks are composite transforms; however,