Seems like the documentation about creating new IO connectors [1] is out of 
date and it makes people get confused about the recommended way of developing :

“Splittable DoFn is a new sources framework that is under development and will 
replace the other options for developing bounded and unbounded sources."

Do you think we need rewrite this section completely according to a large 
progress with moving to SDF-based connectors in last time? Though, it would be 
useful to keep an old (current) one since Source API is still used.


[1] https://beam.apache.org/documentation/io/developing-io-overview/

> On 25 Nov 2020, at 19:37, Boyuan Zhang <[email protected]> wrote:
> 
> +dev <mailto:[email protected]> 
> 
> Hi Bashir,
> 
> Most recently we are recommending to use Splittable DoFn[1] to build new IO 
> connectors. We have several examples for that in our codebase:
> Java examples:
> Kafka 
> <https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/ReadFromKafkaDoFn.java#L118>
>  - An I/O connector for Apache Kafka <https://kafka.apache.org/> (an 
> open-source distributed event streaming platform).
> Watch 
> <https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Watch.java#L787>
>  - Uses a polling function producing a growing set of outputs for each input 
> until a per-input termination condition is met.
> Parquet 
> <https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/java/io/parquet/src/main/java/org/apache/beam/sdk/io/parquet/ParquetIO.java#L365>
>  - An I/O connector for Apache Parquet <https://parquet.apache.org/> (an 
> open-source columnar storage format).
> HL7v2 
> <https://github.com/apache/beam/blob/6fdde4f4eab72b49b10a8bb1cb3be263c5c416b5/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/healthcare/HL7v2IO.java#L493>
>  - An I/O connector for HL7v2 messages (a clinical messaging format that 
> provides data about events that occur inside an organization) part of 
> Google’s Cloud Healthcare API <https://cloud.google.com/healthcare>.
> BoundedSource wrapper 
> <https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L248>
>  - A wrapper which converts an existing BoundedSource implementation to a 
> splittable DoFn.
> UnboundedSource wrapper 
> <https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java#L432>
>  - A wrapper which converts an existing UnboundedSource implementation to a 
> splittable DoFn.
> 
> Python examples:
> BoundedSourceWrapper 
> <https://github.com/apache/beam/blob/571338b0cc96e2e80f23620fe86de5c92dffaccc/sdks/python/apache_beam/io/iobase.py#L1375>
>  - A wrapper which converts an existing BoundedSource implementation to a 
> splittable DoFn.
> 
> [1] https://beam.apache.org/documentation/programming-guide/#splittable-dofns 
> <https://beam.apache.org/documentation/programming-guide/#splittable-dofns>
> On Wed, Nov 25, 2020 at 8:19 AM Bashir Sadjad <[email protected] 
> <mailto:[email protected]>> wrote:
> Hi,
> 
> I have a scenario in which a streaming pipeline should read update messages 
> from MySQL binlog (through Debezium). To implement this pipeline using Beam, 
> I understand there is a KafkaIO which I can use. But I also want to support a 
> local mode in which there is no Kafka and the messages are directly consumed 
> using embedded Debezium because this is a much simpler architecture (no 
> Kafka, ZooKeeper, and Kafka Connect).
> 
> I did a little bit of search and it seems there is no IO connector for 
> Debezim, hence I have to implement one following this guide 
> <https://beam.apache.org/documentation/io/developing-io-java/>. I wonder:
> 
> 1) Does this approach make sense or is it better to rely on Kafka even for 
> the local single machine use case?
> 
> 2) Beside the above guide, is there any simple example IO that I can follow 
> to implement the UnboundedSource/Reader? I have looked at some examples here 
> <https://github.com/apache/beam/tree/master/sdks/java/io> but was wondering 
> if there is a recommended/simple one as a tutorial.
> 
> Thanks
> 
> -B
> P.S. If this is better suited for dev@, please feel free to move it to that 
> list.

Reply via email to