Repository: beam-site
Updated Branches:
  refs/heads/asf-site 6f44376a9 -> fd7fe0627
Add Read Transform content for Authoring I/O Transforms - Overview

Project: http://git-wip-us.apache.org/repos/asf/beam-site/repo
Commit: http://git-wip-us.apache.org/repos/asf/beam-site/commit/8c9cda3c
Tree: http://git-wip-us.apache.org/repos/asf/beam-site/tree/8c9cda3c
Diff: http://git-wip-us.apache.org/repos/asf/beam-site/diff/8c9cda3c

Branch: refs/heads/asf-site
Commit: 8c9cda3cb174f83f537eb2f4bee2babf77540be8
Parents: 6f44376
Author: Stephen Sisk <[email protected]>
Authored: Tue Mar 28 15:09:28 2017 -0700
Committer: Ahmet Altay <[email protected]>
Committed: Thu Apr 6 16:11:21 2017 -0700

----------------------------------------------------------------------
 src/documentation/io/authoring-java.md     |  8 +++
 src/documentation/io/authoring-overview.md | 67 +++++++++++++++++++++----
 src/documentation/io/io-toc.md             |  9 ++--
 3 files changed, 68 insertions(+), 16 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/beam-site/blob/8c9cda3c/src/documentation/io/authoring-java.md
----------------------------------------------------------------------
diff --git a/src/documentation/io/authoring-java.md b/src/documentation/io/authoring-java.md
index 6cdb6bd..d1d7013 100644
--- a/src/documentation/io/authoring-java.md
+++ b/src/documentation/io/authoring-java.md
@@ -10,6 +10,14 @@ permalink: /documentation/io/authoring-java/
 > Note: This guide is still in progress. There is an open issue to finish the
 > guide: [BEAM-1025](https://issues.apache.org/jira/browse/BEAM-1025).
 
+## Example I/O Transforms
+Currently, Apache Beam's I/O transforms use a variety of different
+styles.
These transforms are good examples to follow:
+* [`DatastoreIO`](https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/datastore/DatastoreIO.java) - `ParDo`-based database read and write that conforms to the PTransform style guide
+* [`BigtableIO`](https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigtable/BigtableIO.java) - Good test examples, and demonstrates Dynamic Work Rebalancing
+* [`JdbcIO`](https://github.com/apache/beam/blob/master/sdks/java/io/jdbc/src/main/java/org/apache/beam/sdk/io/jdbc/JdbcIO.java) - Demonstrates reading using a single `ParDo`+`GroupByKey` when data stores cannot be read in parallel
+
+
 # Next steps
 
 [Testing I/O Transforms]({{site.baseurl }}/documentation/io/testing/)


http://git-wip-us.apache.org/repos/asf/beam-site/blob/8c9cda3c/src/documentation/io/authoring-overview.md
----------------------------------------------------------------------
diff --git a/src/documentation/io/authoring-overview.md b/src/documentation/io/authoring-overview.md
index dab6a85..186d853 100644
--- a/src/documentation/io/authoring-overview.md
+++ b/src/documentation/io/authoring-overview.md
@@ -10,35 +10,80 @@ permalink: /documentation/io/authoring-overview/
 _A guide for users who need to connect to a data store that isn't supported by the [Built-in I/O Transforms]({{site.baseurl }}/documentation/io/built-in/)_
 
-> Note: This guide is still in progress. There is an open issue to finish the guide: [BEAM-1025](https://issues.apache.org/jira/browse/BEAM-1025).
 
 * TOC
 {:toc}
 
 ## Introduction
 
-TODO
+This guide covers how to implement I/O transforms in the Beam model. Beam pipelines use these read and write transforms to import data for processing, and to write data to a store.
+
+Reading and writing data in Beam is a parallel task, and using `ParDo`s, `GroupByKey`s, etc. is usually sufficient.
Rarely, you will need the more specialized `Source` and `Sink` classes for specific features. There are changes coming soon (`SplittableDoFn`, [BEAM-65](https://issues.apache.org/jira/browse/BEAM-65)) that will make `Source` unnecessary.
+
+As you work on your I/O transform, be aware that the Beam community is excited to help those building new I/O transforms, and that there are many examples and helper classes to draw on.
 
-## Example I/O Transforms
-TODO
 
 ## Suggested steps for implementers
 
-TODO
+1. Check out this guide and come up with your design. If you'd like, you can email the [Beam dev mailing list]({{ site.baseurl }}/get-started/support) with any questions you might have. It's good to check there to see if anyone else is working on the same I/O transform.
+2. If you are planning to contribute your I/O transform to the Beam community, you'll be going through the normal Beam contribution life cycle - see the [Apache Beam Contribution Guide]({{ site.baseurl }}/contribute/contribution-guide/) for more details.
+3. As you're working on your I/O transform, see the [PTransform Style Guide]({{ site.baseurl }}/contribute/ptransform-style-guide/) for specific information about writing I/O transforms.
+
 
 ## Read transforms
 
-TODO
+Read transforms take data from outside of the Beam pipeline and produce `PCollection`s of data.
+
+For data stores or file types where the data can be read in parallel, you can think of the process as a mini-pipeline. This often consists of two steps:
+1. Splitting the data into parts to be read in parallel
+2. Reading from each of those parts
+
+Each of those steps will be a `ParDo`, with a `GroupByKey` in between.
The `GroupByKey` is an implementation detail, but on most runners it allows the runner to use different numbers of workers for:
+* Determining how to split up the data to be read into chunks - this will likely occur on very few workers
+* Reading - this will likely benefit from more workers
+
+The `GroupByKey` will also allow Dynamic Work Rebalancing to occur (on supported runners).
+
+Here are some examples of read transform implementations that use the "reading as a mini-pipeline" model when data can be read in parallel:
+* **Reading from a file glob** - For example, reading all files in "~/data/**"
+  * Get File Paths `ParDo`: As input, take in a file glob. Produce a `PCollection` of strings, each of which is a file path.
+  * Reading `ParDo`: Given the `PCollection` of file paths, read each one, producing a `PCollection` of records.
+* **Reading from a NoSQL database** (e.g. Apache HBase) - these databases often allow reading from ranges in parallel.
+  * Determine Key Ranges `ParDo`: As input, receive connection information for the database and the key range to read from. Produce a `PCollection` of key ranges that can be read in parallel efficiently.
+  * Read Key Range `ParDo`: Given the `PCollection` of key ranges, read the key range, producing a `PCollection` of records.
+
+For data stores or files where reading cannot occur in parallel, reading is a simple task that can be accomplished with a single `ParDo`+`GroupByKey`. For example:
+* **Reading from a database query** - traditional SQL database queries often can only be read in sequence. The `ParDo` in this case would establish a connection to the database and read batches of records, producing a `PCollection` of those records.
+* **Reading from a gzip file** - a gzip file has to be read in order, so it cannot be parallelized. The `ParDo` in this case would open the file and read in sequence, producing a `PCollection` of records from the file.
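The "reading as a mini-pipeline" model described in the patch above can be sketched in plain Java. This is a conceptual illustration only, not the Beam API: the class and method names (`MiniPipelineReadSketch`, `splitKeyRange`, `readRange`) are hypothetical, and the loop here stands in for what would be two `ParDo`s separated by a `GroupByKey` in a real transform.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * A plain-Java sketch (not actual Beam API) of the two-step read model:
 * step 1 splits the input into independently readable parts, step 2
 * reads each part. In a real Beam transform each step would be a ParDo,
 * with a GroupByKey in between so the runner can scale the steps
 * independently and apply Dynamic Work Rebalancing where supported.
 */
public class MiniPipelineReadSketch {

  /** Step 1: split a hypothetical key range [start, end) into chunks. */
  static List<long[]> splitKeyRange(long start, long end, int numChunks) {
    List<long[]> ranges = new ArrayList<>();
    long chunkSize = Math.max(1, (end - start) / numChunks);
    for (long s = start; s < end; s += chunkSize) {
      ranges.add(new long[] {s, Math.min(s + chunkSize, end)});
    }
    return ranges;
  }

  /** Step 2: "read" one chunk; here we just materialize the keys. */
  static List<Long> readRange(long[] range) {
    List<Long> records = new ArrayList<>();
    for (long k = range[0]; k < range[1]; k++) {
      records.add(k);
    }
    return records;
  }

  /** The whole read: split, then read each part (in Beam, in parallel). */
  static List<Long> read(long start, long end, int numChunks) {
    List<Long> all = new ArrayList<>();
    for (long[] range : splitKeyRange(start, end, numChunks)) {
      all.addAll(readRange(range));
    }
    return all;
  }
}
```

The sequential case ("reading from a database query" or a gzip file) is the degenerate form of the same shape: a single range, read by one worker.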
+
+
+### When to implement using the `Source` API
+The above discussion is in terms of `ParDo`s - this is because `Source`s have proven to be tricky to implement. At this point in time, the recommendation is to **use `Source` only if `ParDo` doesn't meet your needs**. A class derived from `FileBasedSource` is often the best option when reading from files.
+
+If you're trying to decide whether or not to use `Source`, feel free to email the [Beam dev mailing list]({{ site.baseurl }}/get-started/support) and we can discuss the specific pros and cons of your case.
+
+In some cases implementing a `Source` may be necessary or result in better performance:
+* `ParDo`s will not work for reading from unbounded sources - they do not support checkpointing, and don't support mechanisms like de-duping that have proven useful for streaming data sources.
+* `ParDo`s cannot provide hints to runners about their progress or the size of the data they are reading - without a size estimate or read progress, the runner has no way to guess how large your read will be, so if it attempts to dynamically allocate workers, it has no clues as to how many workers your pipeline may need.
+* `ParDo`s do not support Dynamic Work Rebalancing - a feature used by some readers to improve the processing speed of jobs (though it may not be possible with your data source).
+* `ParDo`s do not receive 'desired_bundle_size' as a hint from runners when performing initial splitting.
+
+`SplittableDoFn` ([BEAM-65](https://issues.apache.org/jira/browse/BEAM-65)) will mitigate many of these concerns.
 
-### When to implement using the Source API
-TODO
 
 ## Write transforms
 
-TODO
+Write transforms are responsible for taking the contents of a `PCollection` and transferring that data outside of the Beam pipeline.
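The single-`ParDo` write described in this section can also be sketched in plain Java. Again, this is a conceptual illustration, not the Beam API: `BatchingWriter` and `FakeStore` are hypothetical names, standing in for a `DoFn` and an external data store client. The common pattern is to buffer records in `processElement` and flush any remainder in `finishBundle`.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * A plain-Java sketch (not actual Beam API) of the common write pattern:
 * a single ParDo-like step that buffers incoming records and writes them
 * to the external store in batches.
 */
public class BatchingWriterSketch {

  /** Stand-in for an external data store client. */
  static class FakeStore {
    final List<String> written = new ArrayList<>();
    void writeBatch(List<String> batch) {
      written.addAll(batch);
    }
  }

  static class BatchingWriter {
    private final FakeStore store;
    private final int batchSize;
    private final List<String> buffer = new ArrayList<>();

    BatchingWriter(FakeStore store, int batchSize) {
      this.store = store;
      this.batchSize = batchSize;
    }

    /** Analogous to DoFn.processElement: buffer, flush when full. */
    void processElement(String record) {
      buffer.add(record);
      if (buffer.size() >= batchSize) {
        flush();
      }
    }

    /** Analogous to DoFn.finishBundle: flush any remaining records. */
    void finishBundle() {
      flush();
    }

    private void flush() {
      if (!buffer.isEmpty()) {
        store.writeBatch(new ArrayList<>(buffer));
        buffer.clear();
      }
    }
  }
}
```

Batching is a design choice rather than a requirement; it simply amortizes per-request overhead against the external store.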
-### When to implement using the Sink API
-TODO
+Write transforms can usually be implemented using a single `ParDo` that writes the records received to the data store.
+
+TODO: this section needs further explanation.
+
+### When to implement using the `Sink` API
+You are strongly discouraged from using the `Sink` class unless you are creating a `FileBasedSink`. Most of the time, a simple `ParDo` is all that's necessary. If you think you have a case that is only possible using a `Sink`, please email the [Beam dev mailing list]({{ site.baseurl }}/get-started/support).
 
 # Next steps
 
+This guide is still in progress. There is an open issue to finish the guide: [BEAM-1025](https://issues.apache.org/jira/browse/BEAM-1025).
+
+<!-- TODO: commented out until this content is ready.
 For more details on actual implementation, continue with one of the language-specific guides:
 
 * [Authoring I/O Transforms - Python]({{site.baseurl }}/documentation/io/authoring-python/)
 * [Authoring I/O Transforms - Java]({{site.baseurl }}/documentation/io/authoring-java/)
+-->


http://git-wip-us.apache.org/repos/asf/beam-site/blob/8c9cda3c/src/documentation/io/io-toc.md
----------------------------------------------------------------------
diff --git a/src/documentation/io/io-toc.md b/src/documentation/io/io-toc.md
index ec6b244..811f70a 100644
--- a/src/documentation/io/io-toc.md
+++ b/src/documentation/io/io-toc.md
@@ -15,12 +15,11 @@ permalink: /documentation/io/io-toc/
 > Note: This guide is still in progress. There is an open issue to finish the
 > guide: [BEAM-1025](https://issues.apache.org/jira/browse/BEAM-1025).
 
-<!-- TODO: commented out until this content is ready.
-
-This series of articles will walk you through the process of creating a new I/O transform.
-
 * [Authoring I/O Transforms - Overview]({{site.baseurl }}/documentation/io/authoring-overview/)
+
+<!-- TODO: commented out until this content is ready.
 * [Authoring I/O Transforms - Python]({{site.baseurl }}/documentation/io/authoring-python/)
 * [Authoring I/O Transforms - Java]({{site.baseurl }}/documentation/io/authoring-java/)
 * [Testing I/O Transforms]({{site.baseurl }}/documentation/io/testing/)
-* [Contributing I/O Transforms]({{site.baseurl }}/documentation/io/contributing/)
--->
+* [Contributing I/O Transforms]({{site.baseurl }}/documentation/io/contributing/)
+-->