Regenerate website
Project: http://git-wip-us.apache.org/repos/asf/beam-site/repo Commit: http://git-wip-us.apache.org/repos/asf/beam-site/commit/6bded068 Tree: http://git-wip-us.apache.org/repos/asf/beam-site/tree/6bded068 Diff: http://git-wip-us.apache.org/repos/asf/beam-site/diff/6bded068 Branch: refs/heads/asf-site Commit: 6bded068a06eb7b17cf881d0871cb30f2e986084 Parents: 8c9cda3 Author: Ahmet Altay <al...@google.com> Authored: Thu Apr 6 16:11:57 2017 -0700 Committer: Ahmet Altay <al...@google.com> Committed: Thu Apr 6 16:11:57 2017 -0700 ---------------------------------------------------------------------- .../documentation/io/authoring-java/index.html | 9 ++ .../io/authoring-overview/index.html | 97 +++++++++++++++----- content/documentation/io/io-toc/index.html | 11 ++- 3 files changed, 90 insertions(+), 27 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/beam-site/blob/6bded068/content/documentation/io/authoring-java/index.html ---------------------------------------------------------------------- diff --git a/content/documentation/io/authoring-java/index.html b/content/documentation/io/authoring-java/index.html index 5128d93..7f2a308 100644 --- a/content/documentation/io/authoring-java/index.html +++ b/content/documentation/io/authoring-java/index.html @@ -159,6 +159,15 @@ <p>Note: This guide is still in progress. There is an open issue to finish the guide: <a href="https://issues.apache.org/jira/browse/BEAM-1025">BEAM-1025</a>.</p> </blockquote> +<h2 id="example-io-transforms">Example I/O Transforms</h2> +<p>Currently, Apache Beam's I/O transforms use a variety of different +styles.
These transforms are good examples to follow:</p> +<ul> + <li><a href="https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/datastore/DatastoreIO.java"><code class="highlighter-rouge">DatastoreIO</code></a> - <code class="highlighter-rouge">ParDo</code>-based database read and write that conforms to the PTransform style guide</li> + <li><a href="https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigtable/BigtableIO.java"><code class="highlighter-rouge">BigtableIO</code></a> - Good test examples, and demonstrates Dynamic Work Rebalancing</li> + <li><a href="https://github.com/apache/beam/blob/master/sdks/java/io/jdbc/src/main/java/org/apache/beam/sdk/io/jdbc/JdbcIO.java"><code class="highlighter-rouge">JdbcIO</code></a> - Demonstrates reading using a single <code class="highlighter-rouge">ParDo</code>+<code class="highlighter-rouge">GroupByKey</code> when data stores cannot be read in parallel</li> +</ul> + <h1 id="next-steps">Next steps</h1> <p><a href="/documentation/io/testing/">Testing I/O Transforms</a></p> http://git-wip-us.apache.org/repos/asf/beam-site/blob/6bded068/content/documentation/io/authoring-overview/index.html ---------------------------------------------------------------------- diff --git a/content/documentation/io/authoring-overview/index.html b/content/documentation/io/authoring-overview/index.html index 73fffa2..5e36676 100644 --- a/content/documentation/io/authoring-overview/index.html +++ b/content/documentation/io/authoring-overview/index.html @@ -157,54 +157,107 @@ <p><em>A guide for users who need to connect to a data store that isn't supported by the <a href="/documentation/io/built-in/">Built-in I/O Transforms</a></em></p> -<blockquote> - <p>Note: This guide is still in progress.
There is an open issue to finish the guide: <a href="https://issues.apache.org/jira/browse/BEAM-1025">BEAM-1025</a>.</p> -</blockquote> - <ul id="markdown-toc"> <li><a href="#introduction" id="markdown-toc-introduction">Introduction</a></li> - <li><a href="#example-io-transforms" id="markdown-toc-example-io-transforms">Example I/O Transforms</a></li> <li><a href="#suggested-steps-for-implementers" id="markdown-toc-suggested-steps-for-implementers">Suggested steps for implementers</a></li> <li><a href="#read-transforms" id="markdown-toc-read-transforms">Read transforms</a> <ul> - <li><a href="#when-to-implement-using-the-source-api" id="markdown-toc-when-to-implement-using-the-source-api">When to implement using the Source API</a></li> + <li><a href="#when-to-implement-using-the-source-api" id="markdown-toc-when-to-implement-using-the-source-api">When to implement using the <code class="highlighter-rouge">Source</code> API</a></li> </ul> </li> <li><a href="#write-transforms" id="markdown-toc-write-transforms">Write transforms</a> <ul> - <li><a href="#when-to-implement-using-the-sink-api" id="markdown-toc-when-to-implement-using-the-sink-api">When to implement using the Sink API</a></li> + <li><a href="#when-to-implement-using-the-sink-api" id="markdown-toc-when-to-implement-using-the-sink-api">When to implement using the <code class="highlighter-rouge">Sink</code> API</a></li> </ul> </li> </ul> <h2 id="introduction">Introduction</h2> -<p>TODO</p> +<p>This guide covers how to implement I/O transforms in the Beam model. Beam pipelines use these read and write transforms to import data for processing, and write data to a store.</p> + +<p>Reading and writing data in Beam is a parallel task, and using <code class="highlighter-rouge">ParDo</code>s, <code class="highlighter-rouge">GroupByKey</code>s, etc. is usually sufficient.
Rarely, you will need the more specialized <code class="highlighter-rouge">Source</code> and <code class="highlighter-rouge">Sink</code> classes for specific features. There are changes coming soon (<code class="highlighter-rouge">SplittableDoFn</code>, <a href="https://issues.apache.org/jira/browse/BEAM-65">BEAM-65</a>) that will make <code class="highlighter-rouge">Source</code> unnecessary.</p> -<h2 id="example-io-transforms">Example I/O Transforms</h2> -<p>TODO</p> +<p>As you work on your I/O Transform, be aware that the Beam community is excited to help those building new I/O Transforms and that there are many examples and helper classes.</p> <h2 id="suggested-steps-for-implementers">Suggested steps for implementers</h2> -<p>TODO</p> +<ol> + <li>Check out this guide and come up with your design. If you'd like, you can email the <a href="/get-started/support">Beam dev mailing list</a> with any questions you might have. It's good to check there to see if anyone else is working on the same I/O Transform.</li> + <li>If you are planning to contribute your I/O transform to the Beam community, you'll be going through the normal Beam contribution life cycle - see the <a href="/contribute/contribution-guide/">Apache Beam Contribution Guide</a> for more details.</li> + <li>As you're working on your I/O transform, see the <a href="/contribute/ptransform-style-guide/">PTransform Style Guide</a> for specific information about writing I/O Transforms.</li> +</ol> <h2 id="read-transforms">Read transforms</h2> -<p>TODO</p> +<p>Read transforms take data from outside of the Beam pipeline and produce <code class="highlighter-rouge">PCollection</code>s of data.</p> -<h3 id="when-to-implement-using-the-source-api">When to implement using the Source API</h3> -<p>TODO</p> +<p>For data stores or file types where the data can be read in parallel, you can think of the process as a mini-pipeline.
This often consists of two steps:</p> +<ol> + <li>Splitting the data into parts to be read in parallel</li> + <li>Reading from each of those parts</li> +</ol> -<h2 id="write-transforms">Write transforms</h2> -<p>TODO</p> +<p>Each of those steps will be a <code class="highlighter-rouge">ParDo</code>, with a <code class="highlighter-rouge">GroupByKey</code> in between. The <code class="highlighter-rouge">GroupByKey</code> is an implementation detail, but it allows most runners to use different numbers of workers for:</p> +<ul> + <li>Determining how to split up the data to be read into chunks - this will likely occur on very few workers</li> + <li>Reading - will likely benefit from more workers</li> +</ul> -<h3 id="when-to-implement-using-the-sink-api">When to implement using the Sink API</h3> -<p>TODO</p> +<p>The <code class="highlighter-rouge">GroupByKey</code> will also allow Dynamic Work Rebalancing to occur (on supported runners).</p> -<h1 id="next-steps">Next steps</h1> +<p>Here are some examples of read transform implementations that use the “reading as a mini-pipeline” model when data can be read in parallel:</p> +<ul> + <li><strong>Reading from a file glob</strong> - For example, reading all files in “~/data/**” + <ul> + <li>Get File Paths <code class="highlighter-rouge">ParDo</code>: As input, take in a file glob. Produce a <code class="highlighter-rouge">PCollection</code> of strings, each of which is a file path.</li> + <li>Reading <code class="highlighter-rouge">ParDo</code>: Given the <code class="highlighter-rouge">PCollection</code> of file paths, read each one, producing a <code class="highlighter-rouge">PCollection</code> of records.</li> + </ul> + </li> + <li><strong>Reading from a NoSQL Database</strong> (e.g., Apache HBase) - these databases often allow reading from ranges in parallel.
+ <ul> + <li>Determine Key Ranges <code class="highlighter-rouge">ParDo</code>: As input, receive connection information for the database and the key range to read from. Produce a <code class="highlighter-rouge">PCollection</code> of key ranges that can be read in parallel efficiently.</li> + <li>Read Key Range <code class="highlighter-rouge">ParDo</code>: Given the <code class="highlighter-rouge">PCollection</code> of key ranges, read the key range, producing a <code class="highlighter-rouge">PCollection</code> of records.</li> + </ul> + </li> +</ul> + +<p>For data stores or files where reading cannot occur in parallel, reading is a simple task that can be accomplished with a single <code class="highlighter-rouge">ParDo</code>+<code class="highlighter-rouge">GroupByKey</code>. For example:</p> +<ul> + <li><strong>Reading from a database query</strong> - traditional SQL database queries often can only be read in sequence. The <code class="highlighter-rouge">ParDo</code> in this case would establish a connection to the database and read batches of records, producing a <code class="highlighter-rouge">PCollection</code> of those records.</li> + <li><strong>Reading from a gzip file</strong> - a gzip file has to be read in order, so it cannot be parallelized. The <code class="highlighter-rouge">ParDo</code> in this case would open the file and read in sequence, producing a <code class="highlighter-rouge">PCollection</code> of records from the file.</li> +</ul> -<p>For more details on actual implementation, continue with one of the language-specific guides:</p> +<h3 id="when-to-implement-using-the-source-api">When to implement using the <code class="highlighter-rouge">Source</code> API</h3> +<p>The above discussion is in terms of <code class="highlighter-rouge">ParDo</code>s - this is because <code class="highlighter-rouge">Source</code>s have proven to be tricky to implement.
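The "reading as a mini-pipeline" pattern described above (a splitting `ParDo`, a `GroupByKey`, then a reading `ParDo`) can be illustrated with a plain-Python simulation. This is not the Beam API; the toy key-value store and the helper names are hypothetical, and the `GroupByKey` step is represented only by the handoff between the two functions:

```python
# Plain-Python sketch of the "reading as a mini-pipeline" pattern.
# NOT the Beam API; the store and helper names below are hypothetical.

def split_into_ranges(key_range, num_splits):
    """'Determine Key Ranges ParDo': split one key range into parts
    that could be read in parallel."""
    start, end = key_range
    step = (end - start) // num_splits
    splits = [(start + i * step, start + (i + 1) * step) for i in range(num_splits)]
    splits[-1] = (splits[-1][0], end)  # last split absorbs any remainder
    return splits

def read_key_range(store, key_range):
    """'Read Key Range ParDo': read every record whose key falls in the range."""
    start, end = key_range
    return [record for key, record in store.items() if start <= key < end]

# A toy "database": key -> record.
store = {k: "record-%d" % k for k in range(10)}

# Step 1 (ParDo): splitting - in Beam this would likely run on very few workers.
ranges = split_into_ranges((0, 10), num_splits=3)

# (A GroupByKey would sit here, letting the runner fan the ranges out
# to a different, larger set of workers.)

# Step 2 (ParDo): reading - each range could be read by a separate worker.
records = [r for key_range in ranges for r in read_key_range(store, key_range)]
```

The point of the sketch is the shape, not the code: because the split and the read are separate steps with a shuffle between them, a runner is free to parallelize them differently.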
At this point in time, the recommendation is to <strong>use <code class="highlighter-rouge">Source</code> only if <code class="highlighter-rouge">ParDo</code> doesn't meet your needs</strong>. A class derived from <code class="highlighter-rouge">FileBasedSource</code> is often the best option when reading from files.</p> +<p>If you're trying to decide whether to use <code class="highlighter-rouge">Source</code>, feel free to email the <a href="/get-started/support">Beam dev mailing list</a> and we can discuss the specific pros and cons of your case.</p> + +<p>In some cases implementing a <code class="highlighter-rouge">Source</code> may be necessary or result in better performance.</p> <ul> - <li><a href="/documentation/io/authoring-python/">Authoring I/O Transforms - Python</a></li> - <li><a href="/documentation/io/authoring-java/">Authoring I/O Transforms - Java</a></li> + <li><code class="highlighter-rouge">ParDo</code>s will not work for reading from unbounded sources - they do not support checkpointing and don't support mechanisms like de-duping that have proven useful for streaming data sources.</li> + <li><code class="highlighter-rouge">ParDo</code>s cannot provide hints to runners about their progress or the size of data they are reading - without size estimation of the data or progress on your read, the runner doesn't have any way to guess how large your read will be, and thus if it attempts to dynamically allocate workers, it does not have any clues as to how many workers you may need for your pipeline.</li> + <li><code class="highlighter-rouge">ParDo</code>s do not support Dynamic Work Rebalancing - these are features used by some readers to improve the processing speed of jobs (but may not be possible with your data source).</li> + <li><code class="highlighter-rouge">ParDo</code>s do not receive “desired_bundle_size” as a hint from runners when performing initial splitting.
+<code class="highlighter-rouge">SplittableDoFn</code> (<a href="https://issues.apache.org/jira/browse/BEAM-65">BEAM-65</a>) will mitigate many of these concerns.</li> </ul> <h2 id="write-transforms">Write transforms</h2> +<p>Write transforms are responsible for taking the contents of a <code class="highlighter-rouge">PCollection</code> and transferring that data outside of the Beam pipeline.</p> + +<p>Write transforms can usually be implemented using a single <code class="highlighter-rouge">ParDo</code> that writes the records received to the data store.</p> + +<p>TODO: this section needs further explanation.</p> + +<h3 id="when-to-implement-using-the-sink-api">When to implement using the <code class="highlighter-rouge">Sink</code> API</h3> +<p>You are strongly discouraged from using the <code class="highlighter-rouge">Sink</code> class unless you are creating a <code class="highlighter-rouge">FileBasedSink</code>. Most of the time, a simple <code class="highlighter-rouge">ParDo</code> is all that's necessary. If you think you have a case that is only possible using a <code class="highlighter-rouge">Sink</code>, please email the <a href="/get-started/support">Beam dev mailing list</a>.</p> + +<h1 id="next-steps">Next steps</h1> + +<p>This guide is still in progress. There is an open issue to finish the guide: <a href="https://issues.apache.org/jira/browse/BEAM-1025">BEAM-1025</a>.</p> + +<!-- TODO: commented out until this content is ready.
For more details on actual implementation, continue with one of the language-specific guides: + +* [Authoring I/O Transforms - Python](/documentation/io/authoring-python/) +* [Authoring I/O Transforms - Java](/documentation/io/authoring-java/) +--> + </div> http://git-wip-us.apache.org/repos/asf/beam-site/blob/6bded068/content/documentation/io/io-toc/index.html ---------------------------------------------------------------------- diff --git a/content/documentation/io/io-toc/index.html b/content/documentation/io/io-toc/index.html index 3e35922..a8add8e 100644 --- a/content/documentation/io/io-toc/index.html +++ b/content/documentation/io/io-toc/index.html @@ -165,15 +165,16 @@ <p>Note: This guide is still in progress. There is an open issue to finish the guide: <a href="https://issues.apache.org/jira/browse/BEAM-1025">BEAM-1025</a>.</p> </blockquote> -<!-- TODO: commented out until this content is ready. - -This series of articles will walk you through the process of creating a new I/O transform. +<ul> + <li><a href="/documentation/io/authoring-overview/">Authoring I/O Transforms - Overview</a></li> +</ul> -* [Authoring I/O Transforms - Overview](/documentation/io/authoring-overview/) +<!-- TODO: commented out until this content is ready. * [Authoring I/O Transforms - Python](/documentation/io/authoring-python/) * [Authoring I/O Transforms - Java](/documentation/io/authoring-java/) * [Testing I/O Transforms](/documentation/io/testing/) -* [Contributing I/O Transforms](/documentation/io/contributing/) --> +* [Contributing I/O Transforms](/documentation/io/contributing/) +--> </div>