Regenerate website

Project: http://git-wip-us.apache.org/repos/asf/beam-site/repo
Commit: http://git-wip-us.apache.org/repos/asf/beam-site/commit/6bded068
Tree: http://git-wip-us.apache.org/repos/asf/beam-site/tree/6bded068
Diff: http://git-wip-us.apache.org/repos/asf/beam-site/diff/6bded068

Branch: refs/heads/asf-site
Commit: 6bded068a06eb7b17cf881d0871cb30f2e986084
Parents: 8c9cda3
Author: Ahmet Altay <al...@google.com>
Authored: Thu Apr 6 16:11:57 2017 -0700
Committer: Ahmet Altay <al...@google.com>
Committed: Thu Apr 6 16:11:57 2017 -0700

----------------------------------------------------------------------
 .../documentation/io/authoring-java/index.html  |  9 ++
 .../io/authoring-overview/index.html            | 97 +++++++++++++++-----
 content/documentation/io/io-toc/index.html      | 11 ++-
 3 files changed, 90 insertions(+), 27 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/beam-site/blob/6bded068/content/documentation/io/authoring-java/index.html
----------------------------------------------------------------------
diff --git a/content/documentation/io/authoring-java/index.html 
b/content/documentation/io/authoring-java/index.html
index 5128d93..7f2a308 100644
--- a/content/documentation/io/authoring-java/index.html
+++ b/content/documentation/io/authoring-java/index.html
@@ -159,6 +159,15 @@
   <p>Note: This guide is still in progress. There is an open issue to finish 
the guide: <a 
href="https://issues.apache.org/jira/browse/BEAM-1025">BEAM-1025</a>.</p>
 </blockquote>
 
+<h2 id="example-io-transforms">Example I/O Transforms</h2>
+<p>Currently, Apache Beam’s I/O transforms use a variety of
+styles. The following transforms are good examples to follow:</p>
+<ul>
+  <li><a 
href="https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/datastore/DatastoreIO.java"><code
 class="highlighter-rouge">DatastoreIO</code></a> - <code 
class="highlighter-rouge">ParDo</code>-based database read and write that 
conforms to the PTransform style guide</li>
+  <li><a 
href="https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigtable/BigtableIO.java"><code
 class="highlighter-rouge">BigtableIO</code></a> - Good test examples and a 
demonstration of Dynamic Work Rebalancing</li>
+  <li><a 
href="https://github.com/apache/beam/blob/master/sdks/java/io/jdbc/src/main/java/org/apache/beam/sdk/io/jdbc/JdbcIO.java"><code
 class="highlighter-rouge">JdbcIO</code></a> - Demonstrates reading using a 
single <code class="highlighter-rouge">ParDo</code>+<code 
class="highlighter-rouge">GroupByKey</code> when data stores cannot be read in 
parallel</li>
+</ul>
+
 <h1 id="next-steps">Next steps</h1>
 
 <p><a href="/documentation/io/testing/">Testing I/O Transforms</a></p>

http://git-wip-us.apache.org/repos/asf/beam-site/blob/6bded068/content/documentation/io/authoring-overview/index.html
----------------------------------------------------------------------
diff --git a/content/documentation/io/authoring-overview/index.html 
b/content/documentation/io/authoring-overview/index.html
index 73fffa2..5e36676 100644
--- a/content/documentation/io/authoring-overview/index.html
+++ b/content/documentation/io/authoring-overview/index.html
@@ -157,54 +157,107 @@
 
 <p><em>A guide for users who need to connect to a data store that isn’t 
supported by the <a href="/documentation/io/built-in/">Built-in I/O 
Transforms</a></em></p>
 
-<blockquote>
-  <p>Note: This guide is still in progress. There is an open issue to finish 
the guide: <a 
href="https://issues.apache.org/jira/browse/BEAM-1025">BEAM-1025</a>.</p>
-</blockquote>
-
 <ul id="markdown-toc">
   <li><a href="#introduction" 
id="markdown-toc-introduction">Introduction</a></li>
-  <li><a href="#example-io-transforms" 
id="markdown-toc-example-io-transforms">Example I/O Transforms</a></li>
   <li><a href="#suggested-steps-for-implementers" 
id="markdown-toc-suggested-steps-for-implementers">Suggested steps for 
implementers</a></li>
   <li><a href="#read-transforms" id="markdown-toc-read-transforms">Read 
transforms</a>    <ul>
-      <li><a href="#when-to-implement-using-the-source-api" 
id="markdown-toc-when-to-implement-using-the-source-api">When to implement 
using the Source API</a></li>
+      <li><a href="#when-to-implement-using-the-source-api" 
id="markdown-toc-when-to-implement-using-the-source-api">When to implement 
using the <code class="highlighter-rouge">Source</code> API</a></li>
     </ul>
   </li>
   <li><a href="#write-transforms" id="markdown-toc-write-transforms">Write 
transforms</a>    <ul>
-      <li><a href="#when-to-implement-using-the-sink-api" 
id="markdown-toc-when-to-implement-using-the-sink-api">When to implement using 
the Sink API</a></li>
+      <li><a href="#when-to-implement-using-the-sink-api" 
id="markdown-toc-when-to-implement-using-the-sink-api">When to implement using 
the <code class="highlighter-rouge">Sink</code> API</a></li>
     </ul>
   </li>
 </ul>
 
 <h2 id="introduction">Introduction</h2>
-<p>TODO</p>
+<p>This guide covers how to implement I/O transforms in the Beam model. Beam 
pipelines use these read and write transforms to import data for processing, 
and write data to a store.</p>
+
+<p>Reading and writing data in Beam is a parallel task, and using <code 
class="highlighter-rouge">ParDo</code>s, <code 
class="highlighter-rouge">GroupByKey</code>s, etc. is usually sufficient. 
Rarely, you will need the more specialized <code 
class="highlighter-rouge">Source</code> and <code 
class="highlighter-rouge">Sink</code> classes for specific features. There are 
changes coming soon (<code class="highlighter-rouge">SplittableDoFn</code>, <a 
href="https://issues.apache.org/jira/browse/BEAM-65">BEAM-65</a>) that will 
make <code class="highlighter-rouge">Source</code> unnecessary.</p>
 
-<h2 id="example-io-transforms">Example I/O Transforms</h2>
-<p>TODO</p>
+<p>As you work on your I/O Transform, be aware that the Beam community is 
excited to help those building new I/O Transforms and that there are many 
examples and helper classes.</p>
 
 <h2 id="suggested-steps-for-implementers">Suggested steps for implementers</h2>
-<p>TODO</p>
+<ol>
+  <li>Check out this guide and come up with your design. If you’d like, you 
can email the <a href="/get-started/support">Beam dev mailing list</a> with any 
questions you might have. It’s good to check there to see if anyone else is 
working on the same I/O Transform.</li>
+  <li>If you are planning to contribute your I/O transform to the Beam 
community, you’ll be going through the normal Beam contribution life cycle - 
see the <a href="/contribute/contribution-guide/">Apache Beam Contribution 
Guide</a> for more details.</li>
+  <li>As you’re working on your I/O transform, see the <a 
href="/contribute/ptransform-style-guide/">PTransform Style Guide</a> for 
specific information about writing I/O Transforms.</li>
+</ol>
 
 <h2 id="read-transforms">Read transforms</h2>
-<p>TODO</p>
+<p>Read transforms take data from outside of the Beam pipeline and produce 
<code class="highlighter-rouge">PCollection</code>s of data.</p>
 
-<h3 id="when-to-implement-using-the-source-api">When to implement using the 
Source API</h3>
-<p>TODO</p>
+<p>For data stores or file types where the data can be read in parallel, you 
can think of the process as a mini-pipeline. This often consists of two 
steps:</p>
+<ol>
+  <li>Splitting the data into parts to be read in parallel</li>
+  <li>Reading from each of those parts</li>
+</ol>
 
-<h2 id="write-transforms">Write transforms</h2>
-<p>TODO</p>
+<p>Each of those steps will be a <code class="highlighter-rouge">ParDo</code>, 
with a <code class="highlighter-rouge">GroupByKey</code> in between. The <code 
class="highlighter-rouge">GroupByKey</code> is an implementation detail, but 
for most runners it allows the runner to use a different number of workers 
for each step:</p>
+<ul>
+  <li>Determining how to split up the data to be read into chunks - this will 
likely occur on very few workers</li>
+  <li>Reading - will likely benefit from more workers</li>
+</ul>
 
-<h3 id="when-to-implement-using-the-sink-api">When to implement using the Sink 
API</h3>
-<p>TODO</p>
+<p>The <code class="highlighter-rouge">GroupByKey</code> will also allow 
Dynamic Work Rebalancing to occur (on supported runners).</p>
 
-<h1 id="next-steps">Next steps</h1>
+<p>Here are some examples of read transform implementations that use the 
“reading as a mini-pipeline” model when data can be read in parallel:</p>
+<ul>
+  <li><strong>Reading from a file glob</strong> - for example, reading all 
files in “~/data/**”
+    <ul>
+      <li>Get File Paths <code class="highlighter-rouge">ParDo</code>: As 
input, take in a file glob. Produce a <code 
class="highlighter-rouge">PCollection</code> of strings, each of which is a 
file path.</li>
+      <li>Reading <code class="highlighter-rouge">ParDo</code>: Given the 
<code class="highlighter-rouge">PCollection</code> of file paths, read each 
one, producing a <code class="highlighter-rouge">PCollection</code> of 
records.</li>
+    </ul>
+  </li>
+  <li><strong>Reading from a NoSQL database</strong> (e.g. Apache HBase) - 
these databases often allow reading from key ranges in parallel.
+    <ul>
+      <li>Determine Key Ranges <code class="highlighter-rouge">ParDo</code>: 
As input, receive connection information for the database and the key range to 
read from. Produce a <code class="highlighter-rouge">PCollection</code> of key 
ranges that can be read in parallel efficiently.</li>
+      <li>Read Key Range <code class="highlighter-rouge">ParDo</code>: Given 
the <code class="highlighter-rouge">PCollection</code> of key ranges, read the 
key range, producing a <code class="highlighter-rouge">PCollection</code> of 
records.</li>
+    </ul>
+  </li>
+</ul>
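As an illustration of the first step in the NoSQL example above - determining key ranges that can be read in parallel - here is a plain-Python sketch. This is not the Beam API; the function name and the even-splitting scheme are hypothetical, and a real connector would split based on the store's actual key distribution.

```python
# Hypothetical "Determine Key Ranges" step: split an integer key range into
# roughly equal half-open sub-ranges that could be read in parallel.
# Plain Python for illustration only - not the Beam API.

def split_key_range(start, end, num_splits):
    """Return a list of (lo, hi) half-open sub-ranges covering [start, end)."""
    if num_splits < 1 or start >= end:
        return []
    size = end - start
    num_splits = min(num_splits, size)  # never emit empty sub-ranges
    step, remainder = divmod(size, num_splits)
    ranges = []
    lo = start
    for i in range(num_splits):
        # Spread the remainder across the first few sub-ranges.
        hi = lo + step + (1 if i < remainder else 0)
        ranges.append((lo, hi))
        lo = hi
    return ranges

print(split_key_range(0, 10, 3))  # [(0, 4), (4, 7), (7, 10)]
```

In the mini-pipeline model, each emitted sub-range would become one element of the intermediate `PCollection`, so the subsequent read step can fan out across workers.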
+
+<p>For data stores or files where reading cannot occur in parallel, reading is 
a simple task that can be accomplished with a single <code 
class="highlighter-rouge">ParDo</code>+<code 
class="highlighter-rouge">GroupByKey</code>. For example:</p>
+<ul>
+  <li><strong>Reading from a database query</strong> - the results of 
traditional SQL database queries can often only be read in sequence. The <code 
class="highlighter-rouge">ParDo</code> in this case would establish a 
connection to the database and read batches of records, producing a <code 
class="highlighter-rouge">PCollection</code> of those records.</li>
+  <li><strong>Reading from a gzip file</strong> - a gzip file has to be read 
in order, so it cannot be parallelized. The <code 
class="highlighter-rouge">ParDo</code> in this case would open the file and 
read in sequence, producing a <code 
class="highlighter-rouge">PCollection</code> of records from the file.</li>
+</ul>
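As a sketch of the sequential case, the gzip example might read records like this. This is plain Python for illustration only; in Beam, equivalent logic would live inside the read `ParDo`, and the function name is hypothetical.

```python
# Hypothetical sequential read: a gzip stream must be decompressed in order,
# so a single worker reads it start to finish, producing one record per line.
import gzip
import io

def read_gzip_records(data: bytes):
    """Decode a gzipped byte string into a list of newline-delimited records."""
    with gzip.GzipFile(fileobj=io.BytesIO(data)) as f:
        return [line.rstrip(b"\n").decode("utf-8") for line in f]

compressed = gzip.compress(b"alpha\nbeta\ngamma\n")
print(read_gzip_records(compressed))  # ['alpha', 'beta', 'gamma']
```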
 
-<p>For more details on actual implementation, continue with one of the 
language-specific guides:</p>
+<h3 id="when-to-implement-using-the-source-api">When to implement using the 
<code class="highlighter-rouge">Source</code> API</h3>
+<p>The above discussion is in terms of <code 
class="highlighter-rouge">ParDo</code>s - this is because <code 
class="highlighter-rouge">Source</code>s have proven to be tricky to implement. 
At this point in time, the recommendation is to <strong>use <code 
class="highlighter-rouge">Source</code> only if <code 
class="highlighter-rouge">ParDo</code> doesn’t meet your needs</strong>. A 
class derived from <code class="highlighter-rouge">FileBasedSource</code> is 
often the best option when reading from files.</p>
 
+<p>If you’re trying to decide whether to use <code 
class="highlighter-rouge">Source</code>, feel free to email the <a 
href="/get-started/support">Beam dev mailing list</a> and we can discuss the 
specific pros and cons of your case.</p>
+
+<p>In some cases, implementing a <code class="highlighter-rouge">Source</code> 
may be necessary or may result in better performance:</p>
 <ul>
-  <li><a href="/documentation/io/authoring-python/">Authoring I/O Transforms - 
Python</a></li>
-  <li><a href="/documentation/io/authoring-java/">Authoring I/O Transforms - 
Java</a></li>
+  <li><code class="highlighter-rouge">ParDo</code>s will not work for reading 
from unbounded sources - they do not support checkpointing or mechanisms 
like de-duping that have proven useful for streaming data sources.</li>
+  <li><code class="highlighter-rouge">ParDo</code>s cannot provide hints to 
runners about their progress or the size of the data they are reading. Without 
size estimation or read progress, the runner has no way to guess how large 
your read will be, so if it attempts to dynamically allocate workers, it 
cannot tell how many workers your pipeline may need.</li>
+  <li><code class="highlighter-rouge">ParDo</code>s do not support Dynamic 
Work Rebalancing - a feature some readers use to improve the processing speed 
of jobs (though it may not be possible with your data source).</li>
+  <li><code class="highlighter-rouge">ParDo</code>s do not receive <code 
class="highlighter-rouge">desired_bundle_size</code> as a hint from runners 
when performing initial splitting.
+<code class="highlighter-rouge">SplittableDoFn</code> (<a 
href="https://issues.apache.org/jira/browse/BEAM-65">BEAM-65</a>) will mitigate 
many of these concerns.</li>
 </ul>
 
+<h2 id="write-transforms">Write transforms</h2>
+<p>Write transforms are responsible for taking the contents of a <code 
class="highlighter-rouge">PCollection</code> and transferring that data outside 
of the Beam pipeline.</p>
+
+<p>Write transforms can usually be implemented using a single <code 
class="highlighter-rouge">ParDo</code> that writes the records received to the 
data store.</p>
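As a sketch of such a write `ParDo`, the following plain-Python class buffers records and flushes them in batches. The class name and batch size are hypothetical, and a real `DoFn` would issue a write to the data store in `flush()`; here the flushed batches are collected in memory so the behavior is visible.

```python
# Illustrative batching writer - not the Beam API. Buffers incoming records
# and "writes" them in batches, the way a write ParDo typically would.

class BatchingWriter:
    def __init__(self, batch_size=3):
        self.batch_size = batch_size
        self.buffer = []
        self.flushed = []  # stands in for the external data store

    def process(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # A real DoFn would issue a database/API write here.
        if self.buffer:
            self.flushed.append(list(self.buffer))
            self.buffer.clear()

writer = BatchingWriter(batch_size=3)
for r in ["a", "b", "c", "d"]:
    writer.process(r)
writer.flush()  # a real DoFn would flush any remainder when the bundle ends
print(writer.flushed)  # [['a', 'b', 'c'], ['d']]
```

Batching amortizes per-request overhead against the external store; the final flush matters because the last batch is usually smaller than `batch_size`.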
+
+<p>TODO: this section needs further explanation.</p>
+
+<h3 id="when-to-implement-using-the-sink-api">When to implement using the 
<code class="highlighter-rouge">Sink</code> API</h3>
+<p>You are strongly discouraged from using the <code 
class="highlighter-rouge">Sink</code> class unless you are creating a <code 
class="highlighter-rouge">FileBasedSink</code>. Most of the time, a simple 
<code class="highlighter-rouge">ParDo</code> is all that’s necessary. If you 
think you have a case that is only possible using a <code 
class="highlighter-rouge">Sink</code>, please email the <a 
href="/get-started/support">Beam dev mailing list</a>.</p>
+
+<h1 id="next-steps">Next steps</h1>
+
+<p>This guide is still in progress. There is an open issue to finish the 
guide: <a 
href="https://issues.apache.org/jira/browse/BEAM-1025">BEAM-1025</a>.</p>
+
+<!-- TODO: commented out until this content is ready.
+For more details on actual implementation, continue with one of the 
language-specific guides:
+
+* [Authoring I/O Transforms - Python](/documentation/io/authoring-python/)
+* [Authoring I/O Transforms - Java](/documentation/io/authoring-java/)
+-->
+
       </div>
 
 

http://git-wip-us.apache.org/repos/asf/beam-site/blob/6bded068/content/documentation/io/io-toc/index.html
----------------------------------------------------------------------
diff --git a/content/documentation/io/io-toc/index.html 
b/content/documentation/io/io-toc/index.html
index 3e35922..a8add8e 100644
--- a/content/documentation/io/io-toc/index.html
+++ b/content/documentation/io/io-toc/index.html
@@ -165,15 +165,16 @@
   <p>Note: This guide is still in progress. There is an open issue to finish 
the guide: <a 
href="https://issues.apache.org/jira/browse/BEAM-1025">BEAM-1025</a>.</p>
 </blockquote>
 
-<!-- TODO: commented out until this content is ready.
-
-This series of articles will walk you through the process of creating a new 
I/O transform. 
+<ul>
+  <li><a href="/documentation/io/authoring-overview/">Authoring I/O Transforms 
- Overview</a></li>
+</ul>
 
-* [Authoring I/O Transforms - Overview](/documentation/io/authoring-overview/)
+<!-- TODO: commented out until this content is ready.
 * [Authoring I/O Transforms - Python](/documentation/io/authoring-python/)
 * [Authoring I/O Transforms - Java](/documentation/io/authoring-java/)
 * [Testing I/O Transforms](/documentation/io/testing/)
-* [Contributing I/O Transforms](/documentation/io/contributing/) -->
+* [Contributing I/O Transforms](/documentation/io/contributing/)
+-->
 
       </div>
 
