Regenerate website
Project: http://git-wip-us.apache.org/repos/asf/beam-site/repo Commit: http://git-wip-us.apache.org/repos/asf/beam-site/commit/394bfe70 Tree: http://git-wip-us.apache.org/repos/asf/beam-site/tree/394bfe70 Diff: http://git-wip-us.apache.org/repos/asf/beam-site/diff/394bfe70 Branch: refs/heads/asf-site Commit: 394bfe70319d92ad68d1c13a40db936445e0bd99 Parents: f23d9cb Author: Davor Bonaci <[email protected]> Authored: Tue Apr 18 15:45:02 2017 -0700 Committer: Davor Bonaci <[email protected]> Committed: Tue Apr 18 15:45:02 2017 -0700 ---------------------------------------------------------------------- .../documentation/runners/dataflow/index.html | 79 ++++++++++++++++---- content/documentation/runners/direct/index.html | 30 ++++++-- 2 files changed, 90 insertions(+), 19 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/beam-site/blob/394bfe70/content/documentation/runners/dataflow/index.html ---------------------------------------------------------------------- diff --git a/content/documentation/runners/dataflow/index.html b/content/documentation/runners/dataflow/index.html index 2f3d9b0..4dda742 100644 --- a/content/documentation/runners/dataflow/index.html +++ b/content/documentation/runners/dataflow/index.html @@ -153,6 +153,14 @@ <div class="row"> <h1 id="using-the-google-cloud-dataflow-runner">Using the Google Cloud Dataflow Runner</h1> +<nav class="language-switcher"> + <strong>Adapt for:</strong> + <ul> + <li data-type="language-java" class="active">Java SDK</li> + <li data-type="language-py">Python SDK</li> + </ul> +</nav> + <p>The Google Cloud Dataflow Runner uses the <a href="https://cloud.google.com/dataflow/service/dataflow-service-desc">Cloud Dataflow managed service</a>. 
When you run your pipeline with the Cloud Dataflow service, the runner uploads your executable code and dependencies to a Google Cloud Storage bucket and creates a Cloud Dataflow job, which executes your pipeline on managed resources in Google Cloud Platform.</p> <p>The Cloud Dataflow Runner and service are suitable for large scale, continuous jobs, and provide:</p> @@ -202,8 +210,7 @@ <h3 id="specify-your-dependency">Specify your dependency</h3> -<p>You must specify your dependency on the Cloud Dataflow Runner.</p> - +<p><span class="language-java">When using Java, you must specify your dependency on the Cloud Dataflow Runner in your <code class="highlighter-rouge">pom.xml</code>.</span></p> <div class="language-java highlighter-rouge"><pre class="highlight"><code><span class="o"><</span><span class="n">dependency</span><span class="o">></span> <span class="o"><</span><span class="n">groupId</span><span class="o">></span><span class="n">org</span><span class="o">.</span><span class="na">apache</span><span class="o">.</span><span class="na">beam</span><span class="o"></</span><span class="n">groupId</span><span class="o">></span> <span class="o"><</span><span class="n">artifactId</span><span class="o">></span><span class="n">beam</span><span class="o">-</span><span class="n">runners</span><span class="o">-</span><span class="n">google</span><span class="o">-</span><span class="n">cloud</span><span class="o">-</span><span class="n">dataflow</span><span class="o">-</span><span class="n">java</span><span class="o"></</span><span class="n">artifactId</span><span class="o">></span> @@ -213,6 +220,8 @@ </code></pre> </div> +<p><span class="language-py">This section is not applicable to the Beam SDK for Python.</span></p> + <h3 id="authentication">Authentication</h3> <p>Before running your pipeline, you must authenticate with the Google Cloud Platform. 
Run the following command to get <a href="https://developers.google.com/identity/protocols/application-default-credentials">Application Default Credentials</a>.</p> @@ -223,7 +232,8 @@ <h2 id="pipeline-options-for-the-cloud-dataflow-runner">Pipeline options for the Cloud Dataflow Runner</h2> -<p>When executing your pipeline with the Cloud Dataflow Runner, set these pipeline options.</p> +<p><span class="language-java">When executing your pipeline with the Cloud Dataflow Runner (Java), consider these common pipeline options.</span> +<span class="language-py">When executing your pipeline with the Cloud Dataflow Runner (Python), consider these common pipeline options.</span></p> <table class="table table-bordered"> <tr> @@ -231,39 +241,80 @@ <th>Description</th> <th>Default Value</th> </tr> + <tr> <td><code>runner</code></td> <td>The pipeline runner to use. This option allows you to determine the pipeline runner at runtime.</td> - <td>Set to <code>dataflow</code> to run on the Cloud Dataflow Service.</td> + <td>Set to <code>dataflow</code> or <code>DataflowRunner</code> to run on the Cloud Dataflow Service.</td> </tr> + <tr> <td><code>project</code></td> <td>The project ID for your Google Cloud Project.</td> <td>If not set, defaults to the default project in the current environment. The default project is set via <code>gcloud</code>.</td> </tr> -<tr> + +<!-- Only show for Java --> +<tr class="language-java"> <td><code>streaming</code></td> <td>Whether streaming mode is enabled or disabled; <code>true</code> if enabled. Set to <code>true</code> if running pipelines with unbounded <code>PCollection</code>s.</td> <td><code>false</code></td> </tr> + <tr> - <td><code>tempLocation</code></td> - <td>Optional. Path for temporary files. 
If set to a valid Google Cloud Storage URL that begins with <code>gs://</code>, <code>tempLocation</code> is used as the default value for <code>gcpTempLocation</code>.</td> + <td> + <span class="language-java"><code>tempLocation</code></span> + <span class="language-py"><code>temp_location</code></span> + </td> + <td> + <span class="language-java">Optional.</span> + <span class="language-py">Required.</span> + Path for temporary files. Must be a valid Google Cloud Storage URL that begins with <code>gs://</code>. + <span class="language-java">If set, <code>tempLocation</code> is used as the default value for <code>gcpTempLocation</code>.</span> + </td> <td>No default value.</td> </tr> -<tr> + +<!-- Only show for Java --> +<tr class="language-java"> <td><code>gcpTempLocation</code></td> <td>Cloud Storage bucket path for temporary files. Must be a valid Cloud Storage URL that begins with <code>gs://</code>.</td> <td>If not set, defaults to the value of <code>tempLocation</code>, provided that <code>tempLocation</code> is a valid Cloud Storage URL. If <code>tempLocation</code> is not a valid Cloud Storage URL, you must set <code>gcpTempLocation</code>.</td> </tr> + <tr> - <td><code>stagingLocation</code></td> + <td> + <span class="language-java"><code>stagingLocation</code></span> + <span class="language-py"><code>staging_location</code></span> + </td> <td>Optional. Cloud Storage bucket path for staging your binary and any temporary files. 
Must be a valid Cloud Storage URL that begins with <code>gs://</code>.</td> - <td>If not set, defaults to a staging directory within <code>gcpTempLocation</code>.</td> + <td> + <span class="language-java">If not set, defaults to a staging directory within <code>gcpTempLocation</code>.</span> + <span class="language-py">If not set, defaults to a staging directory within <code>temp_location</code>.</span> + </td> +</tr> + +<!-- Only show for Python --> +<tr class="language-py"> + <td><code>save_main_session</code></td> + <td>Save the main session state so that pickled functions and classes defined in <code>__main__</code> (e.g. interactive session) can be unpickled. Some workflows do not need the session state if, for instance, all of their functions/classes are defined in proper modules (not <code>__main__</code>) and the modules are importable in the worker.</td> + <td><code>false</code></td> </tr> + +<!-- Only show for Python --> +<tr class="language-py"> + <td><code>sdk_location</code></td> + <td>Override the default location from which the Beam SDK is downloaded. This value can be a URL, a Cloud Storage path, or a local path to an SDK tarball. Workflow submissions will download or copy the SDK tarball from this location. If set to the string <code>default</code>, a standard SDK location is used. 
If empty, no SDK is copied.</td> + <td><code>default</code></td> +</tr> + + </table> -<p>See the reference documentation for the <span class="language-java"><a href="/documentation/sdks/javadoc/0.6.0/index.html?org/apache/beam/runners/dataflow/options/DataflowPipelineOptions.html">DataflowPipelineOptions</a></span><span class="language-python"><a href="https://github.com/apache/beam/blob/master/sdks/python/apache_beam/utils/pipeline_options.py">PipelineOptions</a></span> interface (and its subinterfaces) for the complete list of pipeline configuration options.</p> +<p>See the reference documentation for the +<span class="language-java"><a href="/documentation/sdks/javadoc/0.6.0/index.html?org/apache/beam/runners/dataflow/options/DataflowPipelineOptions.html">DataflowPipelineOptions</a></span> +<span class="language-py"><a href="/documentation/sdks/pydoc/0.6.0/apache_beam.utils.html#apache_beam.utils.pipeline_options.PipelineOptions"><code class="highlighter-rouge">PipelineOptions</code></a></span> +interface (and any subinterfaces) for additional pipeline configuration options.</p> <h2 id="additional-information-and-caveats">Additional information and caveats</h2> @@ -273,11 +324,13 @@ <h3 id="blocking-execution">Blocking Execution</h3> -<p>To connect to your job and block until it is completed, call <code class="highlighter-rouge">waitToFinish</code> on the <code class="highlighter-rouge">PipelineResult</code> returned from <code class="highlighter-rouge">pipeline.run()</code>. The Cloud Dataflow Runner prints job status updates and console messages while it waits. While the result is connected to the active job, note that pressing <strong>Ctrl+C</strong> from the command line does not cancel your job. 
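The blocking behavior described in this hunk — <code>pipeline.run()</code> returns a result handle whose wait method polls the service until the job reaches a terminal state — can be sketched in plain Python. This is an illustration of the control flow only, not Beam code: <code>FakePipelineResult</code> and its simulated state sequence are invented for the sketch.

```python
import time

# Illustration only: a stand-in for a runner's result handle, showing the
# control flow behind a blocking wait on a remote job. The real Beam
# PipelineResult polls the Cloud Dataflow service for the job's state.
class FakePipelineResult:
    TERMINAL_STATES = {"DONE", "FAILED", "CANCELLED"}

    def __init__(self, states):
        self._states = iter(states)  # simulated sequence of service responses
        self.state = "RUNNING"

    def _refresh(self):
        # Advance to the next simulated state; hold the last one when exhausted.
        self.state = next(self._states, self.state)

    def wait_until_finish(self, poll_interval=0.01):
        # Block the caller until the job reaches a terminal state.
        while self.state not in self.TERMINAL_STATES:
            time.sleep(poll_interval)
            self._refresh()
        return self.state

result = FakePipelineResult(["RUNNING", "RUNNING", "DONE"])
print(result.wait_until_finish())  # prints DONE
```

As the surrounding text notes, interrupting this local wait (for example with Ctrl+C) stops only the polling loop; the remote Dataflow job keeps running and must be cancelled through the Monitoring or Command-line Interface.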
To cancel the job, you can use the <a href="https://cloud.google.com/dataflow/pipelines/dataflow-monitoring-intf">Dataflow Monitoring Interface</a> or the <a href="https://cloud.google.com/dataflow/pipelines/dataflow-command-line-intf">Dataflow Command-line Interface</a>.</p> +<p>To block until your job completes, call <span class="language-java"><code>waitUntilFinish</code></span><span class="language-py"><code>wait_until_finish</code></span> on the <code class="highlighter-rouge">PipelineResult</code> returned from <code class="highlighter-rouge">pipeline.run()</code>. The Cloud Dataflow Runner prints job status updates and console messages while it waits. While the result is connected to the active job, note that pressing <strong>Ctrl+C</strong> from the command line does not cancel your job. To cancel the job, you can use the <a href="https://cloud.google.com/dataflow/pipelines/dataflow-monitoring-intf">Dataflow Monitoring Interface</a> or the <a href="https://cloud.google.com/dataflow/pipelines/dataflow-command-line-intf">Dataflow Command-line Interface</a>.</p> <h3 id="streaming-execution">Streaming Execution</h3> -<p>If your pipeline uses an unbounded data source or sink, you must set the <code class="highlighter-rouge">streaming</code> option to <code class="highlighter-rouge">true</code>.</p> +<p><span class="language-java">If your pipeline uses an unbounded data source or sink, you must set the <code class="highlighter-rouge">streaming</code> option to <code class="highlighter-rouge">true</code>.</span> +<span class="language-py">The Beam SDK for Python does not currently support streaming pipelines.</span></p> + </div> http://git-wip-us.apache.org/repos/asf/beam-site/blob/394bfe70/content/documentation/runners/direct/index.html ---------------------------------------------------------------------- diff --git a/content/documentation/runners/direct/index.html b/content/documentation/runners/direct/index.html index dd1151a..15c2d8b 100644 --- 
a/content/documentation/runners/direct/index.html +++ b/content/documentation/runners/direct/index.html @@ -153,6 +153,14 @@ <div class="row"> <h1 id="using-the-direct-runner">Using the Direct Runner</h1> +<nav class="language-switcher"> + <strong>Adapt for:</strong> + <ul> + <li data-type="language-java" class="active">Java SDK</li> + <li data-type="language-py">Python SDK</li> + </ul> +</nav> + <p>The Direct Runner executes pipelines on your machine and is designed to validate that pipelines adhere to the Apache Beam model as closely as possible. Instead of focusing on efficient pipeline execution, the Direct Runner performs additional checks to ensure that users do not rely on semantics that are not guaranteed by the model. Some of these checks include:</p> <ul> @@ -166,14 +174,19 @@ <p>Here are some resources with information about how to test your pipelines.</p> <ul> - <li><a href="/blog/2016/10/20/test-stream.html">Testing Unbounded Pipelines in Apache Beam</a> talks about the use of Java classes <a href="/documentation/sdks/javadoc/0.6.0/index.html?org/apache/beam/sdk/testing/PAssert.html"><code class="highlighter-rouge">PAssert</code></a> and <a href="/documentation/sdks/javadoc/0.6.0/index.html?org/apache/beam/sdk/testing/TestStream.html"><code class="highlighter-rouge">TestStream</code></a> to test your pipelines.</li> - <li>The <a href="/get-started/wordcount-example/">Apache Beam WordCount Example</a> contains an example of logging and testing a pipeline with <a href="/documentation/sdks/javadoc/0.6.0/index.html?org/apache/beam/sdk/testing/PAssert.html"><code class="highlighter-rouge">PAssert</code></a>.</li> + <!-- Java specific links --> + <li class="language-java"><a href="/blog/2016/10/20/test-stream.html">Testing Unbounded Pipelines in Apache Beam</a> talks about the use of Java classes <a href="/documentation/sdks/javadoc/0.6.0/index.html?org/apache/beam/sdk/testing/PAssert.html">PAssert</a> and <a 
href="/documentation/sdks/javadoc/0.6.0/index.html?org/apache/beam/sdk/testing/TestStream.html">TestStream</a> to test your pipelines.</li> + <li class="language-java">The <a href="/get-started/wordcount-example/#testing-your-pipeline-via-passert">Apache Beam WordCount Example</a> contains an example of logging and testing a pipeline with <a href="/documentation/sdks/javadoc/0.6.0/index.html?org/apache/beam/sdk/testing/PAssert.html"><code>PAssert</code></a>.</li> + + <!-- Python specific links --> + <li class="language-py">You can use <a href="https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/util.py#L206">assert_that</a> to test your pipeline. The Python <a href="https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/wordcount_debugging.py">WordCount Debugging Example</a> contains an example of logging and testing with <a href="https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/util.py#L206"><code>assert_that</code></a>.</li> </ul> <h2 id="direct-runner-prerequisites-and-setup">Direct Runner prerequisites and setup</h2> -<p>You must specify your dependency on the Direct Runner.</p> +<h3 id="specify-your-dependency">Specify your dependency</h3> +<p><span class="language-java">When using Java, you must specify your dependency on the Direct Runner in your <code class="highlighter-rouge">pom.xml</code>.</span></p> <div class="language-java highlighter-rouge"><pre class="highlight"><code><span class="o"><</span><span class="n">dependency</span><span class="o">></span> <span class="o"><</span><span class="n">groupId</span><span class="o">></span><span class="n">org</span><span class="o">.</span><span class="na">apache</span><span class="o">.</span><span class="na">beam</span><span class="o"></</span><span class="n">groupId</span><span class="o">></span> <span class="o"><</span><span class="n">artifactId</span><span class="o">></span><span class="n">beam</span><span class="o">-</span><span 
class="n">runners</span><span class="o">-</span><span class="n">direct</span><span class="o">-</span><span class="n">java</span><span class="o"></</span><span class="n">artifactId</span><span class="o">></span> @@ -183,15 +196,20 @@ </code></pre> </div> +<p><span class="language-py">This section is not applicable to the Beam SDK for Python.</span></p> + <h2 id="pipeline-options-for-the-direct-runner">Pipeline options for the Direct Runner</h2> -<p>When executing your pipeline from the command-line, set <code class="highlighter-rouge">runner</code> to <code class="highlighter-rouge">direct</code>. The default values for the other pipeline options are generally sufficient.</p> +<p>When executing your pipeline from the command-line, set <code class="highlighter-rouge">runner</code> to <code class="highlighter-rouge">direct</code> or <code class="highlighter-rouge">DirectRunner</code>. The default values for the other pipeline options are generally sufficient.</p> -<p>See the reference documentation for the <span class="language-java"><a href="/documentation/sdks/javadoc/0.6.0/index.html?org/apache/beam/runners/direct/DirectOptions.html"><code class="highlighter-rouge">DirectOptions</code></a></span><span class="language-python"><a href="https://github.com/apache/beam/blob/master/sdks/python/apache_beam/utils/pipeline_options.py"><code class="highlighter-rouge">PipelineOptions</code></a></span> interface (and its subinterfaces) for defaults and the complete list of pipeline configuration options.</p> +<p>See the reference documentation for the +<span class="language-java"><a href="/documentation/sdks/javadoc/0.6.0/index.html?org/apache/beam/runners/direct/DirectOptions.html"><code class="highlighter-rouge">DirectOptions</code></a></span> +<span class="language-py"><a href="/documentation/sdks/pydoc/0.6.0/apache_beam.utils.html#apache_beam.utils.pipeline_options.DirectOptions"><code class="highlighter-rouge">DirectOptions</code></a></span> +interface for defaults and 
additional pipeline configuration options.</p> <h2 id="additional-information-and-caveats">Additional information and caveats</h2> -<p>Local execution is limited by the memory available in your local environment. It is highly recommended that you run your pipeline with data sets small enough to fit in local memory. You can create a small in-memory data set using a <span class="language-java"><a href="/documentation/sdks/javadoc/0.6.0/index.html?org/apache/beam/sdk/transforms/Create.html"><code class="highlighter-rouge">Create</code></a></span><span class="language-python"><a href="https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/core.py"><code class="highlighter-rouge">Create</code></a></span> transform, or you can use a <span class="language-java"><a href="/documentation/sdks/javadoc/0.6.0/index.html?org/apache/beam/sdk/io/Read.html"><code class="highlighter-rouge">Read</code></a></span><span class="language-python"><a href="https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/iobase.py"><code class="highlighter-rouge">Read</code></a></span> transform to work with small local or remote files.</p> +<p>Local execution is limited by the memory available in your local environment. It is highly recommended that you run your pipeline with data sets small enough to fit in local memory. 
You can create a small in-memory data set using a <span class="language-java"><a href="/documentation/sdks/javadoc/0.6.0/index.html?org/apache/beam/sdk/transforms/Create.html"><code class="highlighter-rouge">Create</code></a></span><span class="language-py"><a href="https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/core.py"><code class="highlighter-rouge">Create</code></a></span> transform, or you can use a <span class="language-java"><a href="/documentation/sdks/javadoc/0.6.0/index.html?org/apache/beam/sdk/io/Read.html"><code class="highlighter-rouge">Read</code></a></span><span class="language-py"><a href="https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/iobase.py"><code class="highlighter-rouge">Read</code></a></span> transform to work with small local or remote files.</p> </div>
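The caveat above recommends running locally against small in-memory data sets, and the testing links earlier in this hunk suggest asserting on pipeline output. That pattern can be sketched without Beam at all; `run_locally` and `assert_that` below are invented stand-ins for the idea, not Beam's `Create`/`PAssert`/`assert_that` APIs.

```python
# Illustration of the local-testing pattern from the Direct Runner docs:
# a small in-memory "collection", a transform over it, and an
# order-insensitive assertion on the result. These helpers are invented
# stand-ins for the pattern, not Beam APIs.

def run_locally(elements, transform):
    """Apply a per-element transform to an in-memory data set."""
    return [transform(e) for e in elements]

def assert_that(actual, expected):
    """Compare results irrespective of element order, as Beam's matchers do."""
    assert sorted(actual) == sorted(expected), (actual, expected)

# A tiny word-count-style check on data small enough to fit in memory.
lines = ["to be", "or not", "to be"]
words = [w for line in lines for w in line.split()]
counts = {w: words.count(w) for w in set(words)}
assert_that(list(counts.items()), [("to", 2), ("be", 2), ("or", 1), ("not", 1)])
print(counts)
```

Keeping the input this small is exactly what makes Direct-Runner-style local runs practical: the whole data set, and every intermediate result, must fit in local memory.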
