Regenerate website
Project: http://git-wip-us.apache.org/repos/asf/beam-site/repo Commit: http://git-wip-us.apache.org/repos/asf/beam-site/commit/e627b278 Tree: http://git-wip-us.apache.org/repos/asf/beam-site/tree/e627b278 Diff: http://git-wip-us.apache.org/repos/asf/beam-site/diff/e627b278 Branch: refs/heads/asf-site Commit: e627b27880ea4b7159063de5f0eab1bdd59a511b Parents: 2dd2c59 Author: Ahmet Altay <al...@google.com> Authored: Fri Feb 10 12:05:21 2017 -0800 Committer: Ahmet Altay <al...@google.com> Committed: Fri Feb 10 12:05:21 2017 -0800 ---------------------------------------------------------------------- .../python-pipeline-dependencies/index.html | 316 +++++++++++++++++++ content/documentation/sdks/python/index.html | 3 + 2 files changed, 319 insertions(+) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/beam-site/blob/e627b278/content/documentation/sdks/python-pipeline-dependencies/index.html ---------------------------------------------------------------------- diff --git a/content/documentation/sdks/python-pipeline-dependencies/index.html b/content/documentation/sdks/python-pipeline-dependencies/index.html new file mode 100644 index 0000000..4107f5d --- /dev/null +++ b/content/documentation/sdks/python-pipeline-dependencies/index.html @@ -0,0 +1,316 @@ +<!DOCTYPE html> +<html lang="en"> + + <head> + <meta charset="utf-8"> + <meta http-equiv="X-UA-Compatible" content="IE=edge"> + <meta name="viewport" content="width=device-width, initial-scale=1"> + + <title>Managing Python Pipeline Dependencies</title> + <meta name="description" content="Apache Beam is an open source, unified model and set of language-specific SDKs for defining and executing data processing workflows, and also data ingestion and integration flows, supporting Enterprise Integration Patterns (EIPs) and Domain Specific Languages (DSLs). 
Dataflow pipelines simplify the mechanics of large-scale batch and streaming data processing and can run on a number of runtimes like Apache Flink, Apache Spark, and Google Cloud Dataflow (a cloud service). Beam also brings DSL in different languages, allowing users to easily implement their data integration processes. +"> + + <link rel="stylesheet" href="/styles/site.css"> + <link rel="stylesheet" href="/css/theme.css"> + <script src="https://ajax.googleapis.com/ajax/libs/jquery/2.2.0/jquery.min.js"></script> + <script src="/js/bootstrap.min.js"></script> + <script src="/js/language-switch.js"></script> + <link rel="canonical" href="https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/" data-proofer-ignore> + <link rel="alternate" type="application/rss+xml" title="Apache Beam" href="https://beam.apache.org/feed.xml"> + <script> + (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){ + (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o), + m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) + })(window,document,'script','//www.google-analytics.com/analytics.js','ga'); + + ga('create', 'UA-73650088-1', 'auto'); + ga('send', 'pageview'); + + </script> + <link rel="shortcut icon" type="image/x-icon" href="/images/favicon.ico"> +</head> + + + <body role="document"> + + <nav class="navbar navbar-default navbar-fixed-top"> + <div class="container"> + <div class="navbar-header"> + <a href="/" class="navbar-brand" > + <img alt="Brand" style="height: 25px" src="/images/beam_logo_navbar.png"> + </a> + <button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar" aria-expanded="false" aria-controls="navbar"> + <span class="sr-only">Toggle navigation</span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + </button> + </div> + <div id="navbar" class="navbar-collapse collapse"> + <ul class="nav 
navbar-nav"> + <li class="dropdown"> + <a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-haspopup="true" aria-expanded="false">Get Started <span class="caret"></span></a> + <ul class="dropdown-menu"> + <li><a href="/get-started/beam-overview/">Beam Overview</a></li> + <li><a href="/get-started/quickstart-java/">Quickstart - Java</a></li> + <li><a href="/get-started/quickstart-py/">Quickstart - Python</a></li> + <li role="separator" class="divider"></li> + <li class="dropdown-header">Example Walkthroughs</li> + <li><a href="/get-started/wordcount-example/">WordCount</a></li> + <li><a href="/get-started/mobile-gaming-example/">Mobile Gaming</a></li> + <li role="separator" class="divider"></li> + <li class="dropdown-header">Resources</li> + <li><a href="/get-started/downloads">Downloads</a></li> + <li><a href="/get-started/support">Support</a></li> + </ul> + </li> + <li class="dropdown"> + <a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-haspopup="true" aria-expanded="false">Documentation <span class="caret"></span></a> + <ul class="dropdown-menu"> + <li><a href="/documentation">Using the Documentation</a></li> + <li role="separator" class="divider"></li> + <li class="dropdown-header">Beam Concepts</li> + <li><a href="/documentation/programming-guide/">Programming Guide</a></li> + <li><a href="/documentation/resources/">Additional Resources</a></li> + <li role="separator" class="divider"></li> + <li class="dropdown-header">Pipeline Fundamentals</li> + <li><a href="/documentation/pipelines/design-your-pipeline/">Design Your Pipeline</a></li> + <li><a href="/documentation/pipelines/create-your-pipeline/">Create Your Pipeline</a></li> + <li><a href="/documentation/pipelines/test-your-pipeline/">Test Your Pipeline</a></li> + <li role="separator" class="divider"></li> + <li class="dropdown-header">SDKs</li> + <li><a href="/documentation/sdks/java/">Java SDK</a></li> + <li><a href="/documentation/sdks/javadoc/0.5.0/" 
target="_blank">Java SDK API Reference <img src="/images/external-link-icon.png" + width="14" height="14" + alt="External link."></a> + </li> + <li><a href="/documentation/sdks/python/">Python SDK</a></li> + <li role="separator" class="divider"></li> + <li class="dropdown-header">Runners</li> + <li><a href="/documentation/runners/capability-matrix/">Capability Matrix</a></li> + <li><a href="/documentation/runners/direct/">Direct Runner</a></li> + <li><a href="/documentation/runners/apex/">Apache Apex Runner</a></li> + <li><a href="/documentation/runners/flink/">Apache Flink Runner</a></li> + <li><a href="/documentation/runners/spark/">Apache Spark Runner</a></li> + <li><a href="/documentation/runners/dataflow/">Cloud Dataflow Runner</a></li> + </ul> + </li> + <li class="dropdown"> + <a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-haspopup="true" aria-expanded="false">Contribute <span class="caret"></span></a> + <ul class="dropdown-menu"> + <li><a href="/contribute">Get Started Contributing</a></li> + <li role="separator" class="divider"></li> + <li class="dropdown-header">Guides</li> + <li><a href="/contribute/contribution-guide/">Contribution Guide</a></li> + <li><a href="/contribute/testing/">Testing Guide</a></li> + <li><a href="/contribute/release-guide/">Release Guide</a></li> + <li><a href="/contribute/ptransform-style-guide/">PTransform Style Guide</a></li> + <li role="separator" class="divider"></li> + <li class="dropdown-header">Technical References</li> + <li><a href="/contribute/design-principles/">Design Principles</a></li> + <li><a href="/contribute/work-in-progress/">Ongoing Projects</a></li> + <li><a href="/contribute/source-repository/">Source Repository</a></li> + <li role="separator" class="divider"></li> + <li class="dropdown-header">Promotion</li> + <li><a href="/contribute/presentation-materials/">Presentation Materials</a></li> + <li><a href="/contribute/logos/">Logos and Design</a></li> + <li role="separator" 
class="divider"></li> + <li><a href="/contribute/maturity-model/">Maturity Model</a></li> + <li><a href="/contribute/team/">Team</a></li> + </ul> + </li> + + <li><a href="/blog">Blog</a></li> + </ul> + <ul class="nav navbar-nav navbar-right"> + <li class="dropdown"> + <a href="#" class="dropdown-toggle" data-toggle="dropdown" role="button" aria-haspopup="true" aria-expanded="false"><img src="https://www.apache.org/foundation/press/kit/feather_small.png" alt="Apache Logo" style="height:24px;">Apache Software Foundation<span class="caret"></span></a> + <ul class="dropdown-menu dropdown-menu-right"> + <li><a href="http://www.apache.org/">ASF Homepage</a></li> + <li><a href="http://www.apache.org/licenses/">License</a></li> + <li><a href="http://www.apache.org/security/">Security</a></li> + <li><a href="http://www.apache.org/foundation/thanks.html">Thanks</a></li> + <li><a href="http://www.apache.org/foundation/sponsorship.html">Sponsorship</a></li> + <li><a href="https://www.apache.org/foundation/policies/conduct">Code of Conduct</a></li> + </ul> + </li> + </ul> + </div><!--/.nav-collapse --> + </div> +</nav> + + +<link rel="stylesheet" href=""> + + + <div class="container" role="main"> + + <div class="row"> + <h1 id="managing-python-pipeline-dependencies">Managing Python Pipeline Dependencies</h1> + +<blockquote> + <p><strong>Note:</strong> This page is only applicable to runners that do remote execution.</p> +</blockquote> + +<p>When you run your pipeline locally, the packages that your pipeline depends on are available because they are installed on your local machine. However, when you want to run your pipeline remotely, you must make sure these dependencies are available on the remote machines. This tutorial shows you how to make your dependencies available to the remote workers. 
Each section below refers to a different source that your package may have been installed from.</p> + +<p><strong>Note:</strong> Remote workers used for pipeline execution typically have a standard Python 2.7 distribution installation. If your code relies only on standard Python packages, then you probably don’t need to do anything on this page.</p> + +<h2 id="a-namepypiapypi-dependencies"><a name="pypi"></a>PyPI Dependencies</h2> + +<p>If your pipeline uses public packages from the <a href="https://pypi.python.org/pypi">Python Package Index</a>, make these packages available remotely by performing the following steps:</p> + +<p><strong>Note:</strong> If your PyPI package depends on a non-Python package (e.g. a package that requires installation on Linux using the <code class="highlighter-rouge">apt-get install</code> command), see the <a href="#nonpython">PyPI Dependencies with Non-Python Dependencies</a> section instead.</p> + +<ol> + <li> + <p>Find out which packages are installed on your machine. Run the following command:</p> + + <div class="highlighter-rouge"><pre class="highlight"><code> pip freeze > requirements.txt +</code></pre> + </div> + + <p>This command creates a <code class="highlighter-rouge">requirements.txt</code> file that lists all packages that are installed on your machine, regardless of where they were installed from.</p> + </li> + <li> + <p>Edit the <code class="highlighter-rouge">requirements.txt</code> file and leave only the packages that were installed from PyPI and are used in the workflow source. 
Delete all packages that are not relevant to your code.</p> + </li> + <li> + <p>Run your pipeline with the following command-line option:</p> + + <div class="highlighter-rouge"><pre class="highlight"><code> --requirements_file requirements.txt +</code></pre> + </div> + + <p>The runner will use the <code class="highlighter-rouge">requirements.txt</code> file to install your additional dependencies onto the remote workers.</p> + </li> +</ol> + +<p><strong>Important:</strong> Remote workers will install all packages listed in the <code class="highlighter-rouge">requirements.txt</code> file. Because of this, it’s very important that you delete non-PyPI packages from the <code class="highlighter-rouge">requirements.txt</code> file, as stated in step 2. If you don’t remove non-PyPI packages, the remote workers will fail when attempting to install packages from sources that are unknown to them.</p> + +<h2 id="a-namelocalnonpypialocal-or-non-pypi-dependencies"><a name="localnonpypi"></a>Local or non-PyPI Dependencies</h2> + +<p>If your pipeline uses packages that are not available publicly (e.g. packages that you’ve downloaded from a GitHub repo), make these packages available remotely by performing the following steps:</p> + +<ol> + <li> + <p>Identify which packages are installed on your machine and are not public. Run the following command:</p> + + <div class="highlighter-rouge"><pre class="highlight"><code> pip freeze +</code></pre> + </div> + + <p>This command lists all packages that are installed on your machine, regardless of where they were installed from.</p> + </li> + <li> + <p>Run your pipeline with the following command-line option:</p> + + <div class="highlighter-rouge"><pre class="highlight"><code> --extra_package /path/to/package/package-name +</code></pre> + </div> + </li> +</ol> + +<h2 id="a-namemultfilesamultiple-file-dependencies"><a name="multfiles"></a>Multiple File Dependencies</h2> + +<p>Often, your pipeline code spans multiple files. 
To run your project remotely, you must group these files as a Python package and specify the package when you run your pipeline. When the remote workers start, they will install your package. To group your files as a Python package and make it available remotely, perform the following steps:</p> + +<ol> + <li> + <p>Create a <a href="https://pythonhosted.org/an_example_pypi_project/setuptools.html">setup.py</a> file for your project. The following is a very basic <code class="highlighter-rouge">setup.py</code> file.</p> + + <div class="highlighter-rouge"><pre class="highlight"><code> import setuptools + + setuptools.setup( + name='PACKAGE-NAME', + version='PACKAGE-VERSION', + install_requires=[], + packages=setuptools.find_packages(), + ) +</code></pre> + </div> + </li> + <li> + <p>Structure your project so that the root directory contains the <code class="highlighter-rouge">setup.py</code> file, the main workflow file, and a directory with the rest of the files.</p> + + <div class="highlighter-rouge"><pre class="highlight"><code> root_dir/ + setup.py + main.py + other_files_dir/ +</code></pre> + </div> + + <p>See <a href="https://github.com/apache/beam/tree/master/sdks/python/apache_beam/examples/complete/juliaset">Juliaset</a> for an example that follows this required project structure.</p> + </li> + <li> + <p>Run your pipeline with the following command-line option:</p> + + <div class="highlighter-rouge"><pre class="highlight"><code> --setup_file /path/to/setup.py +</code></pre> + </div> + </li> +</ol> + +<p><strong>Note:</strong> If you <a href="#pypi">created a requirements.txt file</a> and your project spans multiple files, you can get rid of the <code class="highlighter-rouge">requirements.txt</code> file and instead add all packages contained in <code class="highlighter-rouge">requirements.txt</code> to the <code class="highlighter-rouge">install_requires</code> field of the setup call (in step 1).</p> + +<h2 
id="a-namenonpythonanon-python-dependencies-or-pypi-dependencies-with-non-python-dependencies"><a name="nonpython"></a>Non-Python Dependencies or PyPI Dependencies with Non-Python Dependencies</h2> + +<p>If your pipeline uses non-Python packages (e.g. packages that require installation using the <code class="highlighter-rouge">apt-get install</code> command), or uses a PyPI package that depends on non-Python dependencies during package installation, you must perform the following steps.</p> + +<ol> + <li> + <p>Add the required installation commands (e.g. the <code class="highlighter-rouge">apt-get install</code> commands) for the non-Python dependencies to the list of <code class="highlighter-rouge">CUSTOM_COMMANDS</code> in your <code class="highlighter-rouge">setup.py</code> file. See the <a href="https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/complete/juliaset/setup.py">Juliaset setup.py</a> for an example.</p> + + <p><strong>Note:</strong> You must make sure that these commands are runnable on the remote worker (e.g. 
if you use <code class="highlighter-rouge">apt-get</code>, the remote worker needs <code class="highlighter-rouge">apt-get</code> support).</p> + </li> + <li> + <p>If you are using a PyPI package that depends on non-Python dependencies, add <code class="highlighter-rouge">['pip', 'install', '&lt;your PyPI package&gt;']</code> to the list of <code class="highlighter-rouge">CUSTOM_COMMANDS</code> in your <code class="highlighter-rouge">setup.py</code> file.</p> + </li> + <li> + <p>Structure your project so that the root directory contains the <code class="highlighter-rouge">setup.py</code> file, the main workflow file, and a directory with the rest of the files.</p> + + <div class="highlighter-rouge"><pre class="highlight"><code> root_dir/ + setup.py + main.py + other_files_dir/ +</code></pre> + </div> + + <p>See the <a href="https://github.com/apache/beam/tree/master/sdks/python/apache_beam/examples/complete/juliaset">Juliaset</a> project for an example that follows this required project structure.</p> + </li> + <li> + <p>Run your pipeline with the following command-line option:</p> + + <div class="highlighter-rouge"><pre class="highlight"><code> --setup_file /path/to/setup.py +</code></pre> + </div> + </li> +</ol> + +<p><strong>Note:</strong> Because custom commands execute after the dependencies for your workflow are installed (by <code class="highlighter-rouge">pip</code>), you should omit the PyPI package dependency from the pipeline’s <code class="highlighter-rouge">requirements.txt</code> file and from the <code class="highlighter-rouge">install_requires</code> parameter in the <code class="highlighter-rouge">setuptools.setup()</code> call of your <code class="highlighter-rouge">setup.py</code> file.</p> + + + </div> + + + <hr> + <div class="row"> + <div class="col-xs-12"> + <footer> + <p class="text-center"> + © Copyright + <a href="http://www.apache.org">The Apache Software Foundation</a>, + 2017. All Rights Reserved. 
+ </p> + <p class="text-center"> + <a href="/privacy_policy">Privacy Policy</a> | + <a href="/feed.xml">RSS Feed</a> + </p> + </footer> + </div> + </div> + <!-- container div end --> +</div> + + + </body> + +</html> http://git-wip-us.apache.org/repos/asf/beam-site/blob/e627b278/content/documentation/sdks/python/index.html ---------------------------------------------------------------------- diff --git a/content/documentation/sdks/python/index.html b/content/documentation/sdks/python/index.html index 3924ebe..6ae91b0 100644 --- a/content/documentation/sdks/python/index.html +++ b/content/documentation/sdks/python/index.html @@ -160,6 +160,9 @@ <p>Python is a dynamically-typed language with no static type checking. The Beam SDK for Python uses type hints during pipeline construction and runtime to try to emulate the correctness guarantees achieved by true static typing. <a href="/documentation/sdks/python-type-safety">Ensuring Python Type Safety</a> walks through how to use type hints, which help you to catch potential bugs up front with the <a href="/documentation/runners/direct/">Direct Runner</a>.</p> +<h2 id="managing-python-pipeline-dependencies">Managing Python Pipeline Dependencies</h2> + +<p>When you run your pipeline locally, the packages that your pipeline depends on are available because they are installed on your local machine. However, when you want to run your pipeline remotely, you must make sure these dependencies are available on the remote machines. <a href="/documentation/sdks/python-pipeline-dependencies">Managing Python Pipeline Dependencies</a> shows you how to make your dependencies available to the remote workers.</p> </div>
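
Step 2 of the "PyPI Dependencies" section in the page added by this commit asks you to trim the `pip freeze` output by hand. The following is a minimal sketch of how that trimming might be scripted. The function name `keep_pypi_requirements` and the filtering rules are assumptions for illustration and are not part of Beam or pip. A pinned `name==version` line does not prove that the package came from PyPI, so the result still needs a manual review.

```python
def keep_pypi_requirements(freeze_lines, used_packages):
    """Keep pinned 'name==version' lines for packages the pipeline uses.

    Hypothetical helper: freeze_lines is the output of `pip freeze` split
    into lines, used_packages is the list of package names you import.
    """
    used = {name.lower() for name in used_packages}
    kept = []
    for line in freeze_lines:
        line = line.strip()
        # Skip blanks, comments, editable installs, and direct URL/path references.
        if not line or line.startswith(("#", "-e ")) or " @ " in line:
            continue
        name, sep, _version = line.partition("==")
        # Keep only pinned entries whose name is in the "used" set.
        if sep and name.strip().lower() in used:
            kept.append(line)
    return kept


if __name__ == "__main__":
    sample = [
        "numpy==1.12.0",
        "-e git+https://github.com/example/private.git#egg=private",
        "requests==2.13.0",
    ]
    # Write the trimmed list where --requirements_file can pick it up.
    with open("requirements.txt", "w") as handle:
        handle.write("\n".join(keep_pypi_requirements(sample, ["numpy"])) + "\n")
```

The editable (`-e`) entry is dropped because remote workers cannot install packages from sources unknown to them. Such packages belong under `--extra_package` instead.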