rszper commented on code in PR #27749:
URL: https://github.com/apache/beam/pull/27749#discussion_r1278182441
##########
website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md:
##########
@@ -160,3 +159,80 @@ Since serialization of the pipeline happens on the job
submission, and deseriali
To ensure this, Beam typically sets a very narrow supported version range for
pickling libraries. If for whatever reason, users cannot use the version of
`dill` or `cloudpickle` required by Beam, and choose to
install a custom version, they must also ensure that they use the same custom
version at runtime (e.g. in their custom container,
or by specifying a pipeline dependency requirement).
+
+## Control the dependencies the pipeline uses {#control-dependencies}
+
+### Pipeline environments
+
+To run a Python pipeline on a remote runner, Apache Beam translates the
pipeline into a [runner-independent
representation](https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto)
and submits it for execution. Translation happens in the **launch environment**. You can launch the pipeline from a Python virtual environment with the Apache Beam SDK installed, or with tools such as [Dataflow Flex Templates](https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates), [Notebook environments](https://cloud.google.com/dataflow/docs/guides/interactive-pipeline-development), [Apache Airflow](https://airflow.apache.org/), and more.
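For illustration, a minimal launch-environment setup on the Dataflow runner might look like the following sketch. The script name, project, region, and bucket are placeholders, not values from this page.

```bash
# Create and activate a clean virtual environment to serve as the launch environment.
python -m venv beam-launch-env
source beam-launch-env/bin/activate

# Install the Apache Beam SDK (the GCP extra is shown as an example).
pip install 'apache-beam[gcp]'

# Launching the pipeline translates it into the runner-independent
# representation in this environment and submits it to the runner.
python my_pipeline.py \
  --runner=DataflowRunner \
  --project=my-project \
  --region=us-central1 \
  --temp_location=gs://my-bucket/tmp
```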
+
+The [**runtime
environment**](https://beam.apache.org/documentation/runtime/environments/) is
the Python environment that a runner uses during pipeline execution. This
environment is where the pipeline code runs to perform data processing. The
runtime environment includes Apache Beam and pipeline runtime dependencies.
+
+### Create reproducible environments {#create-reproducible-environments}
+
+You can use several tools to build reproducible Python environments:
+
+* **Use [requirements files](https://pip.pypa.io/en/stable/user_guide/#requirements-files).** After you install your dependencies, generate a requirements file with pinned versions by running `pip freeze > requirements.txt`. To recreate the environment, install the dependencies from the requirements.txt file by running `pip install -r requirements.txt` (see the sketch after this list).
+
+* **Use [constraint files](https://pip.pypa.io/en/stable/user_guide/#constraints-files).** A constraints file restricts which versions of packages can be installed, allowing only the versions that you specify.
+
+* **Use lock files.** Use dependency management tools like [Pipenv](https://pipenv.pypa.io/en/latest/), [Poetry](https://python-poetry.org/), and [pip-tools](https://github.com/jazzband/pip-tools) to specify top-level dependencies, to generate lock files that pin all transitive dependencies to specific versions, and to create virtual environments from those lock files.
+
+* **Use Docker container images.** You can package the launch and runtime environments inside a Docker container image. If the image includes all necessary dependencies, the environment changes only when the image is rebuilt.
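As a sketch of the first three approaches, the commands below show typical usage; the file names `requirements.in` and `constraints.txt` are only examples.

```bash
# Requirements file: capture the pinned versions of everything installed
# in the current environment, then recreate the environment elsewhere.
pip freeze > requirements.txt
pip install -r requirements.txt

# Constraints file: restrict installations to the pinned versions listed
# in the file; a constraints file does not install anything by itself.
pip install apache-beam -c constraints.txt

# Lock file with pip-tools: declare top-level dependencies in requirements.in
# and compile a fully pinned lock file from them.
pip install pip-tools
pip-compile requirements.in --output-file requirements.txt
```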
+
+Use version control for the configuration files that define the environment.
+
+### Make the pipeline runtime environment reproducible
+
+When a pipeline uses a reproducible runtime environment on a remote runner, the workers on the runner use the same dependencies each time the pipeline runs. A reproducible environment is immune to side effects caused by new releases of the pipeline's direct or transitive dependencies, and it doesn't require dependency resolution at runtime.
+
+You can create a reproducible runtime environment in the following ways:
+
+* Run your pipeline in a custom container image that contains all of the pipeline's dependencies. Use the `--sdk_container_image` pipeline option.
+
+* Supply an exhaustive list of the pipeline's dependencies in the `--requirements_file` pipeline option. Use the `--prebuild_sdk_container_engine` option to perform the runtime environment initialization sequence before pipeline execution. If your dependencies don't change, reuse the prebuilt image by using the `--sdk_container_image` option. Both approaches are shown in the sketch after this list.
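For example, the two approaches might be invoked as follows. The image path and requirements file name are placeholders, and the usual project, region, and staging options are omitted for brevity.

```bash
# Approach 1: run in a custom container image that already contains
# every runtime dependency.
python my_pipeline.py \
  --runner=DataflowRunner \
  --sdk_container_image=us-docker.pkg.dev/my-project/my-repo/beam-sdk:1.0

# Approach 2: list all dependencies explicitly and prebuild the runtime
# environment before the pipeline starts.
python my_pipeline.py \
  --runner=DataflowRunner \
  --requirements_file=requirements.txt \
  --prebuild_sdk_container_engine=cloud_build
```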
+
+A self-contained runtime environment is usually reproducible. To check if the
runtime environment is self-contained, restrict internet access to PyPI in the
pipeline runtime. If you use the Dataflow Runner, see the documentation for the
[`--no_use_public_ips`](https://cloud.google.com/dataflow/docs/guides/routes-firewall#turn_off_external_ip_address)
pipeline option.
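As a rough check on Dataflow, such a run might look like the following sketch; it assumes the workers have no other route to the public internet, and the image path is a placeholder.

```bash
# With external IP addresses turned off, workers cannot reach PyPI.
# If the pipeline still runs, the runtime environment does not rely on
# downloading packages at runtime.
python my_pipeline.py \
  --runner=DataflowRunner \
  --sdk_container_image=us-docker.pkg.dev/my-project/my-repo/beam-sdk:1.0 \
  --no_use_public_ips
```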
+
+If you need to recreate or upgrade the runtime environment, do so in a
controlled way with visibility into changed dependencies:
+
+* Do not modify container images when running pipelines are still using them.
Review Comment:
```suggestion
* Do not modify container images when they are in use by running pipelines.
```