[https://issues.apache.org/jira/browse/BEAM-13314?focusedWorklogId=744367&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-744367]
ASF GitHub Bot logged work on BEAM-13314:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 18/Mar/22 19:39
Start Date: 18/Mar/22 19:39
Worklog Time Spent: 10m
Work Description: AnandInguva commented on a change in pull request #16938:
URL: https://github.com/apache/beam/pull/16938#discussion_r830301042
##########
File path: website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md
##########
@@ -136,17 +137,23 @@ If your pipeline uses non-Python packages (e.g. packages that require installati
## Pre-building SDK container image
-In the pre-building step, we install pipeline dependencies on the container image prior to the job submission. This would speed up the pipeline execution.\
-To use pre-building the dependencies from `requirements.txt` on the container image. Follow the steps below.
-1. Provide the container engine. We support `local_docker` and `cloud_build`(requires a GCP project with Cloud Build API enabled).
+In pipeline execution modes where a Beam runner launches SDK workers in Docker containers, the additional pipeline dependencies (specified via `--requirements_file` and other runtime options) are installed into the containers at runtime. This can increase the worker startup time.
+However, it may be possible to pre-build the SDK containers and perform the dependency installation once before the workers start. To pre-build the container image before pipeline submission, provide the pipeline options mentioned below.
+1. Provide the container engine. We support `local_docker` (requires a local installation of Docker) and `cloud_build` (requires a GCP project with the Cloud Build API enabled).
+
+    --prebuild_sdk_container_engine=<container_engine>
+2. To pass a base image for pre-building dependencies, provide `--sdk_container_image`. If not provided, Apache Beam's base [image](https://hub.docker.com/search?q=apache%2Fbeam&type=image) will be used.
-    --prebuild_sdk_container_engine <execution_environment>
-2. To pass a base image for pre-building dependencies, enable this flag. If not, apache beam's base image would be used.
+    --sdk_container_image=<location_to_base_image>
+3. If using the `local_docker` engine, provide a URL for the remote registry to which the image will be pushed by passing
+
+    --docker_registry_push_url=<remote_registry_url>
-    --sdk_container_image <location_to_base_image>
-3. To push the container image, pre-built locally with `local_docker` , to a remote repository(eg: docker registry), provide URL to the remote registry by passing
+    # Example: --docker_registry_push_url=<registry_name>/beam
Review comment:
I added an example. Would this be enough, or is this more complicated?
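For context, a minimal sketch of how the pre-building options discussed in this hunk might be passed together from Python. This is not from the PR; the runner, project, region, and commented-out image values are placeholders:

    # Sketch: submitting a pipeline with SDK container pre-building enabled.
    # Assumes Apache Beam with GCP extras is installed; values are placeholders.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions([
        "--runner=DataflowRunner",
        "--project=my-gcp-project",        # placeholder GCP project
        "--region=us-central1",            # placeholder region
        "--requirements_file=requirements.txt",
        "--prebuild_sdk_container_engine=cloud_build",
        # Optional base image to build on; Beam's released image is the default.
        # "--sdk_container_image=apache/beam_python3.9_sdk:2.37.0",
    ])

    with beam.Pipeline(options=options) as pipeline:
        _ = pipeline | beam.Create([1, 2, 3]) | beam.Map(lambda x: x * x)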
##########
File path: website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md
##########
@@ -136,17 +137,23 @@ If your pipeline uses non-Python packages (e.g. packages that require installati
## Pre-building SDK container image
-In the pre-building step, we install pipeline dependencies on the container image prior to the job submission. This would speed up the pipeline execution.\
-To use pre-building the dependencies from `requirements.txt` on the container image. Follow the steps below.
-1. Provide the container engine. We support `local_docker` and `cloud_build`(requires a GCP project with Cloud Build API enabled).
+In pipeline execution modes where a Beam runner launches SDK workers in Docker containers, the additional pipeline dependencies (specified via `--requirements_file` and other runtime options) are installed into the containers at runtime. This can increase the worker startup time.
+However, it may be possible to pre-build the SDK containers and perform the dependency installation once before the workers start. To pre-build the container image before pipeline submission, provide the pipeline options mentioned below.
+1. Provide the container engine. We support `local_docker` (requires a local installation of Docker) and `cloud_build` (requires a GCP project with the Cloud Build API enabled).
+
+    --prebuild_sdk_container_engine=<container_engine>
+2. To pass a base image for pre-building dependencies, provide `--sdk_container_image`. If not provided, Apache Beam's base [image](https://hub.docker.com/search?q=apache%2Fbeam&type=image) will be used.
-    --prebuild_sdk_container_engine <execution_environment>
-2. To pass a base image for pre-building dependencies, enable this flag. If not, apache beam's base image would be used.
+    --sdk_container_image=<location_to_base_image>
+3. If using the `local_docker` engine, provide a URL for the remote registry to which the image will be pushed by passing
+
+    --docker_registry_push_url=<remote_registry_url>
-    --sdk_container_image <location_to_base_image>
-3. To push the container image, pre-built locally with `local_docker` , to a remote repository(eg: docker registry), provide URL to the remote registry by passing
+    # Example: --docker_registry_push_url=<registry_name>/beam
+    # The pre-built image will be pushed to <registry_name>/beam/beam_python_prebuilt_sdk:<unique_image_tag>
+    # The <unique_image_tag> is generated by the Beam SDK.
-    --docker_registry_push_url <IMAGE_URL>
+**NOTE:** `docker_registry_push_url` must point to a remote registry.
Review comment:
@y1chi if the user uses pre-building and doesn't provide `docker_registry_push_url`, what would happen in this case?
I recall it would fail with an error something like `Couldn't find the Docker image`. If this is the case, we need to make sure that the user provides a remote registry URL. If the user doesn't provide it, can we fail the pipeline prior to job submission?
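As an illustration of the fail-fast check suggested above, a hypothetical pre-submission validation might look like the following. This is a sketch, not Beam's actual validation logic, and it assumes both options are exposed as attributes of `SetupOptions`:

    # Hypothetical sketch of the early validation suggested in the comment
    # above; not Beam's actual implementation.
    from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

    def validate_prebuild_options(options: PipelineOptions) -> None:
        setup = options.view_as(SetupOptions)
        # Assumption: both flags live on SetupOptions.
        if (setup.prebuild_sdk_container_engine == "local_docker"
                and not setup.docker_registry_push_url):
            raise ValueError(
                "Pre-building with local_docker requires "
                "--docker_registry_push_url pointing to a remote registry, "
                "so that remote workers can pull the pre-built image.")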
##########
File path: website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md
##########
@@ -45,6 +45,16 @@ If your pipeline uses public packages from the [Python Package Index](https://py
The runner will use the `requirements.txt` file to install your additional dependencies onto the remote workers.
**Important:** Remote workers will install all packages listed in the `requirements.txt` file. Because of this, it's very important that you delete non-PyPI packages from the `requirements.txt` file, as stated in step 2. If you don't remove non-PyPI packages, the remote workers will fail when attempting to install packages from sources that are unknown to them.
+> **NOTE**: An alternative to `pip freeze` is to use a library like [pip-tools](https://github.com/jazzband/pip-tools) to compile all the dependencies required for the pipeline from a `--requirements_file` that lists only top-level dependencies.
+## Custom Containers {#custom-containers}
Review comment:
Done
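For reference, a minimal sketch of the pip-tools workflow mentioned in that note. The file names and package list are hypothetical, and it assumes pip-tools is installed (`pip install pip-tools`); it simply drives the documented `pip-compile` command from Python:

    # Sketch: compile a fully pinned requirements.txt from a hand-maintained
    # requirements.in that lists only top-level dependencies.
    import pathlib
    import subprocess

    # requirements.in holds only the top-level dependencies (example contents).
    pathlib.Path("requirements.in").write_text("apache-beam[gcp]\npandas\n")

    # pip-compile resolves and pins the full transitive dependency set.
    subprocess.run(
        ["pip-compile", "requirements.in", "--output-file", "requirements.txt"],
        check=True,
    )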
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 744367)
Time Spent: 7h 10m (was: 7h)
> Revise recommendations to manage Python pipeline dependencies.
> ---------------------------------------------------------------
>
> Key: BEAM-13314
> URL: https://issues.apache.org/jira/browse/BEAM-13314
> Project: Beam
> Issue Type: Improvement
> Components: sdk-py-core, website
> Reporter: Valentyn Tymofieiev
> Assignee: Anand Inguva
> Priority: P2
> Labels: usability
> Time Spent: 7h 10m
> Remaining Estimate: 0h
>
> The page https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/ recommends managing Python dependencies via requirements files.
> This approach is currently inefficient in light of the introduction and adoption of PEP-517 by some packages (see https://lists.apache.org/thread/trljnxo39c0cmff790yff3h8n5okqt3q and the rest of the thread), and it does not mention Custom Containers or SDK prebuilding workflows.
>
> We should revise it and document best practices.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)