tvalentyn commented on code in PR #26331:
URL: https://github.com/apache/beam/pull/26331#discussion_r1173811477
########## website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md: ##########
@@ -134,20 +134,29 @@ If your pipeline uses non-Python packages (e.g. packages that require installati
 **Note:** Because custom commands execute after the dependencies for your workflow are installed (by `pip`), you should omit the PyPI package dependency from the pipeline's `requirements.txt` file and from the `install_requires` parameter in the `setuptools.setup()` call of your `setup.py` file.

-## Pre-building SDK container image
+## Pre-building SDK Container Image

 In pipeline execution modes where a Beam runner launches SDK workers in Docker containers, the additional pipeline dependencies (specified via `--requirements_file` and other runtime options) are installed into the containers at runtime. This can increase the worker startup time.
 However, it may be possible to pre-build the SDK containers and perform the dependency installation once before the workers start with `--prebuild_sdk_container_engine`.
 For instructions of how to use pre-building with Google Cloud Dataflow, see [Pre-building the python SDK custom container image with extra dependencies](https://cloud.google.com/dataflow/docs/guides/using-custom-containers#prebuild). **NOTE**: This feature is available only for the `Dataflow Runner v2`.

-## Pickling and Managing Main Session
+## Pickling and Managing the Main Session

-Pickling in the Python SDK is set up to pickle the state of the global namespace. By default, global imports, functions, and variables defined in the main session are not saved during the serialization of a Beam job.
-Thus, one might encounter an unexpected `NameError` when running a `DoFn` on any remote runner using portability. To resolve this, manage the main session by
-simply setting the main session. This will load the pickled state of the global namespace onto the Dataflow workers.
+When the Python SDK submits the pipeline for execution to a remote runner, the pipeline contents, such as transform user code, is serialized (or pickled) into a bytecode using
+libraries that perform the serialization (also called picklers). The default pickler library used by Beam is `dill`.
+To use the `cloudpickle` pickler, supply the `--pickle_library=cloudpickle` pipeline option.

Review Comment:
```suggestion
To use the `cloudpickle` pickler, supply the `--pickle_library=cloudpickle` pipeline option. The `cloudpickle` support is currently [experimental](https://github.com/apache/beam/issues/21298).
```
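To make the pre-building paragraph in the hunk above concrete, here is a minimal, hedged sketch of passing the relevant options programmatically. Only `--requirements_file` and `--prebuild_sdk_container_engine` appear in the doc text; the runner and engine values are illustrative assumptions, and Dataflow may require additional prebuild options described in the linked guide.

```python
# Hedged sketch (not from the PR): pre-building the SDK container image so
# dependency installation runs once at image-build time instead of at
# worker startup. The runner and engine values below are assumptions;
# consult the linked Dataflow guide for the options your setup requires.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--runner=DataflowRunner',
    '--requirements_file=requirements.txt',
    '--prebuild_sdk_container_engine=cloud_build',  # assumed engine value
])
```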
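And for the `--pickle_library` option that the suggestion above documents, a small runnable sketch of my own (not part of the PR) that switches the pickler to `cloudpickle`; everything except the flag itself is illustrative.

```python
# Hedged sketch: selecting the cloudpickle pickler via a pipeline option.
# Only --pickle_library=cloudpickle comes from the doc text; the rest is a
# toy pipeline used to show where the option is supplied.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(['--pickle_library=cloudpickle'])

with beam.Pipeline(options=options) as p:
    (p
     | beam.Create(['hello', 'world'])
     | beam.Map(str.upper)   # user code that gets pickled and shipped to workers
     | beam.Map(print))
```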
