tvalentyn commented on code in PR #26331:
URL: https://github.com/apache/beam/pull/26331#discussion_r1173811477
########## website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md: ##########
@@ -134,20 +134,29 @@ If your pipeline uses non-Python packages (e.g. packages that require installati
 **Note:** Because custom commands execute after the dependencies for your workflow are installed (by `pip`), you should omit the PyPI package dependency from the pipeline's `requirements.txt` file and from the `install_requires` parameter in the `setuptools.setup()` call of your `setup.py` file.

-## Pre-building SDK container image
+## Pre-building SDK Container Image

 In pipeline execution modes where a Beam runner launches SDK workers in Docker containers, the additional pipeline dependencies (specified via `--requirements_file` and other runtime options) are installed into the containers at runtime. This can increase the worker startup time.
 However, it may be possible to pre-build the SDK containers and perform the dependency installation once before the workers start with `--prebuild_sdk_container_engine`.
 For instructions of how to use pre-building with Google Cloud Dataflow, see [Pre-building the python SDK custom container image with extra dependencies](https://cloud.google.com/dataflow/docs/guides/using-custom-containers#prebuild). **NOTE**: This feature is available only for the `Dataflow Runner v2`.

-## Pickling and Managing Main Session
+## Pickling and Managing the Main Session

-Pickling in the Python SDK is set up to pickle the state of the global namespace. By default, global imports, functions, and variables defined in the main session are not saved during the serialization of a Beam job.
-Thus, one might encounter an unexpected `NameError` when running a `DoFn` on any remote runner using portability. To resolve this, manage the main session by
-simply setting the main session. This will load the pickled state of the global namespace onto the Dataflow workers.
+When the Python SDK submits the pipeline for execution to a remote runner, the pipeline contents, such as transform user code, is serialized (or pickled) into a bytecode using
+libraries that perform the serialization (also called picklers). The default pickler library used by Beam is `dill`.
+To use the `cloudpickle` pickler, supply the `--pickle_library=cloudpickle` pipeline option.

Review Comment:
```suggestion
To use the `cloudpickle` pickler, supply the `--pickle_library=cloudpickle` pipeline option. The `cloudpickle` support is currently [experimental](https://github.com/apache/beam/issues/21298).
```
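To make the pre-building paragraph in the hunk above concrete, here is a minimal, hedged sketch of passing the relevant options programmatically. Only `--requirements_file` and `--prebuild_sdk_container_engine` appear in the doc text; the runner and engine values are illustrative assumptions, and Dataflow may require additional prebuild options described in the linked guide.

```python
# Hedged sketch (not from the PR): pre-building the SDK container image so
# dependency installation runs once at image-build time instead of at
# worker startup. The runner and engine values below are assumptions;
# consult the linked Dataflow guide for the options your setup requires.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--runner=DataflowRunner',
    '--requirements_file=requirements.txt',
    '--prebuild_sdk_container_engine=cloud_build',  # assumed engine value
])
```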
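And for the `--pickle_library` option that the suggestion above documents, a small runnable sketch of my own (not part of the PR) that switches the pickler to `cloudpickle`; everything except the flag itself is illustrative.

```python
# Hedged sketch: selecting the cloudpickle pickler via a pipeline option.
# Only --pickle_library=cloudpickle comes from the doc text; the rest is a
# toy pipeline used to show where the option is supplied.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(['--pickle_library=cloudpickle'])

with beam.Pipeline(options=options) as p:
    (p
     | beam.Create(['hello', 'world'])
     | beam.Map(str.upper)   # user code that gets pickled and shipped to workers
     | beam.Map(print))
```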
