Hi Stephan,

Thanks for reaching out. We first considered switching to cloudpickle when
adding Python 3 support[1], and there is a tracking issue[2]. We were able
to fix or work around missing Py3 features in dill, although some are still
not working for us [3].
I agree that Beam can and should support cloudpickle as a pickler.
Practically, we can make cloudpickle the default pickler starting from a
particular Python version. For example, we are planning to add Python 3.9
support, and we could make cloudpickle the default pickler for that version,
which avoids breaking existing users while we iron out rough edges.
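A minimal sketch of what such version gating could look like (the function
name and the exact cutoff are hypothetical, not existing Beam code):

```python
import sys

def default_pickler_name(version_info=sys.version_info):
    # Hypothetical gating: keep dill as the default on existing Python
    # versions, and make cloudpickle the default only for newly supported
    # versions (3.9+), so current users are unaffected while rough edges
    # in cloudpickle support are ironed out.
    if tuple(version_info[:2]) >= (3, 9):
        return "cloudpickle"
    return "dill"
```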

My main concern is client-server version range compatibility of the
pickler. When the SDK creates the job representation, it serializes the
objects using the pickler installed on the user's machine. When the SDK
deserializes the objects on the runner side, it uses the pickler installed
on the runner, for example a dill version installed in the Docker container
provided by Beam or Dataflow. We have been burned in the past by having an
open version bound for the pickler in Beam's requirements: client side
would pick the newest version, but runner container would have a somewhat
older version, either because the container did not have the new version,
or because some pipeline dependency wanted to downgrade dill. Older
versions of the pickler did not correctly deserialize new pickles. I
suspect cloudpickle
may have the same problem. A solution was to have a very tight version
range for the pickler in SDK's requirements [4]. Given that dill is not a
popular dependency, the tight range did not create much friction for Beam
users. I think with cloudpickle we will not be able to have a tight range.
We
could solve this problem by passing the version of pickler used at job
submission, and have a check on the runner to make sure that the client
version is not newer than the runner's version. Additionally, we should
make sure cloudpickle is backwards compatible (newer version can
deserialize objects created by older version).
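A sketch of the submission-time check described above (all names are
hypothetical; real code would likely use packaging.version rather than the
minimal parser here):

```python
# Hypothetical client/runner pickler version check. The SDK would record
# the pickler version at job submission; the runner rejects jobs pickled
# with a newer pickler than it has installed, since older picklers may
# not correctly deserialize newer pickles.

def parse_version(v):
    # Minimal dotted-version parser for illustration only.
    return tuple(int(part) for part in v.split("."))

def check_pickler_compatibility(client_version, runner_version):
    if parse_version(client_version) > parse_version(runner_version):
        raise RuntimeError(
            "Job was pickled with pickler %s but the runner only has %s; "
            "older picklers may fail to deserialize newer pickles."
            % (client_version, runner_version))
```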

[1]
https://lists.apache.org/thread.html/d431664a3fc1039faa01c10e2075659288aec5961c7b4b59d9f7b889%40%3Cdev.beam.apache.org%3E
[2] https://issues.apache.org/jira/browse/BEAM-8123
[3] https://github.com/uqfoundation/dill/issues/300#issuecomment-525409202
[4]
https://github.com/apache/beam/blob/master/sdks/python/setup.py#L138-L143

On Thu, Apr 29, 2021 at 8:04 PM Stephan Hoyer <[email protected]> wrote:

> cloudpickle [1] and dill [2] are two Python packages that implement
> extensions of Python's pickle protocol for arbitrary objects. Beam
> currently uses dill, but I'm wondering if we could consider additionally
> or alternatively using cloudpickle instead.
>
> Overall, cloudpickle seems to be a more popular choice for extended pickle
> support in distributed computing in Python, e.g., it's used by Spark, Dask
> and joblib.
>
> One of the major differences between cloudpickle and dill is how they
> handle pickling global variables (such as Python modules) that are referred
> to by a function:
> - Dill doesn't serialize globals. If you want to save globals, you need to
> call dill.dump_session(). This is what the "save_main_session" flag does in
> Beam.
> - Cloudpickle takes a different approach. It introspects which global
> variables are used by a function, and creates a closure around the
> serialized function that only contains these variables.
>
> The cloudpickle approach results in larger serialized functions, but it's
> also much more robust, because the required globals are included by
> default. In contrast, with dill, one either needs to save *all* globals
> or none. This is a repeated pain-point for Beam Python users [3]:
> - Saving all globals can be overly aggressive, particularly in notebooks
> where users may have incidentally created large objects.
> - Alternatively, users can avoid using global variables entirely, but this
> makes defining ad-hoc pipelines very awkward. Mapped over functions need to
> be imported from other modules, or need to have their imports defined
> inside the function itself.
>
> I'd love to see an option to use cloudpickle in Beam instead of dill, and
> to consider switching over entirely. Cloudpickle would allow Beam users to
> write readable code in the way they expect, without needing to worry about
> the confusing and potentially problematic "save_main_session" flag.
>
> Any thoughts?
>
> Cheers,
> Stephan
>
> [1] https://github.com/cloudpipe/cloudpickle
> [2] https://github.com/uqfoundation/dill
> [3]
> https://cloud.google.com/dataflow/docs/resources/faq#how_do_i_handle_nameerrors
>
>
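To illustrate the globals introspection Stephan describes, here is a minimal
stdlib-only sketch of the idea (cloudpickle's real implementation also walks
nested code objects and handles many more cases; the helper name is mine):

```python
FACTOR = 10  # a module-level global referenced by the function below

def scale(x):
    return x * FACTOR

def referenced_globals(fn):
    # Simplified version of the introspection cloudpickle performs:
    # collect the names the function's code object references that are
    # present in its global namespace, so only those (not the entire
    # session) need to travel with the serialized function.
    return {name: fn.__globals__[name]
            for name in fn.__code__.co_names
            if name in fn.__globals__}

print(referenced_globals(scale))  # {'FACTOR': 10}
```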
