cloudpickle [1] and dill [2] are two Python packages that implement
extensions of Python's pickle protocol for arbitrary objects. Beam
currently uses dill, but I'm wondering if we could consider additionally or
alternatively use cloudpickle instead.

Overall, cloudpickle seems to be a more popular choice for extended pickle
support in distributing computing in Python, e.g., it's used by Spark, Dask
and joblib.

One of the major differences between cloudpickle and dill is how they
handle pickling global variables (such as Python modules) that are referred
to by a function:
- Dill doesn't serialize globals. If you want to save globals, you need to
call dill.dump_session(). This is what the "save_main_session" flag does in
Beam.
- Cloudpickle takes a different approach. It introspects which global
variables are used by a function, and creates a closure around the
serialized function that only contains these variables.

The cloudpickle approach results in larger serialized functions, but it's
also much more robust, because the required globals are included by
default. In contrast, with dill, one either needs to save *all *globals or
none. This is repeated pain-point for Beam Python users [3]:
- Saving all globals can be overly aggressive, particularly in notebooks
where users may have incidentally created large objects.
- Alternatively, users can avoid using global variables entirely, but this
makes defining ad-hoc pipelines very awkward. Mapped over functions need to
be imported from other modules, or need to have their imports defined
inside the function itself.

I'd love to see an option to use cloudpickle in Beam instead of dill, and
to consider switching over entirely. Cloudpickle would allow Beam users to
write readable code in the way they expect, without needing to worry about
the confusing and potentially problematic "save_main_session" flag.

Any thoughts?

Cheers,
Stephan

[1] https://github.com/cloudpipe/cloudpickle
[2] https://github.com/uqfoundation/dill
[3]
https://cloud.google.com/dataflow/docs/resources/faq#how_do_i_handle_nameerrors

Reply via email to