Does cloudpickle make --save_main_session unnecessary? As in, will more
transforms defined in __main__ "just work"?

If so, I can see why that's worthwhile. I've had a _ton_ of issues with
this, especially with new users of Beam at my company. Explaining the main
session, why random things throw unpickling errors, and why their
transform is throwing NameErrors has been a very painful experience,
especially since it usually happens during a user's first experience with
Beam.

On Tue, Apr 29, 2025, 6:14 PM Valentyn Tymofieiev via dev <
dev@beam.apache.org> wrote:

> There are several reasons:
>  - wide adoption in the data processing community, see initial discussion: [1]
>  - an expectation that cloudpickle has a larger number of maintainers and
> contributors.
>  - new releases of dill had breaking changes[2], which made adoption of a
> new version challenging.
>  - cloudpickle is easier to vendor - it is a single file and unlike dill,
> does not create side-effects in the global namespace, which might conflict
> with any unvendored version. Vendoring eliminates a common failure
> mode where the pickler library differs between submission and runtime.
>  - previously, some bugs and feature requests Beam requested in dill took
> a long time to be implemented and released.
>
> [1] https://lists.apache.org/thread/dvxvclhok0fx48955x6szvw4kotxh87n
> [2] https://github.com/apache/beam/issues/22893#issuecomment-1502354194
>
> On Mon, Apr 28, 2025 at 4:00 PM Joey Tran <joey.t...@schrodinger.com>
> wrote:
>
>> Naive question, but why is beam upgrading to cloudpickle?
>>
>> I saw this doc:
>>
>> https://docs.google.com/document/d/1G5Q0ckX5sKQRQD1yEkLCPQL7N6B-AL9Cb1p0zlOOfQU/edit?tab=t.0
>>
>> Is the main reason because cloudpickle is more actively maintained?
>>
>>
>> On Mon, Apr 28, 2025 at 6:51 PM Claudius van der Merwe <
>> claud...@vdmza.com> wrote:
>>
>>> Hi Beam Devs,
>>>
>>> I am making progress on making cloudpickle the default pickling library
>>> and removing the strict dependency on dill as outlined in
>>> https://s.apache.org/beam-cloudpickle-next-steps.
>>>
>>> The current plan is to:
>>>
>>> 1. Make cloudpickle the default library in Beam 2.65.0 release (see
>>> https://github.com/apache/beam/pull/34695). Users will be able to
>>> specify pickle_library='dill' without any additional requirements. There
>>> will still be a hard dependency on dill (blocked by #2) but it is a step in
>>> the right direction.
>>>
>>> 2. Remove the strict dependency on dill in Beam 2.66.0 release. Dill is
>>> directly used for coder's encoding types in FastPrimitivesCoderImpl [1][2].
>>> I prefer to submit a fix for this after the branch cut so we have more time
>>> to identify any issues.
>>>
>>> Cloudpickle has some fundamentally different pickling behavior to dill
>>> that is likely to break:
>>>
>>>    - Unittests that rely on globals
>>>       - This can be fixed by using apache_beam.utils.shared [3]
>>>    - Closures and dynamic classes that reference unpicklable globals
>>>       - This can be fixed by defining functions in the top level, and
>>>         using functools.partial to bind parameters if necessary
>>>
>>>
>>> [1]
>>> https://github.com/apache/beam/blob/b9fa49a9827dd28349e382f479ebd1a8bbe27d07/sdks/python/apache_beam/coders/coder_impl.py#L529
>>>
>>> [2]
>>> https://github.com/apache/beam/blob/b9fa49a9827dd28349e382f479ebd1a8bbe27d07/sdks/python/apache_beam/coders/coder_impl.py#L595
>>>
>>> [3]
>>> https://github.com/apache/beam/blob/b9fa49a9827dd28349e382f479ebd1a8bbe27d07/sdks/python/apache_beam/internal/cloudpickle_pickler_test.py#L54
>>>
>>>
>>> I'd appreciate any feedback or concerns.
>>>
>>>
>>> Best,
>>>
>>> Claude
>>>
>>>
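For anyone skimming the quoted plan: the second suggested fix (module-level
functions plus functools.partial instead of closures) might look roughly like
the sketch below. This is only an illustration under assumptions: it uses the
standard library pickler as a stand-in rather than Beam's pickler, and
make_multiplier and triple are hypothetical names, not anything from Beam.

```python
import functools
import operator
import pickle


def make_multiplier(factor):
    # Closure style: the nested function captures `factor` from the
    # enclosing scope. This is the pattern the quoted message suggests
    # replacing when the captured state is hard to pickle.
    def multiply(x):
        return factor * x
    return multiply


# Top-level style: operator.mul is defined at module level in the standard
# library, and functools.partial binds the parameter explicitly.
triple = functools.partial(operator.mul, 3)

# The pickler stores the function by reference together with the bound
# argument, so no enclosing scope has to be serialized.
restored = pickle.loads(pickle.dumps(triple))
print(restored(5))  # prints 15
```

The bound arguments still have to be picklable themselves; partial only
makes explicit what gets shipped with the function.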
