There are several reasons:
 - cloudpickle is widely adopted in the data processing community; see the
initial discussion [1].
 - cloudpickle is expected to have a larger number of maintainers and
contributors.
 - New releases of dill had breaking changes [2], which made adopting a
new version challenging.
 - cloudpickle is easier to vendor: it is a single file and, unlike dill,
does not create side effects in the global namespace that might conflict
with an unvendored version. Vendoring eliminates a common failure
mode in which the pickler library differs between submission and runtime.
 - Previously, some bug fixes and features that Beam requested in dill took
a long time to be implemented and released.

[1] https://lists.apache.org/thread/dvxvclhok0fx48955x6szvw4kotxh87n
[2] https://github.com/apache/beam/issues/22893#issuecomment-1502354194

On Mon, Apr 28, 2025 at 4:00 PM Joey Tran <joey.t...@schrodinger.com> wrote:

> Naive question, but why is beam upgrading to cloudpickle?
>
> I saw this doc:
>
> https://docs.google.com/document/d/1G5Q0ckX5sKQRQD1yEkLCPQL7N6B-AL9Cb1p0zlOOfQU/edit?tab=t.0
>
> Is the main reason because cloudpickle is more actively maintained?
>
>
> On Mon, Apr 28, 2025 at 6:51 PM Claudius van der Merwe <claud...@vdmza.com>
> wrote:
>
>> Hi Beam Devs,
>>
>> I am making progress on making cloudpickle the default pickling library
>> and removing the strict dependency on dill as outlined in
>> https://s.apache.org/beam-cloudpickle-next-steps.
>>
>> The current plan is to:
>>
>> 1. Make cloudpickle the default library in Beam 2.65.0 release (see
>> https://github.com/apache/beam/pull/34695). Users will be able to
>> specify pickle_library='dill' without any additional requirements. There
>> will still be a hard dependency on dill (blocked by step 2 below), but it
>> is a step in the right direction.
>>
>> 2. Remove the strict dependency on dill in Beam 2.66.0 release. Dill is
>> directly used for coder's encoding types in FastPrimitivesCoderImpl [1][2].
>> I prefer to submit a fix for this after the branch cut so we have more time
>> to identify any issues.
>>
>> cloudpickle has some fundamentally different pickling behavior from dill
>> that is likely to break:
>>
>>    - Unit tests that rely on globals
>>       - This can be fixed by using apache_beam.utils.shared [3]
>>    - Closures and dynamic classes that reference unpicklable globals
>>       - This can be fixed by defining functions at the top level and
>>         using functools.partial to bind parameters if necessary
>>
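
To illustrate the functools.partial suggestion above, here is a minimal
sketch. It uses the standard-library pickle module as a stand-in so it runs
without Beam or cloudpickle installed; cloudpickle's handling of closures and
globals differs, so treat this only as a picture of the pattern. The names
scale and triple are hypothetical and do not come from Beam.

```python
import functools
import pickle


def scale(factor, value):
    """Top-level function: a pickler can refer to it by module and name."""
    return factor * value


# Bind the parameter explicitly instead of capturing it in a closure
# (for example, instead of `lambda value: factor * value`).
triple = functools.partial(scale, 3)

# Round-trip through pickle. The payload carries a reference to `scale`
# plus the bound argument, not the function's surrounding global state.
restored = pickle.loads(pickle.dumps(triple))
result = restored(14)
print(result)
```

Because the callable is a top-level function plus explicitly bound
arguments, nothing from the enclosing scope has to be serialized with it.
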
>>
>> [1]
>> https://github.com/apache/beam/blob/b9fa49a9827dd28349e382f479ebd1a8bbe27d07/sdks/python/apache_beam/coders/coder_impl.py#L529
>>
>> [2]
>> https://github.com/apache/beam/blob/b9fa49a9827dd28349e382f479ebd1a8bbe27d07/sdks/python/apache_beam/coders/coder_impl.py#L595
>>
>> [3]
>> https://github.com/apache/beam/blob/b9fa49a9827dd28349e382f479ebd1a8bbe27d07/sdks/python/apache_beam/internal/cloudpickle_pickler_test.py#L54
>>
>>
>> I'd appreciate any feedback or concerns.
>>
>>
>> Best,
>>
>> Claude
>>
>>
