On Tue, Apr 29, 2025 at 7:51 PM Joey Tran <joey.t...@schrodinger.com> wrote:
>
> Does cloudpickle make --save_main_session unnecessary? As in, will more 
> transforms defined in __main__ "just work"?

Yes. Or at least it "just works" much more often. (There may still be
corner cases, but I haven't run into them...)

I, for one, am excited to see this change. Thanks, Claude, for taking
the lead on this.

> If so, I can see why that's worthwhile. I've had a _ton_ of issues with this,
> especially with new users of Beam at my company. Explaining the main session
> and why random things throw unpickling errors or why their transform is
> throwing NameErrors has been a very painful experience, especially since it
> usually happens during users' first experiences.
>
> On Tue, Apr 29, 2025, 6:14 PM Valentyn Tymofieiev via dev 
> <dev@beam.apache.org> wrote:
>>
>> There are several reasons:
>>  - Wide adoption in the data processing community; see the initial
>> discussion [1].
>>  - The expectation that cloudpickle has a larger number of maintainers and
>> contributors.
>>  - New releases of dill had breaking changes [2], which made adopting a
>> new version challenging.
>>  - cloudpickle is easier to vendor: it is a single file and, unlike dill,
>> does not create side effects in the global namespace that might conflict
>> with an unvendored version. Vendoring eliminates a common failure
>> mode in which the pickler library differs between submission and runtime.
>>  - Previously, some bug fixes and features that Beam requested in dill took
>> a long time to be implemented and released.
>>
>> [1] https://lists.apache.org/thread/dvxvclhok0fx48955x6szvw4kotxh87n
>> [2] https://github.com/apache/beam/issues/22893#issuecomment-1502354194
>>
>> On Mon, Apr 28, 2025 at 4:00 PM Joey Tran <joey.t...@schrodinger.com> wrote:
>>>
>>> Naive question, but why is beam upgrading to cloudpickle?
>>>
>>> I saw this doc:
>>> https://docs.google.com/document/d/1G5Q0ckX5sKQRQD1yEkLCPQL7N6B-AL9Cb1p0zlOOfQU/edit?tab=t.0
>>>
>>> Is the main reason that cloudpickle is more actively maintained?
>>>
>>>
>>> On Mon, Apr 28, 2025 at 6:51 PM Claudius van der Merwe <claud...@vdmza.com> 
>>> wrote:
>>>>
>>>> Hi Beam Devs,
>>>>
>>>>
>>>> I am making progress on making cloudpickle the default pickling library 
>>>> and removing the strict dependency on dill as outlined in 
>>>> https://s.apache.org/beam-cloudpickle-next-steps.
>>>>
>>>>
>>>> The current plan is to:
>>>>
>>>>
>>>> 1. Make cloudpickle the default library in the Beam 2.65.0 release (see
>>>> https://github.com/apache/beam/pull/34695). Users will still be able to
>>>> specify pickle_library='dill' without any additional requirements. There
>>>> will still be a hard dependency on dill (removing it is blocked by step 2),
>>>> but it is a step in the right direction.
>>>>
>>>>
>>>> 2. Remove the strict dependency on dill in the Beam 2.66.0 release. Dill is
>>>> used directly for the coder's encoding types in FastPrimitivesCoderImpl
>>>> [1][2]. I prefer to submit a fix for this after the branch cut so we have
>>>> more time to identify any issues.
>>>>
>>>>
>>>> Cloudpickle's pickling behavior differs fundamentally from dill's in ways
>>>> that are likely to break:
>>>>
>>>>  - Unit tests that rely on globals. This can be fixed by using
>>>> apache_beam.utils.shared [3].
>>>>
>>>>  - Closures and dynamic classes that reference unpicklable globals. This
>>>> can be fixed by defining functions at the top level and using
>>>> functools.partial to bind parameters if necessary.
>>>>
>>>>
>>>> [1] 
>>>> https://github.com/apache/beam/blob/b9fa49a9827dd28349e382f479ebd1a8bbe27d07/sdks/python/apache_beam/coders/coder_impl.py#L529
>>>>
>>>> [2] 
>>>> https://github.com/apache/beam/blob/b9fa49a9827dd28349e382f479ebd1a8bbe27d07/sdks/python/apache_beam/coders/coder_impl.py#L595
>>>>
>>>> [3] 
>>>> https://github.com/apache/beam/blob/b9fa49a9827dd28349e382f479ebd1a8bbe27d07/sdks/python/apache_beam/internal/cloudpickle_pickler_test.py#L54
>>>>
>>>>
>>>> I'd appreciate any feedback or concerns.
>>>>
>>>>
>>>> Best,
>>>>
>>>> Claude
>>>>
>>>>
