I filed https://issues.apache.org/jira/browse/BEAM-12418 for this. Would
you have any interest in taking it on?
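
For anyone picking this up, here is a rough sketch of the round trip
described further down the thread: pass "buffer_callback" to pickle.dumps()
and "buffers" to pickle.loads(). This is not Beam's PickleCoder. The helper
names encode_out_of_band and decode_out_of_band are made up for
illustration, and a bytearray wrapped in pickle.PickleBuffer stands in for
a NumPy array. It assumes Python 3.8+.

```python
import pickle


def encode_out_of_band(value):
    """Hypothetical helper: pickle `value` with protocol 5.

    Buffers exposed via pickle.PickleBuffer are collected in a list
    instead of being copied into the pickle stream.
    """
    buffers = []
    # list.append returns None (falsy), which tells pickle to keep the
    # buffer out-of-band rather than serialize it in-band.
    payload = pickle.dumps(value, protocol=5, buffer_callback=buffers.append)
    return payload, buffers


def decode_out_of_band(payload, buffers):
    """Hypothetical helper: rebuild the object from payload plus buffers."""
    return pickle.loads(payload, buffers=buffers)


if __name__ == "__main__":
    # Stand-in for array data; NumPy arrays expose their memory similarly.
    data = bytearray(b"x" * 4096)
    element = {"name": "a", "values": pickle.PickleBuffer(data)}

    payload, buffers = encode_out_of_band(element)
    restored = decode_out_of_band(payload, buffers)

    # The pickle stream carries only metadata; the 4096 bytes travel
    # separately in `buffers`.
    print(len(payload) < len(data), len(buffers))
    print(bytes(restored["values"]) == bytes(data))
```

A real coder would still have to frame the payload and the buffers into
whatever byte stream the runner expects, which this sketch does not attempt.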

On Tue, May 25, 2021 at 3:09 PM Brian Hulette <[email protected]> wrote:

> Hm, this would definitely be of interest for the DataFrame API, which is
> shuffling pandas objects. This issue [1] confirms what you suggested above,
> that pandas supports out-of-band pickling since DataFrames are mostly just
> collections of numpy arrays.
>
> Brian
>
> [1] https://github.com/pandas-dev/pandas/issues/34244
>
> On Tue, May 25, 2021 at 2:59 PM Stephan Hoyer <[email protected]> wrote:
>
>> Beam's PickleCoder would need to be updated to pass the "buffer_callback"
>> argument into pickle.dumps() and the "buffers" argument into
>> pickle.loads(). I expect this would be relatively straightforward.
>>
>> Then it should "just work", assuming that data is stored in objects (like
>> NumPy arrays or wrappers of NumPy arrays) that implement the out-of-band
>> Pickle protocol.
>>
>>
>> On Tue, May 25, 2021 at 2:50 PM Brian Hulette <[email protected]>
>> wrote:
>>
>>> I'm not aware of anyone looking at it.
>>>
>>> Will out-of-band pickling "just work" in Beam for types that implement
>>> the correct interface in Python 3.8?
>>>
>>> On Tue, May 25, 2021 at 2:43 PM Evan Galpin <[email protected]>
>>> wrote:
>>>
>>>> +1
>>>>
>>>> FWIW I recently ran into the exact case you described (high
>>>> serialization cost). The solution was to implement some not-so-intuitive
>>>> alternative transforms in my case, but I would have very much appreciated
>>>> faster serialization performance.
>>>>
>>>> Thanks,
>>>> Evan
>>>>
>>>> On Tue, May 25, 2021 at 15:26 Stephan Hoyer <[email protected]> wrote:
>>>>
>>>>> Has anyone looked into out-of-band pickling for Beam's Python SDK,
>>>>> i.e., Pickle protocol version 5?
>>>>> https://www.python.org/dev/peps/pep-0574/
>>>>> https://docs.python.org/3/library/pickle.html#out-of-band-buffers
>>>>>
>>>>> For Beam pipelines passing around NumPy arrays (or collections of
>>>>> NumPy arrays, like pandas or Xarray) I've noticed that serialization costs
>>>>> can be significant. Beam seems to currently incur at least one (maybe
>>>>> two) unnecessary memory copies.
>>>>>
>>>>> Pickle protocol version 5 exists for solving exactly this problem. You
>>>>> can serialize collections of arbitrary Python objects in a fully streaming
>>>>> fashion using memory buffers. This is a Python 3.8 feature, but the
>>>>> "pickle5" library provides a backport to Python 3.6 and 3.7. It has been
>>>>> supported by NumPy since version 1.16, released in January 2019.
>>>>>
>>>>> Cheers,
>>>>> Stephan
>>>>>
>>>>
