Hm this would definitely be of interest for the DataFrame API, which is
shuffling pandas objects. This issue [1] confirms what you suggested above,
that pandas supports out-of-band pickling since DataFrames are mostly just
collections of numpy arrays.

Brian

[1] https://github.com/pandas-dev/pandas/issues/34244

On Tue, May 25, 2021 at 2:59 PM Stephan Hoyer <[email protected]> wrote:

> Beam's PickleCoder would need to be updated to pass the "buffer_callback"
> argument into pickle.dumps() and the "buffers" argument into
> pickle.loads(). I expect this would be relatively straightforward.
>
> Then it should "just work", assuming that data is stored in objects (like
> NumPy arrays or wrappers of NumPy arrays) that implement the out-of-band
> Pickle protocol.
>
>
> On Tue, May 25, 2021 at 2:50 PM Brian Hulette <[email protected]> wrote:
>
>> I'm not aware of anyone looking at it.
>>
>> Will out-of-band pickling "just work" in Beam for types that implement
>> the correct interface in Python 3.8?
>>
>> On Tue, May 25, 2021 at 2:43 PM Evan Galpin <[email protected]>
>> wrote:
>>
>>> +1
>>>
>>> FWIW I recently ran into the exact case you described (high
>>> serialization cost). The solution was to implement some not-so-intuitive
>>> alternative transforms in my case, but I would have very much appreciated
>>> faster serialization performance.
>>>
>>> Thanks,
>>> Evan
>>>
>>> On Tue, May 25, 2021 at 15:26 Stephan Hoyer <[email protected]> wrote:
>>>
>>>> Has anyone looked into out of band pickling for Beam's Python SDK,
>>>> i.e., Pickle protocol version 5?
>>>> https://www.python.org/dev/peps/pep-0574/
>>>> https://docs.python.org/3/library/pickle.html#out-of-band-buffers
>>>>
>>>> For Beam pipelines passing around NumPy arrays (or collections of NumPy
>>>> arrays, like pandas or Xarray) I've noticed that serialization costs can be
>>>> significant. Beam seems to currently incur at least one one (maybe two)
>>>> unnecessary memory copies.
>>>>
>>>> Pickle protocol version 5 exists for solving exactly this problem. You
>>>> can serialize collections of arbitrary Python objects in a fully streaming
>>>> fashion using memory buffers. This is a Python 3.8 feature, but the
>>>> "pickle5" library provides a backport to Python 3.6 and 3.7. It has been
>>>> supported by NumPy since version 1.16, released in January 2019.
>>>>
>>>> Cheers,
>>>> Stephan
>>>>
>>>

Reply via email to