I'm unlikely to have bandwidth to take this one on, but I do think it would
be quite valuable!

On Thu, May 27, 2021 at 4:42 PM Brian Hulette <[email protected]> wrote:

> I filed https://issues.apache.org/jira/browse/BEAM-12418 for this. Would
> you have any interest in taking it on?
>
> On Tue, May 25, 2021 at 3:09 PM Brian Hulette <[email protected]> wrote:
>
>> Hm this would definitely be of interest for the DataFrame API, which is
>> shuffling pandas objects. This issue [1] confirms what you suggested above,
>> that pandas supports out-of-band pickling since DataFrames are mostly just
>> collections of numpy arrays.
>>
>> Brian
>>
>> [1] https://github.com/pandas-dev/pandas/issues/34244
>>
>> On Tue, May 25, 2021 at 2:59 PM Stephan Hoyer <[email protected]> wrote:
>>
>>> Beam's PickleCoder would need to be updated to pass the
>>> "buffer_callback" argument into pickle.dumps() and the "buffers" argument
>>> into pickle.loads(). I expect this would be relatively straightforward.
>>>
>>> Then it should "just work", assuming that data is stored in objects
>>> (like NumPy arrays or wrappers of NumPy arrays) that implement the
>>> out-of-band Pickle protocol.
>>>
>>>
>>> On Tue, May 25, 2021 at 2:50 PM Brian Hulette <[email protected]>
>>> wrote:
>>>
>>>> I'm not aware of anyone looking at it.
>>>>
>>>> Will out-of-band pickling "just work" in Beam for types that implement
>>>> the correct interface in Python 3.8?
>>>>
>>>> On Tue, May 25, 2021 at 2:43 PM Evan Galpin <[email protected]>
>>>> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> FWIW I recently ran into the exact case you described (high
>>>>> serialization cost). The solution was to implement some not-so-intuitive
>>>>> alternative transforms in my case, but I would have very much appreciated
>>>>> faster serialization performance.
>>>>>
>>>>> Thanks,
>>>>> Evan
>>>>>
>>>>> On Tue, May 25, 2021 at 15:26 Stephan Hoyer <[email protected]> wrote:
>>>>>
>>>>>> Has anyone looked into out-of-band pickling for Beam's Python SDK,
>>>>>> i.e., Pickle protocol version 5?
>>>>>> https://www.python.org/dev/peps/pep-0574/
>>>>>> https://docs.python.org/3/library/pickle.html#out-of-band-buffers
>>>>>>
>>>>>> For Beam pipelines passing around NumPy arrays (or collections of
>>>>>> NumPy arrays, like pandas or Xarray) I've noticed that serialization
>>>>>> costs can be significant. Beam seems to currently incur at least one
>>>>>> (maybe two) unnecessary memory copies.
>>>>>>
>>>>>> Pickle protocol version 5 exists for solving exactly this problem.
>>>>>> You can serialize collections of arbitrary Python objects in a fully
>>>>>> streaming fashion using memory buffers. This is a Python 3.8 feature, but
>>>>>> the "pickle5" library provides a backport to Python 3.6 and 3.7. It has
>>>>>> been supported by NumPy since version 1.16, released in January 2019.
>>>>>>
>>>>>> Cheers,
>>>>>> Stephan
>>>>>>
>>>>>
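
For readers of the archive, the change discussed in the thread (passing
"buffer_callback" into pickle.dumps() and "buffers" into pickle.loads()) can
be sketched roughly as follows. This is only an illustration under stated
assumptions, not Beam's actual PickleCoder: the class name
OutOfBandPickleCoder and the length-prefixed framing are hypothetical, and a
bare pickle.PickleBuffer stands in for a NumPy array, which exposes its
memory to the pickler in a similar way. Requires Python 3.8+.

```python
import pickle
import struct


class OutOfBandPickleCoder:
    """Hypothetical coder sketch; NOT Beam's real PickleCoder."""

    def encode(self, value) -> bytes:
        buffers = []
        # With protocol 5, large buffers are handed to the callback
        # instead of being copied into the pickle stream.
        payload = pickle.dumps(value, protocol=5, buffer_callback=buffers.append)

        # Assumed framing: buffer count, then (length, raw bytes) per
        # buffer, then the pickle payload itself.
        parts = [struct.pack("<I", len(buffers))]
        for buf in buffers:
            raw = buf.raw()  # flat view; raises BufferError if non-contiguous
            parts.append(struct.pack("<Q", raw.nbytes))
            parts.append(raw)
        parts.append(payload)
        # Joining into one bytes object still copies once; a real coder
        # would more likely write each part straight to its output stream.
        return b"".join(parts)

    def decode(self, encoded: bytes):
        view = memoryview(encoded)
        (count,) = struct.unpack_from("<I", view, 0)
        offset = 4
        buffers = []
        for _ in range(count):
            (size,) = struct.unpack_from("<Q", view, offset)
            offset += 8
            # Slicing a memoryview does not copy the underlying bytes.
            buffers.append(view[offset:offset + size])
            offset += size
        return pickle.loads(view[offset:], buffers=buffers)


# Usage: PickleBuffer stands in for an object such as a NumPy array.
data = bytearray(b"x" * 1024)
coder = OutOfBandPickleCoder()
encoded = coder.encode({"array": pickle.PickleBuffer(data), "n": 3})
decoded = coder.decode(encoded)
```

On decode, the out-of-band value comes back as a view over the encoded
bytes rather than a fresh copy, which is where the savings mentioned in the
thread would come from. How this would fit into Beam's coder and stream
interfaces is left open here.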
