I'm unlikely to have bandwidth to take this one on, but I do think it would be quite valuable!

On Thu, May 27, 2021 at 4:42 PM Brian Hulette <[email protected]> wrote:

> I filed https://issues.apache.org/jira/browse/BEAM-12418 for this. Would
> you have any interest in taking it on?
>
> On Tue, May 25, 2021 at 3:09 PM Brian Hulette <[email protected]> wrote:
>
>> Hm, this would definitely be of interest for the DataFrame API, which is
>> shuffling pandas objects. This issue [1] confirms what you suggested above,
>> that pandas supports out-of-band pickling since DataFrames are mostly just
>> collections of numpy arrays.
>>
>> Brian
>>
>> [1] https://github.com/pandas-dev/pandas/issues/34244
>>
>> On Tue, May 25, 2021 at 2:59 PM Stephan Hoyer <[email protected]> wrote:
>>
>>> Beam's PickleCoder would need to be updated to pass the
>>> "buffer_callback" argument into pickle.dumps() and the "buffers" argument
>>> into pickle.loads(). I expect this would be relatively straightforward.
>>>
>>> Then it should "just work", assuming that data is stored in objects
>>> (like NumPy arrays or wrappers of NumPy arrays) that implement the
>>> out-of-band Pickle protocol.
>>>
>>> On Tue, May 25, 2021 at 2:50 PM Brian Hulette <[email protected]>
>>> wrote:
>>>
>>>> I'm not aware of anyone looking at it.
>>>>
>>>> Will out-of-band pickling "just work" in Beam for types that implement
>>>> the correct interface in Python 3.8?
>>>>
>>>> On Tue, May 25, 2021 at 2:43 PM Evan Galpin <[email protected]>
>>>> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> FWIW I recently ran into the exact case you described (high
>>>>> serialization cost). The solution was to implement some not-so-intuitive
>>>>> alternative transforms in my case, but I would have very much appreciated
>>>>> faster serialization performance.
>>>>>
>>>>> Thanks,
>>>>> Evan
>>>>>
>>>>> On Tue, May 25, 2021 at 15:26 Stephan Hoyer <[email protected]> wrote:
>>>>>
>>>>>> Has anyone looked into out-of-band pickling for Beam's Python SDK,
>>>>>> i.e., Pickle protocol version 5?
>>>>>> https://www.python.org/dev/peps/pep-0574/
>>>>>> https://docs.python.org/3/library/pickle.html#out-of-band-buffers
>>>>>>
>>>>>> For Beam pipelines passing around NumPy arrays (or collections of
>>>>>> NumPy arrays, like pandas or Xarray) I've noticed that serialization
>>>>>> costs can be significant. Beam seems to currently incur at least one
>>>>>> (maybe two) unnecessary memory copies.
>>>>>>
>>>>>> Pickle protocol version 5 exists for solving exactly this problem.
>>>>>> You can serialize collections of arbitrary Python objects in a fully
>>>>>> streaming fashion using memory buffers. This is a Python 3.8 feature, but
>>>>>> the "pickle5" library provides a backport to Python 3.6 and 3.7. It has
>>>>>> been supported by NumPy since version 1.16, released in January 2019.
>>>>>>
>>>>>> Cheers,
>>>>>> Stephan
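
For anyone picking up BEAM-12418, here is a rough sketch of the "buffer_callback" / "buffers" round trip discussed in this thread, using only the standard library (Python 3.8+). It is not Beam code: encode_out_of_band() and decode_out_of_band() are hypothetical helper names, and a pickle.PickleBuffer wrapped around a bytearray stands in for the memory that a NumPy array would expose on its own. How the out-of-band buffers would actually travel alongside the pickle stream inside a Beam coder is left open here.

```python
import pickle


def encode_out_of_band(value):
    """Hypothetical helper: pickle `value`, collecting large buffers separately.

    Returns the (small) pickle stream plus the list of out-of-band buffers.
    """
    buffers = []
    # list.append returns None, which tells pickle to keep the buffer
    # out-of-band instead of copying it into the stream.
    stream = pickle.dumps(value, protocol=5, buffer_callback=buffers.append)
    return stream, buffers


def decode_out_of_band(stream, buffers):
    """Hypothetical helper: rebuild the value from stream + buffers."""
    return pickle.loads(stream, buffers=buffers)


# Stand-in for a large array payload (assumption: a NumPy array would hand
# its memory to pickle in a comparable way under protocol 5).
raw = bytearray(b"\x00" * 1_000_000)
record = {"key": "chunk-0", "data": pickle.PickleBuffer(raw)}

stream, buffers = encode_out_of_band(record)
restored = decode_out_of_band(stream, buffers)

# For comparison: without buffer_callback the payload is copied in-band.
in_band = pickle.dumps(record, protocol=5)

print(len(stream), len(in_band), len(buffers))
```

The out-of-band stream only carries the dict structure and a placeholder for the buffer, so the megabyte of payload is never copied into it; the receiving side has to supply the same buffers, in the same order, to pickle.loads().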
