I filed https://issues.apache.org/jira/browse/BEAM-12418 for this. Would you have any interest in taking it on?
On Tue, May 25, 2021 at 3:09 PM Brian Hulette <[email protected]> wrote: > Hm this would definitely be of interest for the DataFrame API, which is > shuffling pandas objects. This issue [1] confirms what you suggested above, > that pandas supports out-of-band pickling since DataFrames are mostly just > collections of numpy arrays. > > Brian > > [1] https://github.com/pandas-dev/pandas/issues/34244 > > On Tue, May 25, 2021 at 2:59 PM Stephan Hoyer <[email protected]> wrote: > >> Beam's PickleCoder would need to be updated to pass the "buffer_callback" >> argument into pickle.dumps() and the "buffers" argument into >> pickle.loads(). I expect this would be relatively straightforward. >> >> Then it should "just work", assuming that data is stored in objects (like >> NumPy arrays or wrappers of NumPy arrays) that implement the out-of-band >> Pickle protocol. >> >> >> On Tue, May 25, 2021 at 2:50 PM Brian Hulette <[email protected]> >> wrote: >> >>> I'm not aware of anyone looking at it. >>> >>> Will out-of-band pickling "just work" in Beam for types that implement >>> the correct interface in Python 3.8? >>> >>> On Tue, May 25, 2021 at 2:43 PM Evan Galpin <[email protected]> >>> wrote: >>> >>>> +1 >>>> >>>> FWIW I recently ran into the exact case you described (high >>>> serialization cost). The solution was to implement some not-so-intuitive >>>> alternative transforms in my case, but I would have very much appreciated >>>> faster serialization performance. >>>> >>>> Thanks, >>>> Evan >>>> >>>> On Tue, May 25, 2021 at 15:26 Stephan Hoyer <[email protected]> wrote: >>>> >>>>> Has anyone looked into out of band pickling for Beam's Python SDK, >>>>> i.e., Pickle protocol version 5? >>>>> https://www.python.org/dev/peps/pep-0574/ >>>>> https://docs.python.org/3/library/pickle.html#out-of-band-buffers >>>>> >>>>> For Beam pipelines passing around NumPy arrays (or collections of >>>>> NumPy arrays, like pandas or Xarray) I've noticed that serialization costs >>>>> can be significant. Beam seems to currently incur at least one one (maybe >>>>> two) unnecessary memory copies. >>>>> >>>>> Pickle protocol version 5 exists for solving exactly this problem. You >>>>> can serialize collections of arbitrary Python objects in a fully streaming >>>>> fashion using memory buffers. This is a Python 3.8 feature, but the >>>>> "pickle5" library provides a backport to Python 3.6 and 3.7. It has been >>>>> supported by NumPy since version 1.16, released in January 2019. >>>>> >>>>> Cheers, >>>>> Stephan >>>>> >>>>
