Hm this would definitely be of interest for the DataFrame API, which is shuffling pandas objects. This issue [1] confirms what you suggested above, that pandas supports out-of-band pickling since DataFrames are mostly just collections of numpy arrays.
Brian [1] https://github.com/pandas-dev/pandas/issues/34244 On Tue, May 25, 2021 at 2:59 PM Stephan Hoyer <[email protected]> wrote: > Beam's PickleCoder would need to be updated to pass the "buffer_callback" > argument into pickle.dumps() and the "buffers" argument into > pickle.loads(). I expect this would be relatively straightforward. > > Then it should "just work", assuming that data is stored in objects (like > NumPy arrays or wrappers of NumPy arrays) that implement the out-of-band > Pickle protocol. > > > On Tue, May 25, 2021 at 2:50 PM Brian Hulette <[email protected]> wrote: > >> I'm not aware of anyone looking at it. >> >> Will out-of-band pickling "just work" in Beam for types that implement >> the correct interface in Python 3.8? >> >> On Tue, May 25, 2021 at 2:43 PM Evan Galpin <[email protected]> >> wrote: >> >>> +1 >>> >>> FWIW I recently ran into the exact case you described (high >>> serialization cost). The solution was to implement some not-so-intuitive >>> alternative transforms in my case, but I would have very much appreciated >>> faster serialization performance. >>> >>> Thanks, >>> Evan >>> >>> On Tue, May 25, 2021 at 15:26 Stephan Hoyer <[email protected]> wrote: >>> >>>> Has anyone looked into out of band pickling for Beam's Python SDK, >>>> i.e., Pickle protocol version 5? >>>> https://www.python.org/dev/peps/pep-0574/ >>>> https://docs.python.org/3/library/pickle.html#out-of-band-buffers >>>> >>>> For Beam pipelines passing around NumPy arrays (or collections of NumPy >>>> arrays, like pandas or Xarray) I've noticed that serialization costs can be >>>> significant. Beam seems to currently incur at least one one (maybe two) >>>> unnecessary memory copies. >>>> >>>> Pickle protocol version 5 exists for solving exactly this problem. You >>>> can serialize collections of arbitrary Python objects in a fully streaming >>>> fashion using memory buffers. This is a Python 3.8 feature, but the >>>> "pickle5" library provides a backport to Python 3.6 and 3.7. It has been >>>> supported by NumPy since version 1.16, released in January 2019. >>>> >>>> Cheers, >>>> Stephan >>>> >>>
