I'm not aware of anyone looking at it. Will out-of-band pickling "just work" in Beam for types that implement the correct interface in Python 3.8?
On Tue, May 25, 2021 at 2:43 PM Evan Galpin <[email protected]> wrote: > +1 > > FWIW I recently ran into the exact case you described (high serialization > cost). The solution was to implement some not-so-intuitive alternative > transforms in my case, but I would have very much appreciated faster > serialization performance. > > Thanks, > Evan > > On Tue, May 25, 2021 at 15:26 Stephan Hoyer <[email protected]> wrote: > >> Has anyone looked into out of band pickling for Beam's Python SDK, i.e., >> Pickle protocol version 5? >> https://www.python.org/dev/peps/pep-0574/ >> https://docs.python.org/3/library/pickle.html#out-of-band-buffers >> >> For Beam pipelines passing around NumPy arrays (or collections of NumPy >> arrays, like pandas or Xarray) I've noticed that serialization costs can be >> significant. Beam seems to currently incur at least one one (maybe two) >> unnecessary memory copies. >> >> Pickle protocol version 5 exists for solving exactly this problem. You >> can serialize collections of arbitrary Python objects in a fully streaming >> fashion using memory buffers. This is a Python 3.8 feature, but the >> "pickle5" library provides a backport to Python 3.6 and 3.7. It has been >> supported by NumPy since version 1.16, released in January 2019. >> >> Cheers, >> Stephan >> >
