jorisvandenbossche edited a comment on issue #11239:
URL: https://github.com/apache/arrow/issues/11239#issuecomment-930053941


   One of the main advantages of the pickle protocol 5 is that it supports 
out-of-band buffers 
(https://docs.python.org/3/library/pickle.html#out-of-band-buffers, 
https://www.python.org/dev/peps/pep-0574/). This means that large buffers (i.e. 
the actual array data of the numpy arrays in this case) can be handled 
separately. 
   
   Using your example of the Persons class holding a numpy array, and adding a 
toy example of using out-of-band buffers:
   
   ```python
   # with pyrarow.(se)serialize
   persons_serialized = pa.serialize(PERSONS, context=context).to_buffer()
   %timeit pa.deserialize(persons_serialized, context=context)
   ...
   439 µs ± 23.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
   
   # with pickle 5
   persons_pickled = pickle.dumps(PERSONS, protocol=5)
   %timeit pickle.loads(persons_pickled)
   57.3 ms ± 2.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
   
   # with pickle 5 with out-of-band buffers
   buffers = []
   persons_pickled = pickle.dumps(PERSONS, protocol=5, 
buffer_callback=buffers.append)
   %timeit pickle.loads(persons_pickled, buffers=buffers)
   119 µs ± 5.08 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
   ```
   
   So here you can see that pyarrow.(de)serialize and pickle with out-of-band 
buffers are quite similar when deserializing (pickle protocol 5 is actually 
faster). 
   Of course, this is also not a fully fair comparison. In the last case with 
out-of-band buffers, the ndarrays are recreated zero-copy (no copy of the 
actual array data is made), which is not the case with pyarrow.serialize (which 
will copy the data when creating the serialized buffer; you can do a similar 
zero-copy (de)serialization with pyarrow using 
`pa.serialize(..).to_components()` / `pa.deserialize_components(..)`). 
   
   But in the end, the question is how you are using this (de)serialization. In 
your simplified example, you are serializing/deserializing the objects 
in-memory in the same python process. But that might not be your actual use 
case (in which case your timings are not necessarily fully representative), but 
you might be rather saving to a temporary file, or passing through to another 
process in a multiprocessing context, or...? For example, in such a 
multiprocessing context, the out-of-band buffers can be communicated more 
efficiently separately. 
   In addition, depending on the use case you might not (need to) use pickle 
directly, but you might be using it through another library. For example if you 
do multiprocessing via dask, it will already make use such out-of-band buffers 
for you)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to