jorisvandenbossche commented on issue #11239: URL: https://github.com/apache/arrow/issues/11239#issuecomment-930053941
One of the main advantages of pickle protocol 5 is that it supports out-of-band buffers (https://docs.python.org/3/library/pickle.html#out-of-band-buffers, https://www.python.org/dev/peps/pep-0574/). This means that large buffers (i.e. the actual array data of the numpy arrays in this case) can be handled separately from the rest of the pickle stream.

Using your example of the Persons class holding a numpy array, and adding a toy example of using out-of-band buffers:

```python
# with pyarrow.(de)serialize
persons_serialized = pa.serialize(PERSONS, context=context).to_buffer()
%timeit pa.deserialize(persons_serialized, context=context)
...
439 µs ± 23.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# with pickle 5
persons_pickled = pickle.dumps(PERSONS, protocol=5)
%timeit pickle.loads(persons_pickled)
57.3 ms ± 2.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# with pickle 5 with out-of-band buffers
buffers = []
persons_pickled = pickle.dumps(PERSONS, protocol=5, buffer_callback=buffers.append)
%timeit pickle.loads(persons_pickled, buffers=buffers)
119 µs ± 5.08 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```

So here you can see that pyarrow.(de)serialize and pickle with out-of-band buffers are quite similar when deserializing (pickle protocol 5 is actually faster).

Of course, this is also not a fully fair comparison. In the last case with out-of-band buffers, the ndarrays are recreated zero-copy (no copy of the actual array data is made). That is not the case with pyarrow.serialize, which copies the data when creating the serialized buffer. You can do a similar zero-copy (de)serialization with pyarrow using `pa.serialize(..).to_components()` / `pa.deserialize_components(..)`.

But in the end, the question is how you are using this (de)serialization. In your simplified example, you are serializing/deserializing the objects in-memory in the same python process. But that might not be your actual use case?
(saving to a temporary file, passing data to another process in a multiprocessing context, ...)

For example, in such a multiprocessing context, the out-of-band buffers can be communicated separately from the pickle stream, which is more efficient. In addition, depending on the use case, you might not (need to) use pickle directly, but might be using it through another library. For example, if you do multiprocessing via dask, it will already make use of such out-of-band buffers for you.
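
To make the out-of-band mechanism concrete without depending on numpy or the Persons class, here is a minimal sketch using only the standard library. The names `big` and `payload` are hypothetical stand-ins for your array data and the object holding it. Wrapping the data in `pickle.PickleBuffer` is what lets the pickler hand it to `buffer_callback` instead of copying it into the stream. Everything happens in one process here, so this only illustrates the mechanism, not a realistic transport between processes.

```python
import pickle

# Hypothetical stand-in for the large array data (about 1 MB).
big = bytearray(b"x" * 1_000_000)
payload = {"name": "alice", "data": pickle.PickleBuffer(big)}

# In-band: the buffer contents are copied into the pickle stream.
in_band = pickle.dumps(payload, protocol=5)
copy = pickle.loads(in_band)  # copy["data"] is a new bytearray

# Out-of-band: the pickler passes each PickleBuffer to the callback
# and stores only a placeholder in the stream.
buffers = []
out_of_band = pickle.dumps(payload, protocol=5, buffer_callback=buffers.append)

# The same buffers must be supplied, in the same order, when loading.
restored = pickle.loads(out_of_band, buffers=buffers)

# Check that the restored buffer still refers to the original memory.
with memoryview(restored["data"]) as view:
    shares_memory = view.obj is big

print(len(in_band), len(out_of_band), len(buffers), shares_memory)
```

The out-of-band stream stays tiny because it carries only metadata. How the buffers themselves are moved (shared memory, a separate socket write, a memory-mapped file) is up to the caller, and that is where the efficiency gain in a multiprocessing setting would come from.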
