pitrou commented on issue #11239: URL: https://github.com/apache/arrow/issues/11239#issuecomment-941116357
For the record, I get the following numbers here:

**pickle5 with copies**
```pycon
>>> %timeit persons_pickled = pickle5.dumps(PERSONS, protocol=5)
39.3 ms ± 389 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit persons_depickled = pickle5.loads(persons_pickled)
28.9 ms ± 71.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

**pickle5 with out-of-band buffers**
```pycon
>>> %timeit buffers=[]; persons_pickled = pickle5.dumps(PERSONS, protocol=5, buffer_callback=buffers.append)
231 µs ± 1.13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit persons_depickled = pickle5.loads(persons_pickled, buffers=buffers)
121 µs ± 336 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```

**PyArrow serialization**
```pycon
>>> %timeit persons_serialized = pa.serialize(PERSONS, context=context).to_buffer()
18.6 ms ± 79.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit persons_deserialized = pa.deserialize(persons_serialized, context=context)
398 µs ± 282 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

**Summary table**

| | Serialization | Deserialization |
| -- | -- | -- |
| pickle5 with copies | 39.3 ms | 28.9 ms |
| pickle5 with out-of-band buffers | 231 µs | 121 µs |
| PyArrow serialization | 18.6 ms | 398 µs |

**Short analysis**

By default, with `pickle` you pay the price of memory copies for both serialization and deserialization. PyArrow serialization avoids the memory copies, but only on the read path (deserialization). `pickle` out-of-band buffers avoid memory copies on _both_ sides.
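To illustrate the difference between in-band and out-of-band pickling, here is a minimal sketch using only the standard library. It assumes Python 3.8 or later, where protocol 5 is built into `pickle` (the timings above use the `pickle5` backport, which exposes the same API). The `payload` and `record` names are hypothetical stand-ins and not the `PERSONS` data used in the benchmark.

```python
import pickle  # protocol 5 is in the standard library from Python 3.8

# Hypothetical stand-in for the benchmark data: one large writable buffer.
payload = bytearray(b"x" * 1_000_000)
record = {"name": "alice", "data": pickle.PickleBuffer(payload)}

# In-band: the buffer contents are copied into the pickle stream.
in_band = pickle.dumps(record, protocol=5)

# Out-of-band: the stream only holds a placeholder; the buffer itself is
# handed to buffer_callback and has to be transported separately.
buffers = []
out_of_band = pickle.dumps(record, protocol=5, buffer_callback=buffers.append)

copied = pickle.loads(in_band)
shared = pickle.loads(out_of_band, buffers=buffers)

# Mutating the original shows which result still shares its memory.
payload[0] = ord("y")
print(len(in_band), len(out_of_band))
print(bytes(copied["data"][:1]))              # independent copy
print(bytes(memoryview(shared["data"])[:1]))  # view on the original memory
```

In this sketch the in-band stream is larger than the payload itself, while the out-of-band stream stays tiny because the payload travels through `buffers`. The object reconstructed from the out-of-band stream still refers to the original memory, which is why neither side pays for a copy.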
