Currently, when NumPy saves data using pickle, it hard-codes the protocol
version to 3, which was the default from Python 3.0 to Python 3.7. However,
since Python 3.7 has reached its end-of-life (EOL), no actively maintained
Python version defaults to pickle protocol 3.  

  

If NumPy is allowed to use Python's default pickle protocol version, objects
larger than 4GB can be pickled, which resolves the issue described in
#26224 [1]. Although the newer protocol cannot be read by Python 3.3 and
earlier versions, Python 3.3 reached its EOL long ago. Therefore, there is no
need to force pickle protocol 3 when saving data using pickle.  
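
As a small illustration of the difference between pinning the protocol and following the interpreter's default, the sketch below uses only the standard library. The helper `protocol_of` is a hypothetical name introduced here for illustration; it reads the version byte that follows the PROTO opcode at the start of a pickle stream (present for protocol 2 and later).

```python
import pickle


def protocol_of(stream: bytes) -> int:
    """Return the protocol version recorded in a pickle stream.

    Hypothetical helper for illustration. Streams written with
    protocol >= 2 start with the PROTO opcode (0x80) followed by
    one byte holding the protocol number.
    """
    if stream[:1] != b"\x80":
        raise ValueError("no PROTO opcode; stream uses protocol 0 or 1")
    return stream[1]


data = {"shape": (2, 3), "values": [1.0, 2.0]}

# No protocol argument: the interpreter's default is used.
default_stream = pickle.dumps(data)
# Explicit protocol: the stream is pinned to protocol 3.
pinned_stream = pickle.dumps(data, protocol=3)

print("interpreter default:", pickle.DEFAULT_PROTOCOL)
print("interpreter highest:", pickle.HIGHEST_PROTOCOL)
print("default stream uses:", protocol_of(default_stream))
print("pinned stream uses: ", protocol_of(pinned_stream))
```

The printed default depends on the Python version running the snippet, which is exactly the behaviour NumPy would inherit by not pinning the protocol.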

  

One reason for hard-coding the version is that upstream Python might bump the
default to a new protocol that breaks compatibility between two supported
versions of NumPy. This scenario should not happen, because Python core
developers have stated that the policy is to use as default only a protocol
supported by all currently supported versions of Python.[2][3]  

  

Another reason for NumPy to have its own default pickle protocol version is
that, because NumPy does not support every old Python version, we could bump
the default earlier than upstream Python does and get the performance gains
sooner.  

  

I have done some experiments[4] which show that, while pickle protocol 4
improved overall np.save performance by about 4%, protocol 5 brought no
comparable further improvement. Therefore, I'm not sure whether it is worth
adding a NumPy-specific default pickle protocol version. Protocol 5 uses
out-of-band buffers, which may improve performance for lots of small pickles,
but since np.save is typically used to save to a file, the filesystem overhead
of creating lots of small files is the main bottleneck.  
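
For readers unfamiliar with protocol 5's out-of-band buffers, here is a minimal standard-library sketch of the mechanism. It uses a `bytearray` wrapped in `pickle.PickleBuffer` as a stand-in for a large array's memory; it is not how np.save works today and says nothing about its performance.

```python
import pickle

# Stand-in for the memory of a large array (an assumption for illustration).
payload = bytearray(b"\x00" * 1_000_000)

# In-band: the payload bytes are copied into the pickle stream itself.
in_band = pickle.dumps(pickle.PickleBuffer(payload), protocol=5)

# Out-of-band: the stream only records that a buffer follows; the buffer
# itself is handed to buffer_callback. list.append returns None, which
# tells pickle to keep the buffer out of the stream.
buffers = []
out_of_band = pickle.dumps(
    pickle.PickleBuffer(payload), protocol=5, buffer_callback=buffers.append
)

# Loading needs the same buffers, in the same order.
restored = pickle.loads(out_of_band, buffers=buffers)

print("in-band stream size:    ", len(in_band))
print("out-of-band stream size:", len(out_of_band))
print("buffers collected:      ", len(buffers))
print("round trip intact:      ", memoryview(restored).tobytes() == bytes(payload))
```

The out-of-band stream stays tiny because the data travels separately, which helps when the consumer can take the buffers without copying. When everything ends up in a single file anyway, that advantage largely disappears.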

  

There are two possible solutions to this problem. One is to just follow the
upstream protocol version (which is currently 4) and expect upstream to bump
it wisely when appropriate. The second is to add a pickle_protocol keyword to
the numpy.save function, defaulting to the highest protocol supported by the
oldest Python version that NumPy supports (which is 5). Since this adds some
complexity to a fairly basic NumPy interface, I believe the NumPy maintainers
need to decide which way to go. If we decide to add this keyword, I can write
the code, finish the documentation (with an explanation of the performance
differences between pickle protocol versions), and add the protocol version
to the NumPy release checklist.  
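
To make the two options concrete, here is a rough sketch of how such a keyword could resolve its value. The function `dump_object` and its `pickle_protocol` parameter are hypothetical and are not part of NumPy's API; the real numpy.save also writes a header and handles non-object arrays, which is omitted here.

```python
import io
import pickle


def dump_object(obj, fp, pickle_protocol=None):
    """Pickle obj to the file object fp and return the protocol used.

    Hypothetical helper, not NumPy code. With pickle_protocol=None it
    follows the interpreter's default (the first option); an explicit
    integer pins the protocol (the second option).
    """
    if pickle_protocol is None:
        pickle_protocol = pickle.DEFAULT_PROTOCOL
    if not 0 <= pickle_protocol <= pickle.HIGHEST_PROTOCOL:
        raise ValueError(
            f"pickle_protocol must be between 0 and {pickle.HIGHEST_PROTOCOL}"
        )
    pickle.dump(obj, fp, protocol=pickle_protocol)
    return pickle_protocol


# Follow upstream's default.
buf_default = io.BytesIO()
used_default = dump_object(["a", 1, 2.0], buf_default)

# Pin protocol 5 explicitly.
buf_pinned = io.BytesIO()
used_pinned = dump_object(["a", 1, 2.0], buf_pinned, pickle_protocol=5)

print("default call used protocol", used_default)
print("pinned call used protocol ", used_pinned)
```

Using `None` as the default keeps the "follow upstream" behaviour available even if a keyword is added, at the cost of one more parameter to document.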

  

If we decide to add NumPy's own default pickle protocol version and set it to
5, I can also investigate, in a separate PR, whether reusing protocol 5's
out-of-band buffers can keep data in cache and improve performance.  

  

Best regards,  

Chunqing Shan  

  

1\. <https://github.com/numpy/numpy/issues/26224>  

2\. <https://github.com/python/cpython/blob/5a1618a2c8c108b8c73aa9459b63f0dbd66b60f6/Lib/pickle.py#L68-L70>  

3\. <https://github.com/numpy/numpy/pull/26388#issuecomment-2097197375>  

4\. With  

a = np.array(np.zeros((1024, 1024, 24, 7)), dtype=object) # 1.5GB  

and timing the following statement (/dev/shm is configured as a tmpfs):  

np.save('/dev/shm/a.npy', a)  

Protocol 3 takes 4.25s on average, protocol 4 takes 4.07s on average, and
protocol 5 takes 4.06s on average. These results were obtained on a bare-metal
Xeon E-2286G machine with 2x16GB of 2667 MT/s DDR4 ECC memory, running Python
3.11.2 and NumPy 1.26.4.  
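
For anyone who wants a quick comparison without NumPy or a tmpfs, the following standard-library sketch times `pickle.dump` into memory for each protocol. It is not the benchmark used above: the function name `time_protocols`, the sample data, and the best-of-N timing are all assumptions made for illustration, so the numbers will not match the ones reported here.

```python
import io
import pickle
import time


def time_protocols(obj, protocols=(3, 4, 5), repeats=3):
    """Return the best wall-clock time in seconds per protocol.

    Hypothetical helper for illustration. It pickles obj into an
    in-memory buffer, so filesystem costs are not measured.
    """
    results = {}
    for protocol in protocols:
        best = float("inf")
        for _ in range(repeats):
            sink = io.BytesIO()
            start = time.perf_counter()
            pickle.dump(obj, sink, protocol=protocol)
            best = min(best, time.perf_counter() - start)
        results[protocol] = best
    return results


# Small stand-in workload: a list of Python floats.
sample = [float(i) for i in range(200_000)]
timings = time_protocols(sample)
for protocol, seconds in timings.items():
    print(f"protocol {protocol}: {seconds:.4f} s")
```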
