Currently, when NumPy saves data using pickle, it hard-codes the protocol version to 3, which was the default from Python 3.0 to Python 3.7. However, since Python 3.7 has reached its end-of-life (EOL), no actively maintained Python version defaults to pickle protocol 3.
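For reference, the difference between a pinned protocol and the interpreter default can be checked with the standard pickle module alone. This is a minimal sketch and not NumPy's actual code; the helper protocol_of and the sample payload are illustrative names only.

```python
import pickle

# Protocol the running interpreter uses when none is requested.
print("default protocol:", pickle.DEFAULT_PROTOCOL)
# Newest protocol this interpreter can read and write.
print("highest protocol:", pickle.HIGHEST_PROTOCOL)


def protocol_of(data: bytes) -> int:
    """Return the protocol number recorded in a pickle stream (protocol >= 2)."""
    # Streams written with protocol 2 or later start with the PROTO opcode
    # (0x80) followed by one byte holding the protocol number.
    if data[:1] != pickle.PROTO:
        raise ValueError("no PROTO opcode: protocol 0 or 1 stream")
    return data[1]


# Illustrative stand-in for the object NumPy would hand to pickle.
payload = {"shape": (2, 3), "values": [1.0, 2.0]}

pinned = pickle.dumps(payload, protocol=3)  # protocol pinned to 3
default = pickle.dumps(payload)             # follow the interpreter default

print("pinned stream uses protocol", protocol_of(pinned))
print("default stream uses protocol", protocol_of(default))
```

Both streams load back to the same object; only the recorded protocol differs.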
Letting NumPy use the default pickle protocol version allows objects larger than 4GB to be pickled, which resolves the issue described in #26224 [1]. Although the current default, protocol 4, is incompatible with Python 3.3 and earlier versions, Python 3.3 reached its EOL long ago. Therefore, there is no need to force pickle protocol 3 when saving data using pickle.

One reason for hard-coding the version is that upstream Python might bump its default to a protocol that breaks compatibility between two supported versions of NumPy. This scenario won't happen, because the Python core developers have stated that their policy is to default only to a protocol supported by all currently supported versions of Python.[2][3]

Another reason for NumPy to have its own default pickle protocol is that, because NumPy does not support every old Python version, we could bump the default earlier than upstream Python to get the performance gains. I ran some experiments[4] which show that while pickle protocol 4 improved overall np.save performance by about 4%, protocol 5 showed no comparable additional gain. Therefore, I'm not sure it is worth adding a NumPy-specific default pickle protocol. Protocol 5 uses out-of-band buffers, which may improve performance for lots of small pickles, but since np.save is typically used to save to a file, the filesystem overhead of creating lots of small files is the main bottleneck.

There are two possible solutions to this problem. One is to simply follow the upstream default protocol version (currently 4) and expect the Python developers to bump it wisely when they should. The other is to add a pickle_protocol keyword to the numpy.save function, defaulting to the highest protocol supported by the oldest Python version that NumPy supports (which is 5). Since this adds some complexity to a fairly basic NumPy interface, I believe the NumPy team needs to decide which way to go.
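The two options can be sketched with the standard pickle module only. This is a rough illustration under stated assumptions, not a proposed implementation: dump_with_protocol is a hypothetical helper that stands in for the pickling step inside numpy.save, and it writes to any file-like object.

```python
import io
import pickle


def dump_with_protocol(obj, fileobj, pickle_protocol=None):
    """Hypothetical stand-in for the pickling step behind a pickle_protocol keyword.

    pickle_protocol=None follows the upstream default (the first option);
    an explicit integer pins the protocol (the second option).
    """
    if pickle_protocol is None:
        # First option: use whatever the running interpreter defaults to.
        pickle_protocol = pickle.DEFAULT_PROTOCOL
    # This sketch rejects out-of-range values instead of passing them on.
    if not 0 <= pickle_protocol <= pickle.HIGHEST_PROTOCOL:
        raise ValueError(f"unsupported pickle protocol: {pickle_protocol}")
    pickle.dump(obj, fileobj, protocol=pickle_protocol)
    return pickle_protocol


# Illustrative object data; a real call would pass the array being saved.
data = ["a", 1, {"nested": None}]

follow_upstream = io.BytesIO()
used_default = dump_with_protocol(data, follow_upstream)

pinned = io.BytesIO()
used_pinned = dump_with_protocol(data, pinned, pickle_protocol=5)

print("followed upstream default:", used_default)
print("explicitly pinned:", used_pinned)
```

With None as the sentinel, callers who never pass the keyword automatically pick up any future upstream bump, while callers who need a specific protocol can still request it.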
If we decide to add this keyword, I can finish the documentation (with an explanation of the performance differences between pickle protocol versions), add this to the NumPy release checklist, and write the code. If we decide to add NumPy's own default pickle protocol and set it to 5, I can also look, in another PR, into whether we can use protocol 5's reuse of out-of-band buffers to keep things in cache and improve performance.

Best regards,
Chunqing Shan

[1] <https://github.com/numpy/numpy/issues/26224>
[2] <https://github.com/python/cpython/blob/5a1618a2c8c108b8c73aa9459b63f0dbd66b60f6/Lib/pickle.py#L68-L70>
[3] <https://github.com/numpy/numpy/pull/26388#issuecomment-2097197375>
[4] With

    import numpy as np
    a = np.array(np.zeros((1024, 1024, 24, 7)), dtype=object)  # 1.5GB

time the following statement (/dev/shm is configured as a tmpfs):

    np.save('/dev/shm/a.npy', a)

For protocol 3, this takes 4.25s on average; for protocol 4, 4.07s; for protocol 5, 4.06s. These results were obtained on a bare-metal Xeon E-2286G with 2x 16GB 2667 MT/s DDR4 ECC memory, Python 3.11.2, and NumPy 1.26.4.