[Numpy-discussion] An extension of the .npy file format

Michael Siebert Sat, 08 Jan 2022 16:07:55 -0800

Dear all,

Originally, I had planned to make an extension of the .npy file format a
dedicated follow-up pull request, but I have extended my current pull request
instead, since it was not as difficult to implement as I initially thought
and is probably the more straightforward solution:


https://github.com/numpy/numpy/pull/20321/

What is this pull request about? It is about appending to NumPy .npy files.
Why? I see two main use cases:

   1. Creating .npy files larger than main memory. Once finished, they
   can be loaded as memory maps.
   2. Creating binary log files, which can be processed very efficiently
   without parsing.

Are there not other good file formats for this? In theory yes, but in
practice they can be pretty complex, and with very little tweaking .npy
could support efficient appending too.

Use case 1 is already covered by the Pip/Conda package npy-append-array that
I created, and getting this functionality directly into NumPy was the
original goal of the pull request. This would have been possible without
introducing a new file format version, just by adding some spare space in
the header. During the pull request discussion it turned out that rewriting
the header after each append would be desirable, to minimize data loss in
case the writing program crashes.
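
To illustrate the spare-space idea, here is a minimal sketch, not the code
from the pull request or from npy-append-array. It assumes a version 1.0
header padded to a fixed size (HEADER_TOTAL, build_header and append_rows are
hypothetical names chosen for this example) and rewrites the header in place
after every append:

```python
import os
import numpy as np

MAGIC = b"\x93NUMPY\x01\x00"   # .npy magic string followed by version 1.0
HEADER_TOTAL = 128             # hypothetical fixed header size with spare space


def build_header(shape, dtype):
    """Return a version 1.0 header padded with spaces to HEADER_TOTAL bytes."""
    fields = {
        "descr": np.lib.format.dtype_to_descr(np.dtype(dtype)),
        "fortran_order": False,
        "shape": tuple(int(n) for n in shape),
    }
    body_len = HEADER_TOTAL - len(MAGIC) - 2   # 2 bytes hold the length field
    body = repr(fields).ljust(body_len - 1) + "\n"
    if len(body) != body_len:
        raise ValueError("header no longer fits into the reserved space")
    return MAGIC + body_len.to_bytes(2, "little") + body.encode("latin1")


def append_rows(path, rows):
    """Append rows along axis 0, then rewrite the header in place."""
    rows = np.ascontiguousarray(rows)
    exists = os.path.exists(path)
    count = 0
    if exists:
        with open(path, "rb") as f:
            np.lib.format.read_magic(f)
            shape, _, dtype = np.lib.format.read_array_header_1_0(f)
        if shape[1:] != rows.shape[1:] or dtype != rows.dtype:
            raise ValueError("appended rows do not match the existing array")
        count = shape[0]
    new_shape = (count + rows.shape[0],) + rows.shape[1:]
    with open(path, "r+b" if exists else "w+b") as f:
        if not exists:
            f.write(build_header((0,) + rows.shape[1:], rows.dtype))
        f.seek(0, os.SEEK_END)
        f.write(rows.tobytes())       # data first ...
        f.flush()
        f.seek(0)
        f.write(build_header(new_shape, rows.dtype))   # ... then the header
```

Because the data is written before the header, a crash between the two
writes leaves a file whose header still describes the previously appended
rows, so a plain np.load keeps working on it.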

Use case 2, however, would benefit greatly from a new file format version,
as it would make rewriting the header unnecessary. Since efficient appending
can only take place along one axis, setting shape[-1] = -1 for Fortran
order, or shape[0] = -1 otherwise (the default), in the .npy header on file
creation could indicate that the array size is determined by the file size.
When np.load (typically with memory mapping on) gets called, it constructs
the ndarray with the actual shape by replacing the -1 in the constructor
call. Apart from that, the header is never modified again, neither on append
nor when writing finishes.
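
The shape resolution described above can be sketched as follows. This is an
illustration of the idea only, not the np.load change from the pull request;
resolve_shape and its parameters are hypothetical names, and dropping a
trailing partial record is an assumption of this sketch:

```python
import math


def resolve_shape(shape, fortran_order, itemsize, data_nbytes):
    """Replace a -1 on the append axis with the length implied by the data size.

    shape          shape tuple as stored in the header, e.g. (-1, 3)
    fortran_order  True if the array is stored in Fortran order
    itemsize       size of one array element in bytes
    data_nbytes    file size minus the header size
    """
    shape = tuple(shape)
    if not shape:
        return shape
    # Appending is only cheap along the slowest-varying axis.
    axis = len(shape) - 1 if fortran_order else 0
    if shape[axis] != -1:
        return shape
    fixed = shape[:axis] + shape[axis + 1:]
    bytes_per_step = itemsize * math.prod(fixed)
    # Assumption: an incomplete trailing record (e.g. a write in progress)
    # is ignored by rounding down.
    length = data_nbytes // bytes_per_step
    return shape[:axis] + (length,) + shape[axis + 1:]
```

For example, a C-order header shape of (-1, 3) with 8-byte elements and 240
bytes of data resolves to (10, 3).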

Concurrent appends to a single file would not be advisable and should be
channeled through a single AppendArray instance. Concurrent reads while
writes take place, however, should work relatively smoothly: every time
np.load (ideally with mmap) is called, the ndarray would provide access to
all data written up to that point.
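
The read side of this can be illustrated with a small sketch. It does not use
the proposed format; it assumes a headerless file of fixed-size records
(RECORD and snapshot are hypothetical names) and uses np.memmap to map
whatever is on disk at the time of the call:

```python
import os
import numpy as np

# Hypothetical log record layout: a timestamp and an integer value.
RECORD = np.dtype([("t", "<f8"), ("value", "<i4")])


def snapshot(path, offset=0):
    """Map all complete records that are in the file right now.

    offset is the number of header bytes to skip (0 for this raw sketch).
    """
    count = (os.path.getsize(path) - offset) // RECORD.itemsize
    if count <= 0:
        # np.memmap cannot map an empty region.
        return np.empty(0, dtype=RECORD)
    return np.memmap(path, dtype=RECORD, mode="r", offset=offset,
                     shape=(count,))
```

A reader that wants newer data simply calls snapshot again; an earlier
snapshot keeps its original length.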

Currently, my pull request provides:

   1. A definition of .npy version 4.0 that supports -1 in the shape
   2. Implementations for Fortran order and non-Fortran order (default),
   including test cases
   3. An updated np.load
   4. The AppendArray class that does the actual appending

Although introducing a new .npy version involves a certain amount of hassle,
the changes themselves are very small. I could also implement a fallback mode
for older NumPy installations, if someone is interested.

What do you think about such a feature? Would it make sense? Is anyone
available for some more code review?

Best from Berlin, Michael

PS: Thank you for the feedback so far. I was able to improve my
npy-append-array module as well, and from what I have seen so far, the
readability of the NumPy code exceeded my already high expectations.
