I am curious what you and other developers think about adopting
JSON/binary JSON as a similarly simple, reverse-engineerable, but
universally parsable array exchange format instead of designing another
numpy-specific binary format.
I am interested in this topic (as well as in the thoughts of numpy
developers) because I am currently working on a project - NeuroJSON
(https://neurojson.org) - funded by the US National Institutes of Health.
The goal of the NeuroJSON project is to create easy-to-adopt,
easy-to-extend, and preferably human-readable data formats to help
disseminate and exchange neuroimaging data (and scientific data in
general).
Needless to say, numpy is a key toolkit widely used in neuroimaging
data analysis pipelines. I've seen discussions about potentially
adopting npy as a standardized way to share volumetric data
(as ndarrays), such as in this thread
https://github.com/bids-standard/bids-specification/issues/197
however, several limitations were also discussed, for example:
1. npy only supports a single numpy array and does not support metadata
or more complex data records (multiple arrays can only be achieved via
multiple files)
2. no internal (i.e. data-level) compression, only file-level compression
3. although the file is simple, it still requires a parser to
read/write, and such a parser is not widely available in other
environments, making it mostly limited to exchanging data among Python
programs
4. I am not entirely sure, but I suppose it does not support sparse
matrices or special matrices (such as diagonal/band/symmetric etc.) - I
could be wrong, though
In the NeuroJSON project, we primarily use JSON and binary JSON
(specifically, the UBJSON <https://ubjson.org/>-derived BJData
<https://json.nlohmann.me/features/binary_formats/bjdata/> format) as
the underlying data exchange formats. Through standardized data
annotations
<https://github.com/NeuroJSON/jdata/blob/master/JData_specification.md#data-annotation-keywords>,
we are able to address most of the above limitations - the generated
files are universally parsable in nearly all programming environments
with existing parsers, support complex hierarchical data and internal
compression, and can readily benefit from the large JSON ecosystem
(JSON-schema, JSONPath, JSON-LD, jq, numerous parsers, web-readiness,
NoSQL databases, ...).
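To give a flavor of the annotation (a simplified sketch - see the
specification linked above for the exact keyword definitions), a small
2x3 double array could be stored in plain JSON as

    {
        "_ArrayType_": "double",
        "_ArraySize_": [2, 3],
        "_ArrayData_": [1, 0, 0, 0, 1, 0]
    }

and a compressed variant replaces "_ArrayData_" with
"_ArrayZipType_"/"_ArrayZipSize_"/"_ArrayZipData_" (base64-encoded in
text JSON, raw bytes in binary JSON); the same construct maps directly
onto the strongly-typed binary containers in BJData.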
I understand that simplicity is a key design spec here. I want to
highlight UBJSON/BJData as a competitive alternative format. It was
also designed with simplicity as a primary goal
<https://ubjson.org/#why>, yet it allows storing hierarchical,
strongly-typed, complex binary data and is easily extensible.
A UBJSON/BJData parser is not necessarily longer than an npy parser;
for example, the Python reader for the full spec takes only about 500
lines of code (including comments), and similarly for a JS parser:
https://github.com/NeuroJSON/pybj/blob/master/bjdata/decoder.py
https://github.com/NeuroJSON/js-bjdata/blob/master/bjdata.js
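As a minimal sketch (assuming the dumpb/loadb entry points inherited
from py-ubjson, from which pybj was derived; check the repository above
for the exact API), a round trip looks like

    import bjdata as bj

    obj = {'name': 'mesh',
           'node': [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]}
    blob = bj.dumpb(obj)       # encode a Python dict into a BJData byte string
    decoded = bj.loadb(blob)   # decode the byte string back into Python objects
    print(decoded['name'], len(decoded['node']))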
We actually did a benchmark
<https://github.com/neurolabusc/MeshFormatsJS> a few months back - the
test workloads were two large 2D numerical arrays (node and face arrays
storing surface mesh data), and we compared the parsing speed of
various formats in Python, MATLAB, and JS. The uncompressed BJData
(BMSHraw) loads nearly as fast as reading a raw binary dump, and the
internally compressed BJData (BMSHz) gives the best balance between
small file size and loading speed; see our results here
https://pbs.twimg.com/media/FRPEdLGWYAEJe80?format=png&name=large
I want to add two quick points to echo the features you desired in npy:
1. it is not common to use mmap in reading JSON/binary JSON files, but
it is certainly possible. I recently wrote a JSON-mmap spec
<https://github.com/NeuroJSON/jsonmmap/blob/main/JSON-Mmap_Specification.md>
and a MATLAB reference implementation
<https://github.com/NeuroJSON/jsonmmap/tree/main/lib>
2. UBJSON/BJData natively supports appendable root-level records; JSON
has been used extensively in data streaming via appendable ND-JSON or
concatenated JSON (https://en.wikipedia.org/wiki/JSON_streaming)
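For plain-text JSON, record-per-line appending needs nothing beyond the
standard library; a minimal sketch (file name made up for illustration):

    import json

    # append one record per line (ND-JSON / concatenated-JSON style)
    with open('stream.jsonl', 'a') as f:
        f.write(json.dumps({'frame': 42, 'mean': 0.37}) + '\n')

    # read the records back one by one
    with open('stream.jsonl') as f:
        records = [json.loads(line) for line in f]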
Just a quick comparison of output file sizes with a 1000x1000
unit-diagonal (identity) matrix:
    # python3 -m pip install jdata bjdata
    import numpy as np
    import jdata as jd

    x = np.eye(1000)               # create a large array
    y = np.vsplit(x, 5)            # split into smaller chunks
    np.save('eye5chunk.npy', y)    # save npy
    jd.save(y, 'eye5chunk_bjd_raw.jdb')                            # save as uncompressed bjd
    jd.save(y, 'eye5chunk_bjd_zlib.jdb', {'compression': 'zlib'})  # zlib-compressed bjd
    jd.save(y, 'eye5chunk_bjd_lzma.jdb', {'compression': 'lzma'})  # lzma-compressed bjd
    newy = jd.load('eye5chunk_bjd_zlib.jdb')   # loading/decoding
    newx = np.concatenate(newy)                # regroup chunks
    newx.dtype
Here are the output file sizes in bytes:

    8000128  eye5chunk.npy
    5004297  eye5chunk_bjd_raw.jdb
      10338  eye5chunk_bjd_zlib.jdb
       2206  eye5chunk_bjd_lzma.jdb
Qianqian
On 8/24/22 15:48, Michael Siebert wrote:
Hi Matti, hi all,
@Matti: I don’t know exactly what you are referring to (the pull
request or the GitHub project; links below). Maybe some clarification
is needed, which I hereby try to provide ;)
A .npy file created by some appending process is a regular .npy file
and does not need to be read in chunks. Processing arrays larger than
the system’s memory can already be done with memory mapping
(numpy.load(… mmap_mode=...)), so no third-party support is needed for
that.
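For reference, a minimal sketch of that workflow (file name made up for
illustration):

    import numpy as np

    # memory-map an existing .npy file; pages are loaded lazily on access
    arr = np.load('images.npy', mmap_mode='r')
    batch = arr[1000:1032]   # touches only the pages backing this slice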
The idea is not necessarily to only write some known-but-fragmented
content to a .npy file in chunks or to only handle files larger than
the RAM.
It is more about the ability to append to a .npy file at any time and
between program runs. For example, in our case, we have a large
database-like file containing all (preprocessed) images of all videos
used to train a neural network. When new video data arrives, it can
simply be appended to the existing .npy file. When training the neural
net, the data is simply memory mapped, which happens basically
instantly and does not consume extra memory across multiple training
processes. We have tried out various fancy, advanced data formats for
this task, but most of them don’t provide the memory-mapping feature,
which is very handy for keeping the time required to test a code change
comfortably low; instead, they have excessive parse/decompress times.
Other libraries can also be difficult to handle, see below.
The .npy array format is designed to be limited. There is a NEP for
it, which summarizes the .npy features and concepts very well:
https://numpy.org/neps/nep-0001-npy-format.html
One of my favorite features (besides memory mapping perhaps) is this one:
“… Be reverse engineered. Datasets often live longer than the programs
that created them. A competent developer should be able to create a
solution in his preferred programming language to read most NPY files
that he has been given without much documentation. ..."
This is a big disadvantage with all the fancy formats out there: they
require dedicated libraries. Some of these libraries don’t come with
nice and free documentation (especially lacking
easy-to-use/easy-to-understand code examples for the target language,
e.g. C) and/or can be extremely complex, like HDF5. Yes, HDF5 has its
users and is totally valid if one operates the world’s largest
particle accelerator, but we have spent weeks finding a C/C++ library
for it that does not expose bugs and is somewhat documented. We
actually failed and posted a bug which was fixed a year or so later.
This can ruin entire projects - fortunately not ours, but it ate up a
lot of time we could have spent more meaningfully. On the other
hand, I don’t see how e.g. zarr provides added value over .npy if one
only needs the .npy features and maybe some append-data-along-one-axis
feature. Yes, maybe there are some uses for two or three appendable
axes, but I think having one axis to append to should cover a lot of
use cases: this axis is typically time: video, audio, GPS, signal data
in general, binary log data, "binary CSV" (lines in file): all of
those only need one axis to append to.
The .npy format is so simple that it can be read, e.g., in C in a few
lines, or accessed easily through Numpy and ctypes via pointers for
high-speed custom logic - not even requiring libraries besides Numpy.
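As a rough illustration of that simplicity, here is a minimal sketch of
a version 1.0 .npy reader in Python (no error handling; np.load is of
course the proper tool):

    import ast
    import numpy as np

    def read_npy_v1(path):
        # minimal .npy (version 1.0) reader: magic, version, header dict, raw data
        with open(path, 'rb') as f:
            assert f.read(6) == b'\x93NUMPY'   # magic string
            major, minor = f.read(2)           # format version
            header_len = int.from_bytes(f.read(2), 'little')
            header = ast.literal_eval(f.read(header_len).decode('latin1'))
            data = np.frombuffer(f.read(), dtype=np.dtype(header['descr']))
        order = 'F' if header['fortran_order'] else 'C'
        return data.reshape(header['shape'], order=order)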
Making .npy appendable is easy to implement. Yes, appending along one
axis is as limited as the .npy format itself, but I consider that to be
a feature rather than an (actual) limitation, as it allows for fast and
simple appends.
The question is if there is some support for an
append-to-.npy-files-along-one-axis feature in the Numpy community and
if so, about the details of the actual implementation. I made one
suggestion in
https://github.com/numpy/numpy/pull/20321/
and I offer to invest time to update/modify/finalize the PR. I’ve also
created a library that can already append to .npy:
https://github.com/xor2k/npy-append-array
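A minimal usage sketch (the exact API may differ from the current
README, so treat this as illustrative):

    import numpy as np
    from npy_append_array import NpyAppendArray

    with NpyAppendArray('out.npy') as npaa:
        npaa.append(np.zeros((100, 64)))   # creates the file on first append
        npaa.append(np.ones((50, 64)))     # grows it along axis 0

    data = np.load('out.npy', mmap_mode='r')   # shape (150, 64)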
However, due to current limitations of the .npy format, the code is
more complex than it needs to be (the library initializes and checks
spare space in the header) and it has to rewrite the header on every
append. Both could be made unnecessary with a very small addition to
the .npy file format. Data would stay contiguous (no fragmentation!);
there would just need to be a way to indicate that the actual shape of
the array should be derived from the file size.
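To spell out the idea with a rough sketch (my own illustration, not the
actual PR implementation): if the header marked the first axis as
"determined by the file size", a reader could recover the full shape
along these lines:

    import numpy as np

    def infer_shape(file_size, data_offset, dtype, trailing_shape):
        # length of the appendable first axis implied by the file size
        bytes_per_row = np.dtype(dtype).itemsize * int(np.prod(trailing_shape))
        n_rows = (file_size - data_offset) // bytes_per_row
        return (n_rows, *trailing_shape)

    # e.g. a float64 array with trailing shape (1000,) and a 128-byte header
    print(infer_shape(8000128, 128, 'float64', (1000,)))   # -> (1000, 1000)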
Best, Michael
On 24. Aug 2022, at 19:16, Matti Picus <matti.pi...@gmail.com> wrote:
Sorry for the late reply. Adding a new "*.npy" format feature to
allow writing to the file in chunks is nice but seems a bit limited.
As I understand the proposal, reading the file back can only be done
in the chunks that were originally written. I think other libraries
like zarr or h5py have solved this problem in a more flexible way. Is
there a reason you cannot use a third-party library to solve this? I
would think if you have an array too large to write in one chunk you
will need third-party support to process it anyway.
Matti