I am curious what you and other developers think about adopting
JSON/binary JSON as a similarly simple, reverse-engineerable, but
universally parsable array exchange format instead of designing another
numpy-specific binary format.
I am interested in this topic (as well as in the thoughts of numpy
developers) because I am currently working on a project - NeuroJSON
(https://neurojson.org) - funded by the US National Institutes of Health.
The goal of the NeuroJSON project is to create easy-to-adopt,
easy-to-extend, and preferably human-readable data formats to help
disseminate and exchange neuroimaging data (and scientific data in
general).
Needless to say, numpy is a key toolkit widely used in neuroimaging
data analysis pipelines. I've seen discussions about potentially
adopting npy as a standardized way to share volumetric data
(as ndarrays), such as in this thread
https://github.com/bids-standard/bids-specification/issues/197
however, several limitations were also discussed, for example:
1. npy only supports a single numpy array and does not support metadata
or more complex data records (multiple arrays can only be achieved via
multiple files)
2. no internal (i.e. data-level) compression, only file-level compression
3. although the file is simple, it still requires a parser to
read/write, and such a parser is not widely available in other
environments, making it mostly limited to exchanging data among Python
programs
4. I am not entirely sure, but I suppose it does not support sparse
matrices or special matrices (such as diagonal/band/symmetric etc.) - I
could be wrong, though
In the NeuroJSON project, we primarily use JSON and binary JSON
(specifically, the UBJSON <https://ubjson.org/>-derived BJData
<https://json.nlohmann.me/features/binary_formats/bjdata/> format) as
the underlying data exchange formats. Through standardized data
annotations
<https://github.com/NeuroJSON/jdata/blob/master/JData_specification.md#data-annotation-keywords>,
we are able to address most of the above limitations - the generated
files are universally parsable in nearly all programming environments
with existing parsers, support complex hierarchical data and internal
compression, and can readily benefit from the large JSON ecosystem
(JSON-schema, JSONPath, JSON-LD, jq, numerous parsers, web-readiness,
NoSQL databases, ...).
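To give a flavor of the annotation (a simplified sketch - see the
specification linked above for the exact keyword definitions), a small
2x3 double array could be stored in plain JSON as

    {
        "_ArrayType_": "double",
        "_ArraySize_": [2, 3],
        "_ArrayData_": [1, 0, 0, 0, 1, 0]
    }

and a compressed variant replaces "_ArrayData_" with
"_ArrayZipType_"/"_ArrayZipSize_"/"_ArrayZipData_" (base64-encoded in
text JSON, raw bytes in binary JSON); the same construct maps directly
onto the strongly-typed binary containers in BJData.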
I understand that simplicity is a key design spec here. I want to
highlight UBJSON/BJData as a competitive alternative format. It was
also designed with simplicity as a primary goal
<https://ubjson.org/#why>, yet it allows storing hierarchical,
strongly-typed, complex binary data and is easily extensible.
A UBJSON/BJData parser is not necessarily longer than an npy parser;
for example, the Python reader for the full spec takes only about 500
lines of code (including comments), and similarly for a JS parser:
https://github.com/NeuroJSON/pybj/blob/master/bjdata/decoder.py
https://github.com/NeuroJSON/js-bjdata/blob/master/bjdata.js
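As a minimal sketch (assuming the dumpb/loadb entry points inherited
from py-ubjson, from which pybj was derived; check the repository above
for the exact API), a round trip looks like

    import bjdata as bj

    obj = {'name': 'mesh',
           'node': [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]}
    blob = bj.dumpb(obj)       # encode a Python dict into a BJData byte string
    decoded = bj.loadb(blob)   # decode the byte string back into Python objects
    print(decoded['name'], len(decoded['node']))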
We actually did a benchmark
<https://github.com/neurolabusc/MeshFormatsJS> a few months back - the
test workloads were two large 2D numerical arrays (node and face arrays
storing surface mesh data), and we compared the parsing speed of
various formats in Python, MATLAB, and JS. The uncompressed BJData
(BMSHraw) loads nearly as fast as reading a raw binary dump, and the
internally compressed BJData (BMSHz) gives the best balance between
small file size and loading speed; see our results here
https://pbs.twimg.com/media/FRPEdLGWYAEJe80?format=png&name=large
I want to add two quick points to echo the features you desired in npy:
1. it is not common to use mmap in reading JSON/binary JSON files, but
it is certainly possible. I recently wrote a JSON-mmap spec
<https://github.com/NeuroJSON/jsonmmap/blob/main/JSON-Mmap_Specification.md>
and a MATLAB reference implementation
<https://github.com/NeuroJSON/jsonmmap/tree/main/lib>
2. UBJSON/BJData natively supports appendable root-level records; JSON
has been used extensively in data streaming via appendable ND-JSON or
concatenated JSON (https://en.wikipedia.org/wiki/JSON_streaming)
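For plain-text JSON, record-per-line appending needs nothing beyond the
standard library; a minimal sketch (file name made up for illustration):

    import json

    # append one record per line (ND-JSON / concatenated-JSON style)
    with open('stream.jsonl', 'a') as f:
        f.write(json.dumps({'frame': 42, 'mean': 0.37}) + '\n')

    # read the records back one by one
    with open('stream.jsonl') as f:
        records = [json.loads(line) for line in f]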
Just a quick comparison of output file sizes with a 1000x1000
unit-diagonal (identity) matrix:
    # python3 -m pip install jdata bjdata
    import numpy as np
    import jdata as jd

    x = np.eye(1000)               # create a large array
    y = np.vsplit(x, 5)            # split into smaller chunks
    np.save('eye5chunk.npy', y)    # save npy
    jd.save(y, 'eye5chunk_bjd_raw.jdb')                            # save as uncompressed bjd
    jd.save(y, 'eye5chunk_bjd_zlib.jdb', {'compression': 'zlib'})  # zlib-compressed bjd
    jd.save(y, 'eye5chunk_bjd_lzma.jdb', {'compression': 'lzma'})  # lzma-compressed bjd
    newy = jd.load('eye5chunk_bjd_zlib.jdb')   # loading/decoding
    newx = np.concatenate(newy)                # regroup chunks
    newx.dtype
Here are the output file sizes in bytes:

    8000128  eye5chunk.npy
    5004297  eye5chunk_bjd_raw.jdb
      10338  eye5chunk_bjd_zlib.jdb
       2206  eye5chunk_bjd_lzma.jdb
Qianqian
On 8/24/22 15:48, Michael Siebert wrote:
Hi Matti, hi all,
@Matti: I don’t know exactly what you are referring to (the pull
request or the GitHub project; links below). Maybe some clarification
is needed, which I hereby try to provide ;)
A .npy file created by some appending process is a regular .npy file
and does not need to be read in chunks. Processing arrays larger than
the system’s memory can already be done with memory mapping
(numpy.load(… mmap_mode=...)), so no third-party support is needed for
that.
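For reference, a minimal sketch of that workflow (file name made up for
illustration):

    import numpy as np

    # memory-map an existing .npy file; pages are loaded lazily on access
    arr = np.load('images.npy', mmap_mode='r')
    batch = arr[1000:1032]   # touches only the pages backing this slice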
The idea is not necessarily to only write some known-but-fragmented
content to a .npy file in chunks or to only handle files larger than
the RAM.
It is more about the ability to append to a .npy file at any time and
between program runs. For example, in our case, we have a large
database-like file containing all (preprocessed) images of all videos
used to train a neural network. When new video data arrives, it can
simply be appended to the existing .npy file. When training the neural
net, the data is simply memory mapped, which happens basically
instantly and does not consume extra memory across multiple training
processes. We have tried out various fancy, advanced data formats for
this task, but most of them don’t provide the memory-mapping feature,
which is very handy for keeping the time required to test a code change
comfortably low; instead, they have excessive parse/decompress times.
Other libraries can also be difficult to handle, see below.
The .npy array format is designed to be limited. There is a NEP for
it, which summarizes the .npy features and concepts very well:
https://numpy.org/neps/nep-0001-npy-format.html
One of my favorite features (besides memory mapping perhaps) is this one:
“… Be reverse engineered. Datasets often live longer than the programs
that created them. A competent developer should be able to create a
solution in his preferred programming language to read most NPY files
that he has been given without much documentation. ..."
This is a big disadvantage with all the fancy formats out there: they
require dedicated libraries. Some of these libraries don’t come with
nice and free documentation (especially lacking
easy-to-use/easy-to-understand code examples for the target language,
e.g. C) and/or can be extremely complex, like HDF5. Yes, HDF5 has its
users and is totally valid if one operates the world’s largest
particle accelerator, but we have spent weeks finding a C/C++ library
for it that does not expose bugs and is somewhat documented. We
actually failed and posted a bug which was fixed a year or so later.
This can ruin entire projects - fortunately not ours, but it ate up a
lot of time we could have spent more meaningfully. On the other
hand, I don’t see how e.g. zarr provides added value over .npy if one
only needs the .npy features and maybe some append-data-along-one-axis
feature. Yes, maybe there are some uses for two or three appendable
axes, but I think having one axis to append to should cover a lot of
use cases: this axis is typically time: video, audio, GPS, signal data
in general, binary log data, "binary CSV" (lines in file): all of
those only need one axis to append to.
The .npy format is so simple that it can be read, e.g., in C in a few
lines, or accessed easily through Numpy and ctypes via pointers for
high-speed custom logic - not even requiring libraries besides Numpy.
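As a rough illustration of that simplicity, here is a minimal sketch of
a version 1.0 .npy reader in Python (no error handling; np.load is of
course the proper tool):

    import ast
    import numpy as np

    def read_npy_v1(path):
        # minimal .npy (version 1.0) reader: magic, version, header dict, raw data
        with open(path, 'rb') as f:
            assert f.read(6) == b'\x93NUMPY'   # magic string
            major, minor = f.read(2)           # format version
            header_len = int.from_bytes(f.read(2), 'little')
            header = ast.literal_eval(f.read(header_len).decode('latin1'))
            data = np.frombuffer(f.read(), dtype=np.dtype(header['descr']))
        order = 'F' if header['fortran_order'] else 'C'
        return data.reshape(header['shape'], order=order)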
Making .npy appendable is easy to implement. Yes, appending along one
axis is as limited as the .npy format itself, but I consider that to be
a feature rather than an (actual) limitation, as it allows for fast and
simple appends.
The question is if there is some support for an
append-to-.npy-files-along-one-axis feature in the Numpy community and
if so, about the details of the actual implementation. I made one
suggestion in
https://github.com/numpy/numpy/pull/20321/
and I offer to invest time to update/modify/finalize the PR. I’ve also
created a library that can already append to .npy:
https://github.com/xor2k/npy-append-array
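A minimal usage sketch (the exact API may differ from the current
README, so treat this as illustrative):

    import numpy as np
    from npy_append_array import NpyAppendArray

    with NpyAppendArray('out.npy') as npaa:
        npaa.append(np.zeros((100, 64)))   # creates the file on first append
        npaa.append(np.ones((50, 64)))     # grows it along axis 0

    data = np.load('out.npy', mmap_mode='r')   # shape (150, 64)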
However, due to current limitations of the .npy format, the code is
more complex than it needs to be (the library initializes and checks
spare space in the header) and it has to rewrite the header on every
append. Both could be made unnecessary with a very small addition to
the .npy file format. Data would stay contiguous (no fragmentation!);
there would just need to be a way to indicate that the actual shape of
the array should be derived from the file size.
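To spell out the idea with a rough sketch (my own illustration, not the
actual PR implementation): if the header marked the first axis as
"determined by the file size", a reader could recover the full shape
along these lines:

    import numpy as np

    def infer_shape(file_size, data_offset, dtype, trailing_shape):
        # length of the appendable first axis implied by the file size
        bytes_per_row = np.dtype(dtype).itemsize * int(np.prod(trailing_shape))
        n_rows = (file_size - data_offset) // bytes_per_row
        return (n_rows, *trailing_shape)

    # e.g. a float64 array with trailing shape (1000,) and a 128-byte header
    print(infer_shape(8000128, 128, 'float64', (1000,)))   # -> (1000, 1000)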
Best, Michael
On 24. Aug 2022, at 19:16, Matti Picus <matti.pi...@gmail.com> wrote:
Sorry for the late reply. Adding a new "*.npy" format feature to
allow writing to the file in chunks is nice but seems a bit limited.
As I understand the proposal, reading the file back can only be done
in the chunks that were originally written. I think other libraries
like zarr or h5py have solved this problem in a more flexible way. Is
there a reason you cannot use a third-party library to solve this? I
would think if you have an array too large to write in one chunk you
will need third-party support to process it anyway.
Matti