On Thu, Aug 25, 2022 at 10:45 AM Qianqian Fang <fan...@gmail.com> wrote:
> I am curious what you and other developers think about adopting
> JSON/binary JSON as a similarly simple, reverse-engineering-able but
> universally parsable array exchange format instead of designing another
> numpy-specific binary format.
>

No one is really proposing another format, just a minor tweak to the
existing NPY format. If you are proposing that numpy adopt BJData into
numpy to underlie `np.save()`, we are not very likely to, for a number of
reasons. However, if you are addressing the wider community to advertise
your work, by all means!

> I am interested in this topic (as well as thoughts among numpy
> developers) because I am currently working on a project - NeuroJSON
> (https://neurojson.org) - funded by the US National Institutes of Health.
> The goal of the NeuroJSON project is to create easy-to-adopt,
> easy-to-extend, and preferably human-readable data formats to help
> disseminate and exchange neuroimaging data (and scientific data in
> general).
>
> Needless to say, numpy is a key toolkit that is widely used among
> neuroimaging data analysis pipelines. I've seen discussions of
> potentially adopting npy as a standardized way to share volumetric data
> (as ndarrays), such as in this thread
>
> https://github.com/bids-standard/bids-specification/issues/197
>
> however, several limitations were also discussed, for example
>
> 1. npy only supports a single numpy array and does not support other
> metadata or more complex data records (multiple arrays are only achieved
> via multiple files)
> 2. no internal (i.e. data-level) compression, only file-level compression
> 3. although the file is simple, it still requires a parser to read/write,
> and such a parser is not widely available in other environments, making
> it mostly limited to exchanging data among python programs
> 4. I am not entirely sure, but I suppose it does not support sparse
> matrices or special matrices (such as diagonal/band/symmetric etc) - I
> could be wrong though
>
> In the NeuroJSON project, we primarily use JSON and binary JSON
> (specifically, the UBJSON <https://ubjson.org/> derived BJData
> <https://json.nlohmann.me/features/binary_formats/bjdata/> format) as the
> underlying data exchange files. Through standardized data annotations
> <https://github.com/NeuroJSON/jdata/blob/master/JData_specification.md#data-annotation-keywords>,
> we are able to address most of the above limitations - the generated
> files are universally parsable in nearly all programming environments
> with existing parsers, support complex hierarchical data and compression,
> and can readily benefit from the large ecosystem of JSON (JSON-schema,
> JSONPath, JSON-LD, jq, numerous parsers, web-ready, NoSQL db ...).
>

I don't quite know what this means. My installed version of `jq`, for
example, doesn't seem to know what to do with these files.

❯ jq --version
jq-1.6
❯ jq . eye5chunk_bjd_raw.jdb
parse error: Invalid numeric literal at line 1, column 38

> I understand that simplicity is a key design spec here. I want to
> highlight UBJSON/BJData as a competitive alternative format. It is also
> designed with simplicity as a primary consideration
> <https://ubjson.org/#why>, yet it allows storing hierarchical,
> strongly-typed, complex binary data and is easily extensible.
>
> A UBJSON/BJData parser is not necessarily longer than an npy parser; for
> example, the python reader for the full spec takes only about 500 lines
> of code (including comments), and similarly for a JS parser
>
> https://github.com/NeuroJSON/pybj/blob/master/bjdata/decoder.py
> https://github.com/NeuroJSON/js-bjdata/blob/master/bjdata.js
>
> We actually did a benchmark <https://github.com/neurolabusc/MeshFormatsJS>
> a few months back - the test workloads are two large 2D numerical arrays
> (node and face, storing surface mesh data), and we compared the parsing
> speed of various formats in Python, MATLAB, and JS. The uncompressed
> BJData (BMSHraw) reported a loading speed nearly as fast as reading a raw
> binary dump, and the internally compressed BJData (BMSHz) gives the best
> balance between small file size and loading speed; see our results here
>
> https://pbs.twimg.com/media/FRPEdLGWYAEJe80?format=png&name=large
>
> I want to add two quick points to echo the features you desired in npy:
>
> 1. it is not common to use mmap when reading JSON/binary JSON files, but
> it is certainly possible. I recently wrote a JSON-mmap spec
> <https://github.com/NeuroJSON/jsonmmap/blob/main/JSON-Mmap_Specification.md>
> and a MATLAB reference implementation
> <https://github.com/NeuroJSON/jsonmmap/tree/main/lib>
>

I think a fundamental problem here is that it looks like each element in
the array is delimited. I.e. a `float64` value starts with b'D' followed
by the 8 IEEE-754 bytes representing the number. When we're talking about
memory-mappability, we are talking about having the on-disk representation
be exactly what it looks like in memory, all of the IEEE-754 floats
contiguous with each other, so we can use the `np.memmap` `ndarray`
subclass to represent the on-disk data as a first-class array object (see
the short sketch at the end of this message). This spec lets us mmap the
binary JSON file and manipulate its contents in-place efficiently, but
that's not what is being asked for here.

> 2. UBJSON/BJData natively supports append-able root-level records; JSON
> has been extensively used in data streaming with appendable nd-json or
> concatenated JSON (https://en.wikipedia.org/wiki/JSON_streaming)
>
> just a quick comparison of output file sizes with a 1000x1000 unitary
> diagonal matrix
>
> # python3 -m pip install jdata bjdata
> import numpy as np
> import jdata as jd
> x = np.eye(1000);                      # create a large array
> y = np.vsplit(x, 5);                   # split into smaller chunks
> np.save('eye5chunk.npy', y);           # save npy
> jd.save(y, 'eye5chunk_bjd_raw.jdb');   # save as uncompressed bjd
> jd.save(y, 'eye5chunk_bjd_zlib.jdb', {'compression':'zlib'});  # zlib-compressed bjd
> jd.save(y, 'eye5chunk_bjd_lzma.jdb', {'compression':'lzma'});  # lzma-compressed bjd
> newy = jd.load('eye5chunk_bjd_zlib.jdb');  # loading/decoding
> newx = np.concatenate(newy);           # regroup chunks
> newx.dtype
>
> here are the output file sizes in bytes:
>
> 8000128  eye5chunk.npy
> 5004297  eye5chunk_bjd_raw.jdb
>

Just a note: this difference is solely due to a special representation of
`0` in 5 bytes rather than 8 (essentially, your encoder recognizes 0.0 as
a special value and uses the `float32` encoding of it). If you had any
other value making up the bulk of the file, this would be larger than the
NPY due to the additional delimiter b'D' (a rough size accounting is
sketched at the end of this message).

>   10338  eye5chunk_bjd_zlib.jdb
>    2206  eye5chunk_bjd_lzma.jdb
>
> Qianqian
>

--
Robert Kern
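
A minimal sketch of the memory-mapping behaviour described above, assuming
a file written by `np.save()`; the `demo.npy` name and the array contents
are only illustrative. Because NPY lays the element bytes out contiguously
after a small header, NumPy can expose the file as an `np.memmap` view
without copying or parsing individual elements.

import numpy as np

a = np.arange(1_000_000, dtype=np.float64)
np.save('demo.npy', a)                  # small fixed header + contiguous float64 bytes

m = np.load('demo.npy', mmap_mode='r')  # returns an np.memmap ndarray subclass
print(type(m))                          # <class 'numpy.memmap'>
print(m.dtype, m.shape, m[123:126])     # slices are served directly from the mapped file

A per-element-delimited encoding (a marker byte before each value, as in
the raw BJData output above) cannot be viewed this way without first
translating the data into a contiguous buffer.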
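
A rough size accounting for the figures quoted above, following the note
about the 5-byte zero encoding; the overhead line is simply whatever
remains after the element bytes, not an exact breakdown of the encoder's
output.

n = 1000 * 1000                    # float64 elements in the 1000x1000 identity matrix

npy       = 128 + n * 8            # NPY: header (128 bytes here) + contiguous float64 bytes -> 8,000,128
bjd_zero  = n * 5                  # raw BJData: 1 marker byte + float32-encoded 0.0 per element -> 5,000,000
overhead  = 5_004_297 - bjd_zero   # ~4,297 bytes of surrounding container structure
bjd_other = n * 9                  # non-zero values: 1 marker byte + 8 float64 bytes -> 9,000,000, larger than NPY

print(npy, bjd_zero + overhead, bjd_other)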