On Thu, Aug 25, 2022 at 10:45 AM Qianqian Fang <fan...@gmail.com> wrote:
> I am curious what you and other developers think about adopting
> JSON/binary JSON as a similarly simple, reverse-engineering-able but
> universally parsable array exchange format instead of designing another
> numpy-specific binary format.
>

No one is really proposing another format, just a minor tweak to the
existing NPY format. If you are proposing that numpy adopt BJData into
numpy to underlie `np.save()`, we are not very likely to, for a number of
reasons. However, if you are addressing the wider community to advertise
your work, by all means!

> I am interested in this topic (as well as thoughts among numpy
> developers) because I am currently working on a project - NeuroJSON
> (https://neurojson.org) - funded by the US National Institutes of Health.
> The goal of the NeuroJSON project is to create easy-to-adopt,
> easy-to-extend, and preferably human-readable data formats to help
> disseminate and exchange neuroimaging data (and scientific data in
> general).
>
> Needless to say, numpy is a key toolkit that is widely used among
> neuroimaging data analysis pipelines. I've seen discussions of
> potentially adopting npy as a standardized way to share volumetric data
> (as ndarrays), such as in this thread
>
> https://github.com/bids-standard/bids-specification/issues/197
>
> however, several limitations were also discussed, for example
>
> 1. npy only supports a single numpy array and does not support other
> metadata or more complex data records (multiple arrays are only achieved
> via multiple files)
> 2. no internal (i.e. data-level) compression, only file-level compression
> 3. although the file is simple, it still requires a parser to read/write,
> and such a parser is not widely available in other environments, making
> it mostly limited to exchanging data among python programs
> 4. I am not entirely sure, but I suppose it does not support sparse
> matrices or special matrices (such as diagonal/band/symmetric etc) - I
> could be wrong though
>
> In the NeuroJSON project, we primarily use JSON and binary JSON
> (specifically, the UBJSON <https://ubjson.org/> derived BJData
> <https://json.nlohmann.me/features/binary_formats/bjdata/> format) as the
> underlying data exchange files. Through standardized data annotations
> <https://github.com/NeuroJSON/jdata/blob/master/JData_specification.md#data-annotation-keywords>,
> we are able to address most of the above limitations - the generated
> files are universally parsable in nearly all programming environments
> with existing parsers, support complex hierarchical data and compression,
> and can readily benefit from the large ecosystem of JSON (JSON-schema,
> JSONPath, JSON-LD, jq, numerous parsers, web-ready, NoSQL db ...).
>

I don't quite know what this means. My installed version of `jq`, for
example, doesn't seem to know what to do with these files.

❯ jq --version
jq-1.6
❯ jq . eye5chunk_bjd_raw.jdb
parse error: Invalid numeric literal at line 1, column 38

> I understand that simplicity is a key design spec here. I want to
> highlight UBJSON/BJData as a competitive alternative format. It is also
> designed with simplicity as a primary consideration
> <https://ubjson.org/#why>, yet it allows storing hierarchical,
> strongly-typed, complex binary data and is easily extensible.
>
> A UBJSON/BJData parser is not necessarily longer than an npy parser; for
> example, the python reader for the full spec takes only about 500 lines
> of code (including comments), and similarly for a JS parser
>
> https://github.com/NeuroJSON/pybj/blob/master/bjdata/decoder.py
> https://github.com/NeuroJSON/js-bjdata/blob/master/bjdata.js
>
> We actually did a benchmark <https://github.com/neurolabusc/MeshFormatsJS>
> a few months back - the test workloads are two large 2D numerical arrays
> (node and face, storing surface mesh data), and we compared the parsing
> speed of various formats in Python, MATLAB, and JS. The uncompressed
> BJData (BMSHraw) reported a loading speed nearly as fast as reading a raw
> binary dump, and the internally compressed BJData (BMSHz) gives the best
> balance between small file size and loading speed; see our results here
>
> https://pbs.twimg.com/media/FRPEdLGWYAEJe80?format=png&name=large
>
> I want to add two quick points to echo the features you desired in npy:
>
> 1. it is not common to use mmap when reading JSON/binary JSON files, but
> it is certainly possible. I recently wrote a JSON-mmap spec
> <https://github.com/NeuroJSON/jsonmmap/blob/main/JSON-Mmap_Specification.md>
> and a MATLAB reference implementation
> <https://github.com/NeuroJSON/jsonmmap/tree/main/lib>
>

I think a fundamental problem here is that it looks like each element in
the array is delimited. I.e. a `float64` value starts with b'D' followed
by the 8 IEEE-754 bytes representing the number. When we're talking about
memory-mappability, we are talking about having the on-disk representation
be exactly what it looks like in memory, all of the IEEE-754 floats
contiguous with each other, so we can use the `np.memmap` `ndarray`
subclass to represent the on-disk data as a first-class array object (see
the short sketch at the end of this message). This spec lets us mmap the
binary JSON file and manipulate its contents in-place efficiently, but
that's not what is being asked for here.

> 2. UBJSON/BJData natively supports append-able root-level records; JSON
> has been extensively used in data streaming with appendable nd-json or
> concatenated JSON (https://en.wikipedia.org/wiki/JSON_streaming)
>
> just a quick comparison of output file sizes with a 1000x1000 unitary
> diagonal matrix
>
> # python3 -m pip install jdata bjdata
> import numpy as np
> import jdata as jd
> x = np.eye(1000);                      # create a large array
> y = np.vsplit(x, 5);                   # split into smaller chunks
> np.save('eye5chunk.npy', y);           # save npy
> jd.save(y, 'eye5chunk_bjd_raw.jdb');   # save as uncompressed bjd
> jd.save(y, 'eye5chunk_bjd_zlib.jdb', {'compression':'zlib'});  # zlib-compressed bjd
> jd.save(y, 'eye5chunk_bjd_lzma.jdb', {'compression':'lzma'});  # lzma-compressed bjd
> newy = jd.load('eye5chunk_bjd_zlib.jdb');  # loading/decoding
> newx = np.concatenate(newy);           # regroup chunks
> newx.dtype
>
> here are the output file sizes in bytes:
>
> 8000128  eye5chunk.npy
> 5004297  eye5chunk_bjd_raw.jdb
>

Just a note: this difference is solely due to a special representation of
`0` in 5 bytes rather than 8 (essentially, your encoder recognizes 0.0 as
a special value and uses the `float32` encoding of it). If you had any
other value making up the bulk of the file, this would be larger than the
NPY due to the additional delimiter b'D' (a rough size accounting is
sketched at the end of this message).

>   10338  eye5chunk_bjd_zlib.jdb
>    2206  eye5chunk_bjd_lzma.jdb
>
> Qianqian
>

--
Robert Kern
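
A minimal sketch of the memory-mapping behaviour described above, assuming
a file written by `np.save()`; the `demo.npy` name and the array contents
are only illustrative. Because NPY lays the element bytes out contiguously
after a small header, NumPy can expose the file as an `np.memmap` view
without copying or parsing individual elements.

import numpy as np

a = np.arange(1_000_000, dtype=np.float64)
np.save('demo.npy', a)                  # small fixed header + contiguous float64 bytes

m = np.load('demo.npy', mmap_mode='r')  # returns an np.memmap ndarray subclass
print(type(m))                          # <class 'numpy.memmap'>
print(m.dtype, m.shape, m[123:126])     # slices are served directly from the mapped file

A per-element-delimited encoding (a marker byte before each value, as in
the raw BJData output above) cannot be viewed this way without first
translating the data into a contiguous buffer.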
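
A rough size accounting for the figures quoted above, following the note
about the 5-byte zero encoding; the overhead line is simply whatever
remains after the element bytes, not an exact breakdown of the encoder's
output.

n = 1000 * 1000                    # float64 elements in the 1000x1000 identity matrix

npy       = 128 + n * 8            # NPY: header (128 bytes here) + contiguous float64 bytes -> 8,000,128
bjd_zero  = n * 5                  # raw BJData: 1 marker byte + float32-encoded 0.0 per element -> 5,000,000
overhead  = 5_004_297 - bjd_zero   # ~4,297 bytes of surrounding container structure
bjd_other = n * 9                  # non-zero values: 1 marker byte + 8 float64 bytes -> 9,000,000, larger than NPY

print(npy, bjd_zero + overhead, bjd_other)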