To avoid derailing the other thread
<https://mail.python.org/archives/list/numpy-discussion@python.org/thread/A4CJ2DZCAKPMD2MYGVMDV5UI7FN4SBVI/>
on extending .npy files, I am starting a new thread on alternative
array storage file formats using binary JSON, in case there is such a
need and interest among numpy users.

Specifically, I want to first follow up on Bill's question below
regarding loading time.
On 8/25/22 11:02, Bill Ross wrote:
> Can you give load times for these?
As I mentioned in the earlier reply to Robert, the most
memory-efficient (i.e. fast-loading, disk-mmap-able) but not
necessarily disk-efficient (i.e. it may result in the largest data
file sizes) construct for storing an ND array is BJData's ND-array
container.

I have to admit that both the jdata and bjdata modules have not been
extensively optimized for speed. With the current implementation, here
are the loading times for a larger diagonal matrix, eye(10000).
A BJData file storing a single eye(10000) array using the ND-array
container can be downloaded from here
<http://neurojson.org/wiki/upload/eye1e4_bjd_raw_ndsyntax.jdb.zip>
(file size: 1 MB zipped; ~800 MB decompressed, the same as the npy
file). This file was generated by a MATLAB encoder, but it can be
loaded using Python (see below, Re: Robert; a minimal loading sketch
also follows the size listing below).
800000128  eye1e4.npy
800000014  eye1e4_bjd_raw_ndsyntax.jdb
   813721  eye1e4_bjd_zlib.jdb
   113067  eye1e4_bjd_lzma.jdb
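For completeness, here is a minimal loading sketch for the downloaded
file, assuming the pybj bjdata module's stream API (bjdata.load,
mirroring the py-ubjson package it was forked from):

    import bjdata

    # read one BJData record from the stream; the decoder maps the
    # ND-array container to a numpy.ndarray
    with open('eye1e4_bjd_raw_ndsyntax.jdb', 'rb') as fp:
        arr = bjdata.load(fp)

    print(type(arr), arr.shape, arr.dtype)  # expect a 10000 x 10000 array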
The loading times (from an NVMe drive; Ubuntu 18.04, Python 3.6.9,
NumPy 1.19.5) for each file are listed below:
0.179s  eye1e4.npy (mmap_mode=None)
0.001s  eye1e4.npy (mmap_mode=r)
0.718s  eye1e4_bjd_raw_ndsyntax.jdb
1.474s  eye1e4_bjd_zlib.jdb
0.635s  eye1e4_bjd_lzma.jdb
Clearly, and unsurprisingly, mmapped loading is the fastest option. It
is true that raw BJData loading is about 4x slower than npy loading
(0.718 s vs 0.179 s), but given that the main chunk of the data is
stored identically (as a contiguous buffer), I suppose that with some
optimization of the decoder the gap between the two can be
substantially narrowed. The longer loading times for zlib/lzma (and
similarly the saving times) reflect the trade-off between smaller file
sizes and the time spent on compression/decompression/disk IO.
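For reference, a sketch of how these timings can be reproduced; the
file names match the listings above, and the bjdata.load call assumes
pybj's stream API:

    import time
    import numpy as np
    import bjdata

    def timed(loader, *args, **kwargs):
        # time a single load call with a monotonic clock
        t0 = time.perf_counter()
        data = loader(*args, **kwargs)
        return data, time.perf_counter() - t0

    arr, t_full = timed(np.load, 'eye1e4.npy')                 # full read
    arr, t_mmap = timed(np.load, 'eye1e4.npy', mmap_mode='r')  # lazy, mmapped
    with open('eye1e4_bjd_raw_ndsyntax.jdb', 'rb') as fp:
        arr, t_bjd = timed(bjdata.load, fp)
    print(t_full, t_mmap, t_bjd)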
By no means am I saying that binary JSON, with its current
non-optimized implementations, is ready to deliver better speed. I
just want to bring attention to this class of formats, and highlight
that its flexibility gives abundant mechanisms to create fast,
disk-mapped IO, while allowing additional benefits such as
compression and unlimited metadata for future extensions.
>  8000128 eye5chunk.npy
>  5004297 eye5chunk_bjd_raw.jdb
>    10338 eye5chunk_bjd_zlib.jdb
>     2206 eye5chunk_bjd_lzma.jdb
>
> For my case, I'd be curious about the time to add one 1T-entries file
> to another.
As I mentioned in the previous reply, BJData is appendable
<https://github.com/NeuroJSON/bjdata/blob/master/images/BJData_Diagram.pdf>,
so you can simply append another array (or a slice of one) to the end
of the file.
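A minimal appending sketch; it assumes (per the encoder.py link quoted
further down) that pybj's encoder accepts numpy arrays, and relies on
BJData permitting concatenated root-level records:

    import numpy as np
    import bjdata

    # appending is a plain 'ab' write, since a BJData file is just a
    # sequence of root-level records
    chunk = np.eye(1000)
    with open('eye1e4_bjd_raw_ndsyntax.jdb', 'ab') as fp:
        bjdata.dump(chunk, fp)

    # sequential read-back: each load() consumes one record
    with open('eye1e4_bjd_raw_ndsyntax.jdb', 'rb') as fp:
        first = bjdata.load(fp)   # the original eye(10000) array
        second = bjdata.load(fp)  # the appended chunk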
> Thanks,
> Bill
Also related, Re: @Robert's question below:
> Are any of them supported by a Python BJData implementation? I didn't
> see any option to get that done in the `bjdata` package you
> recommended, for example.
>
> https://github.com/NeuroJSON/pybj/blob/a46355a0b0df0bec1817b04368a5a573358645ef/bjdata/encoder.py#L200
The bjdata module currently supports the ND-array container only in
the decoder
<https://github.com/NeuroJSON/pybj/blob/a46355a0b0df0bec1817b04368a5a573358645ef/bjdata/decoder.py#L360-L365>
(i.e. it maps such a buffer to a numpy.ndarray); it should be
relatively trivial to add this to the encoder, though.
On the other hand, the annotated format is currently supported: one
can call the jdata module (responsible for annotation-level
encoding/decoding), as shown in my sample code, which then calls
bjdata internally for the data serialization.
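Here is a minimal sketch of that layering, with jdata handling the
annotation level and bjdata the binary serialization; the
{'compression': 'zlib'} option follows the pyjdata examples, so treat
the exact keyword as an assumption:

    import numpy as np
    import bjdata
    import jdata as jd

    x = np.eye(10000)

    # annotation level: wrap the array in JData tags such as
    # _ArrayType_/_ArraySize_, compressing the payload when requested
    ann = jd.encode(x, {'compression': 'zlib'})

    # serialization level: write the annotated object as BJData
    with open('eye1e4_bjd_zlib.jdb', 'wb') as fp:
        bjdata.dump(ann, fp)

    # decode the annotations to recover the numpy.ndarray
    with open('eye1e4_bjd_zlib.jdb', 'rb') as fp:
        y = jd.decode(bjdata.load(fp))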
> Okay. Given your wording, it looked like you were claiming that the
> binary JSON was supported by the whole ecosystem. Rather, it seems
> like you can either get binary encoding OR the ecosystem support, but
> not both at the same time.
All in relative terms, of course: JSON has ~100 parsers listed on its
website <https://www.json.org/json-en.html>, MessagePack (another
flavor of binary JSON) lists <https://msgpack.org/index.html> ~50-60
parsers, and UBJSON lists <https://ubjson.org/libraries/> ~20 parsers.
I am not familiar with npy parsers, but googling returns only a few.

Also, most binary JSON implementations provide tools to convert to
JSON and back, so, in that sense, whatever JSON has in its ecosystem
can "potentially" be used for binary JSON files if one wants to (see
the round-trip sketch below). There are also recent publications
comparing the differences between various binary JSON formats, in case
anyone is interested:
https://github.com/ubjson/universal-binary-json/issues/115
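As an illustration of that round trip, here is a sketch using
pyjdata's save/load; the suffix-based dispatch between .jdb (binary)
and .json (text) follows the pyjdata examples and is an assumption on
my part:

    import jdata as jd

    data = jd.load('eye1e4_bjd_zlib.jdb')        # binary JSON (BJData) in
    jd.save(data, 'eye1e4_bjd_zlib.json')        # text JSON out, usable by any JSON tool
    roundtrip = jd.load('eye1e4_bjd_zlib.json')  # and back again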