To avoid derailing the other thread
<https://mail.python.org/archives/list/numpy-discussion@python.org/thread/A4CJ2DZCAKPMD2MYGVMDV5UI7FN4SBVI/>
on extending .npy files, I am starting a new thread on alternative
array storage file formats using binary JSON, in case there is such a
need and interest among numpy users.

Specifically, I want to first follow up on Bill's question below
regarding loading time.
On 8/25/22 11:02, Bill Ross wrote:
> Can you give load times for these?
As I mentioned in the earlier reply to Robert, the most
memory-efficient (i.e. fast-loading, disk-mmap-able) but not
necessarily disk-efficient (i.e. it may result in the largest data
file sizes) construct for storing an ND array is BJData's ND-array
container.

I have to admit that both the jdata and bjdata modules have not been
extensively optimized for speed. With the current implementation, here
are the loading times for a larger diagonal matrix, eye(10000).
A BJData file storing a single eye(10000) array using the ND-array
container can be downloaded from here
<http://neurojson.org/wiki/upload/eye1e4_bjd_raw_ndsyntax.jdb.zip>
(file size: 1 MB zipped; ~800 MB decompressed, the same as the npy
file). This file was generated by a MATLAB encoder, but it can be
loaded using Python (see below, Re: Robert; a minimal loading sketch
also follows the size listing below).
800000128  eye1e4.npy
800000014  eye1e4_bjd_raw_ndsyntax.jdb
   813721  eye1e4_bjd_zlib.jdb
   113067  eye1e4_bjd_lzma.jdb
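For completeness, here is a minimal loading sketch for the downloaded
file, assuming the pybj bjdata module's stream API (bjdata.load,
mirroring the py-ubjson package it was forked from):

    import bjdata

    # read one BJData record from the stream; the decoder maps the
    # ND-array container to a numpy.ndarray
    with open('eye1e4_bjd_raw_ndsyntax.jdb', 'rb') as fp:
        arr = bjdata.load(fp)

    print(type(arr), arr.shape, arr.dtype)  # expect a 10000 x 10000 array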
The loading times (from an NVMe drive; Ubuntu 18.04, Python 3.6.9,
NumPy 1.19.5) for each file are listed below:
0.179s  eye1e4.npy (mmap_mode=None)
0.001s  eye1e4.npy (mmap_mode=r)
0.718s  eye1e4_bjd_raw_ndsyntax.jdb
1.474s  eye1e4_bjd_zlib.jdb
0.635s  eye1e4_bjd_lzma.jdb
Clearly, and unsurprisingly, mmapped loading is the fastest option. It
is true that raw BJData loading is about 4x slower than npy loading
(0.718 s vs 0.179 s), but given that the main chunk of the data is
stored identically (as a contiguous buffer), I suppose that with some
optimization of the decoder the gap between the two can be
substantially narrowed. The longer loading times for zlib/lzma (and
similarly the saving times) reflect the trade-off between smaller file
sizes and the time spent on compression/decompression/disk IO.
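For reference, a sketch of how these timings can be reproduced; the
file names match the listings above, and the bjdata.load call assumes
pybj's stream API:

    import time
    import numpy as np
    import bjdata

    def timed(loader, *args, **kwargs):
        # time a single load call with a monotonic clock
        t0 = time.perf_counter()
        data = loader(*args, **kwargs)
        return data, time.perf_counter() - t0

    arr, t_full = timed(np.load, 'eye1e4.npy')                 # full read
    arr, t_mmap = timed(np.load, 'eye1e4.npy', mmap_mode='r')  # lazy, mmapped
    with open('eye1e4_bjd_raw_ndsyntax.jdb', 'rb') as fp:
        arr, t_bjd = timed(bjdata.load, fp)
    print(t_full, t_mmap, t_bjd)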
By no means am I saying that binary JSON, with its current
non-optimized implementations, is ready to deliver better speed. I
just want to bring attention to this class of formats, and highlight
that its flexibility gives abundant mechanisms to create fast,
disk-mapped IO, while allowing additional benefits such as
compression and unlimited metadata for future extensions.
>  8000128 eye5chunk.npy
>  5004297 eye5chunk_bjd_raw.jdb
>    10338 eye5chunk_bjd_zlib.jdb
>     2206 eye5chunk_bjd_lzma.jdb
>
> For my case, I'd be curious about the time to add one 1T-entries file
> to another.
As I mentioned in the previous reply, BJData is appendable
<https://github.com/NeuroJSON/bjdata/blob/master/images/BJData_Diagram.pdf>,
so you can simply append another array (or a slice of one) to the end
of the file.
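A minimal appending sketch; it assumes (per the encoder.py link quoted
further down) that pybj's encoder accepts numpy arrays, and relies on
BJData permitting concatenated root-level records:

    import numpy as np
    import bjdata

    # appending is a plain 'ab' write, since a BJData file is just a
    # sequence of root-level records
    chunk = np.eye(1000)
    with open('eye1e4_bjd_raw_ndsyntax.jdb', 'ab') as fp:
        bjdata.dump(chunk, fp)

    # sequential read-back: each load() consumes one record
    with open('eye1e4_bjd_raw_ndsyntax.jdb', 'rb') as fp:
        first = bjdata.load(fp)   # the original eye(10000) array
        second = bjdata.load(fp)  # the appended chunk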
> Thanks,
> Bill
Also related, Re: @Robert's question below:
> Are any of them supported by a Python BJData implementation? I didn't
> see any option to get that done in the `bjdata` package you
> recommended, for example.
>
> https://github.com/NeuroJSON/pybj/blob/a46355a0b0df0bec1817b04368a5a573358645ef/bjdata/encoder.py#L200
The bjdata module currently supports the ND-array container only in
the decoder
<https://github.com/NeuroJSON/pybj/blob/a46355a0b0df0bec1817b04368a5a573358645ef/bjdata/decoder.py#L360-L365>
(i.e. it maps such a buffer to a numpy.ndarray); it should be
relatively trivial to add this to the encoder, though.
On the other hand, the annotated format is currently supported: one
can call the jdata module (responsible for annotation-level
encoding/decoding), as shown in my sample code, which then calls
bjdata internally for the data serialization.
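Here is a minimal sketch of that layering, with jdata handling the
annotation level and bjdata the binary serialization; the
{'compression': 'zlib'} option follows the pyjdata examples, so treat
the exact keyword as an assumption:

    import numpy as np
    import bjdata
    import jdata as jd

    x = np.eye(10000)

    # annotation level: wrap the array in JData tags such as
    # _ArrayType_/_ArraySize_, compressing the payload when requested
    ann = jd.encode(x, {'compression': 'zlib'})

    # serialization level: write the annotated object as BJData
    with open('eye1e4_bjd_zlib.jdb', 'wb') as fp:
        bjdata.dump(ann, fp)

    # decode the annotations to recover the numpy.ndarray
    with open('eye1e4_bjd_zlib.jdb', 'rb') as fp:
        y = jd.decode(bjdata.load(fp))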
> Okay. Given your wording, it looked like you were claiming that the
> binary JSON was supported by the whole ecosystem. Rather, it seems
> like you can either get binary encoding OR the ecosystem support, but
> not both at the same time.
All in relative terms, of course: JSON has ~100 parsers listed on its
website <https://www.json.org/json-en.html>, MessagePack (another
flavor of binary JSON) lists <https://msgpack.org/index.html> ~50-60
parsers, and UBJSON lists <https://ubjson.org/libraries/> ~20 parsers.
I am not familiar with npy parsers, but googling returns only a few.

Also, most binary JSON implementations provide tools to convert to
JSON and back, so, in that sense, whatever JSON has in its ecosystem
can "potentially" be used for binary JSON files if one wants to (see
the round-trip sketch below). There are also recent publications
comparing the differences between various binary JSON formats, in case
anyone is interested:
https://github.com/ubjson/universal-binary-json/issues/115
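As an illustration of that round trip, here is a sketch using
pyjdata's save/load; the suffix-based dispatch between .jdb (binary)
and .json (text) follows the pyjdata examples and is an assumption on
my part:

    import jdata as jd

    data = jd.load('eye1e4_bjd_zlib.jdb')        # binary JSON (BJData) in
    jd.save(data, 'eye1e4_bjd_zlib.json')        # text JSON out, usable by any JSON tool
    roundtrip = jd.load('eye1e4_bjd_zlib.json')  # and back again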