Hi,

On Thu, Sep 1, 2022 at 6:18 AM Qianqian Fang <fan...@gmail.com> wrote:

> On 8/30/22 06:29, Francesc Alted wrote:
>
>
> Not exactly.  What we've done is to encode the header and the trailer
> (i.e. where the metadata is) of the frame with msgpack.  The chunks
> section
> <https://github.com/Blosc/c-blosc2/blob/main/README_CFRAME_FORMAT.rst#chunks>
> is where the actual data is; this section does not follow a msgpack
> structure as such, but is rather a sequence of data chunks and an index
> (for quickly locating the chunks).  You can easily access the header or
> trailer sections by reading from the start or the end of the frame.  This
> way you don't need to update the indexes of chunks in msgpack, which can
> be expensive during data updates.
>
> This indeed prevents the data from being dumped with typical msgpack
> tools, but our sense is that users should care mostly about the metainfo,
> and let the libraries deal with the actual data in the most efficient way.
>
>
> thanks for your detailed reply. I spent the past few days reading the
> links/documentation, as well as experimenting with the blosc2
> meta-compressors. I was quite impressed by the performance of blosc2. I
> was also happy to see great alignment between the drives behind Caterva
> and those of NeuroJSON.
>
> I have a few quick updates
>
> 1. I added blosc2 as a codec in my jdata module, as an alternative
> compressor to zlib/lzma/lz4
>
>
> https://github.com/NeuroJSON/pyjdata/commit/ce25fa53ce73bf4cbe2cff9799b5a616e2cd75cb
>

Looks good!  Although if you want to support arrays larger than 2 GB, you'd
better use a frame (as I was doing in my example; see also
https://github.com/Blosc/python-blosc2/blob/082db1d2d2ec9afac653903775e2dccac97e2bc9/examples/schunk.py).
Also, the frame is the one that uses msgpack for storing the metainfo.
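
For reference, here is a minimal sketch of that frame-based (SChunk) API
(the file name, chunk sizes and codec are purely illustrative; see the
example linked above for the full version):

    import blosc2
    import numpy as np

    # A frame (super-chunk) splits the data into chunks internally, so it
    # is not limited to 2 GB like a single chunk is.
    nchunks = 10
    chunk_nitems = 200 * 1000
    blosc2.remove_urlpath("data.b2frame")  # start from a clean file
    schunk = blosc2.SChunk(chunksize=chunk_nitems * 4,
                           urlpath="data.b2frame", contiguous=True,
                           cparams={"codec": blosc2.Codec.BLOSCLZ,
                                    "typesize": 4})
    for i in range(nchunks):
        chunk = np.full(chunk_nitems, i, dtype=np.int32)
        schunk.append_data(chunk)  # compress and append one chunk

    # Read a single chunk back without touching the others
    out = np.empty(chunk_nitems, dtype=np.int32)
    schunk.decompress_chunk(0, dst=out)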


> 2. as I mentioned, jdata/bjdata were not optimized for speed; they
> contained many inefficiencies in handling numpy arrays (as I discovered).
> after some profiling, I was able to remove most of those; the run-time is
> now nearly entirely spent in compression/decompression (see attached
> profiler outputs for the `zlib` compressor benchmark)
>

Looks good.


> 3. the new jdata that supports blosc2, v0.5.0, has been tagged and
> uploaded (https://pypi.org/project/jdata)
>

That's great.  Although, as said, switching from a single chunk to a frame
would allow you to store data > 2 GB.  Whether or not this is a goal for
you, I don't know.


> 4. I wrote a script and compared the run times of various codecs (using
> BJData and JSON as containers); the code can be found here
>
> https://github.com/NeuroJSON/pyjdata/blob/master/test/benchcodecs.py
>
> the save/load times tested on a Ryzen 9 3950X/Ubuntu 18.04 box (at various
> thread counts) are listed below (similar to what you posted before)
>
>
> *- Testing npy/npz*
>   'npy',        'save' 0.2914195 'load' 0.1963226  'size'  800000128
>   'npz',        'save' 2.8617918 'load' 1.9550347  'size'  813846
>
> *- Testing text-based JSON files (.jdt) (nthread=8)...*
>   'zlib',       'save' 2.5132861 'load' 1.7221164  'size'  1084942
>   'lzma',       'save' 9.5481696 'load' 0.3865211  'size'  150738
>   'lz4',        'save' 0.3467197 'load' 0.5019965  'size'  4495297
>   'blosc2blosclz','save' 0.0165646 'load' 0.1143934  'size'  1092747
>   'blosc2lz4',  'save' 0.0175058 'load' 0.1015181  'size'  1090159
>   'blosc2lz4hc','save' 0.2102167 'load' 0.1053235  'size'  4315421
>   'blosc2zlib', 'save' 0.1002635 'load' 0.1188845  'size'  1270252
>   'blosc2zstd', 'save' 0.0463817 'load' 0.1017909  'size'  253176
>
> *- Testing binary JSON (BJData) files (.jdb) (nthread=8)...*
>   'zlib',       'save' 2.4401443 'load' 1.6316463  'size'  813721
>   'lzma',       'save' 9.3782029 'load' 0.3728334  'size'  113067
>   'lz4',        'save' 0.3389360 'load' 0.5017435  'size'  3371487
>   'blosc2blosclz','save' 0.0173912 'load' 0.1042985  'size'  819576
>   'blosc2lz4',  'save' 0.0133688 'load' 0.1030941  'size'  817635
>   'blosc2lz4hc','save' 0.1968047 'load' 0.0950071  'size'  3236580
>   'blosc2zlib', 'save' 0.1023218 'load' 0.1083922  'size'  952705
>   'blosc2zstd', 'save' 0.0468430 'load' 0.1019175  'size'  189897
>
> *- Testing binary JSON (BJData) files (.jdb) (nthread=1)...*
>   'blosc2blosclz','save' 0.0883078 'load' 0.2432985  'size'  819576
>   'blosc2lz4',  'save' 0.0867996 'load' 0.2394990  'size'  817635
>   'blosc2lz4hc','save' 2.4794559 'load' 0.2498981  'size'  3236580
>   'blosc2zlib', 'save' 0.7477457 'load' 0.4873921  'size'  952705
>   'blosc2zstd', 'save' 0.3435547 'load' 0.3754863  'size'  189897
>
> *- Testing binary JSON (BJData) files (.jdb) (nthread=32)...*
>   'blosc2blosclz','save' 0.0197186 'load' 0.1410989  'size'  819576
>   'blosc2lz4',  'save' 0.0168068 'load' 0.1414074  'size'  817635
>   'blosc2lz4hc','save' 0.0790011 'load' 0.0935394  'size'  3236580
>   'blosc2zlib', 'save' 0.0608818 'load' 0.0985531  'size'  952705
>   'blosc2zstd', 'save' 0.0370790 'load' 0.0945577  'size'  189897
>
> a few observations:
>
> 1. single-threaded zlib/lzma are relatively slow, as reflected by the
> npz, zlib and lzma results
>
> 2. for a simple data structure like this one, using a JSON/text-based
> wrapper vs a binary wrapper makes a marginal difference in speed; the only
> penalty is that text/JSON is ~33% larger than binary in size due to base64
>

Hmm, base64 adding almost no overhead is actually quite surprising to me,
because it adds at least a copy.  What could be happening here is that the
compressed size is so small that this doesn't affect performance too much;
but in the general case, I'd really expect converting to/from base64 to
have a noticeable impact.
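
For instance, a quick (purely illustrative) way to see the raw base64 cost,
independent of any compressor:

    import base64
    import os
    import time

    payload = os.urandom(100 * 1000 * 1000)  # stand-in for a compressed buffer
    t0 = time.perf_counter()
    encoded = base64.b64encode(payload)   # full extra copy, 4/3 size expansion
    t1 = time.perf_counter()
    decoded = base64.b64decode(encoded)   # another copy on the way back
    t2 = time.perf_counter()
    print(f"encode: {t1 - t0:.3f}s  decode: {t2 - t1:.3f}s  "
          f"expansion: {len(encoded) / len(payload):.2f}x")

With a small compressed buffer this cost is negligible, which would explain
your numbers.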


> 3. blosc2 overall delivered very impressive speed - even in a single
> thread, it can be faster than uncompressed npy or other standard
> compression methods
>
> 4. several blosc2 compressors scaled well with more threads
>
> 5. it is a bit strange that blosc2lz4hc yielded a larger file size,
> similar to that from standard lz4, while blosc2lz4 produced a size
> comparable to zlib; I expected the opposite findings, because lz4hc is
> supposed to give "higher compression"
>
True.  The fact is that this is the first time I'm seeing this behavior in
lz4hc.  The normal situation looks more like this:
https://github.com/Blosc/python-blosc2/blob/main/README.rst#benchmarking.
Why lz4hc is not behaving 'well' in terms of compression ratio here escapes
me.  Could you come up with a small reproducible example and open a ticket
in the C-Blosc2 project?  We will look into it.
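
Something along these lines would be enough (a sketch; the array below is
just a placeholder for the actual payload from your benchmark):

    import blosc2
    import numpy as np

    # Substitute the data that shows the unexpected LZ4HC ratio
    data = np.arange(10 * 1000 * 1000, dtype=np.float64)

    for codec in (blosc2.Codec.LZ4, blosc2.Codec.LZ4HC, blosc2.Codec.ZSTD):
        frame = blosc2.compress2(data, codec=codec, typesize=8, nthreads=1)
        print(codec, len(frame))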


> one question I have is: how stable is your format spec? do you expect
> buffers compressed by your current blosc2 library to still be
> opened/parsed by your future releases (at least with an intent to)?
>

The Blosc2 format (both chunk
<https://github.com/Blosc/c-blosc2/blob/main/README_CHUNK_FORMAT.rst> and
frame <https://github.com/Blosc/c-blosc2/blob/main/README_CFRAME_FORMAT.rst>)
has been declared stable since May 2021
<https://www.blosc.org/posts/blosc2-ready-general-review/>.  You should
expect future versions of Blosc2 to be able to read the data stored in that
format since then.
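
Concretely, a frame persisted today should remain readable by future
releases; a sketch (assuming the blosc2.open() entry point of recent
python-blosc2 and the file from the earlier sketch):

    import blosc2

    schunk = blosc2.open("data.b2frame")  # re-open a previously stored frame
    print(schunk.nchunks, schunk.chunksize)
    first_chunk = schunk.decompress_chunk(0)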


> Not quite.  Blosc2 does not use the multi-threaded version of zstd; it
> rather implements its own internal multi-threading engine, and hence all
> the codecs (and filters) benefit from it, so there is no need to rely on
> a multi-threaded codec for speed.  Also, as filters execute prior to
> codecs, they can reuse the same internal buffers, avoiding copies (which
> is critical for achieving high I/O performance).
>
>
> As said, we are not using packed ND arrays in msgpack, but rather our own
> schema.  Blosc2 supports the concept of metalayers for adding new meaning
> to the stored data (see
> https://www.blosc.org/docs/Caterva-Blosc2-SciPy2019.pdf, slide 17).  One
> of these layers is Caterva, where we have added support for MD arrays
> <https://github.com/Blosc/caterva/blob/master/CATERVA_METALAYER.rst>.
> Note that our implementation for supporting ND arrays uses two levels of
> partitioning (chunks and blocks) to:
>
> 1. Allow finer granularity
> <https://www.blosc.org/posts/caterva-slicing-perf/> in retrieving data.
>
> 2. Better adapt to the memory hierarchies (i.e. main memory and cache
> levels in the CPU) for efficiency
> <https://www.blosc.org/posts/breaking-memory-walls/>.
>
> OTOH, I have noticed that your patch for msgpack
> <https://github.com/msgpack/msgpack/pull/267/files#diff-bc6661da34ecae62fbe724bb93fd69b91a7f81143f2683a81163231de7e3b545R334>
> only suggests using uint32 as the type for the array shape.  This would
> prevent creating arrays where some dim is larger than 2^32.  Is that
> intended?
>
>
> see the last part of this post
>
> https://github.com/msgpack/msgpack/issues/268#issuecomment-495050845
>
> in BJData, the ND-array dimensional vector supports different integer
> types
> <https://github.com/NeuroJSON/bjdata/blob/Draft_2/Binary_JData_Specification.md#optimized-n-dimensional-array-of-uniform-type>
>

Ok.  A lot of details had escaped me.  Thanks.


>
> I see your point, and your intent is really appreciated.  It is just in
> the 10s-of-GB-and-up domain that I see BJData a bit lacking, in that
> text-handling tools (strings, sed, not to mention editors, where you can
> run out of memory very soon) can become unnecessarily slow for retrieving
> the metainfo.  We really feel that such metainfo should go either at the
> beginning or at the end of the frame, where it can be found and processed
> much more efficiently.
>
>
> regardless of which serialization format is chosen, I think both projects
> see the need to store hierarchical metadata alongside the data. I agree
> with you that if reading/searching metadata is desired, the header and
> trailer are the best places. For efficient search of metadata while
> accommodating large amounts of binary data at scale, CouchDB/MongoDB use
> "attachments" to hold large binary data. The metadata tree and the
> attachment can be linked using a simple UUID or JSON-reference string
>
Yes, using separate storage is fine for non-contiguous storage (aka
frames), but unfortunately that is not possible in our case.

>
> OTOH, I agree in that msgpack is not human readable directly, but the
> format is becoming so ubiquitous that you can find standard tools for
> introspecting metadata quite easily
>
>
> it would be nice to store the header data in a map so it can be
> self-explanatory (at just a small cost in size). I am even willing to go
> as far as adding non-essential metadata that can help make the data file
> as self-explanatory as possible, such as specs, schemas and parsers, just
> because the format allows it and it costs almost nothing
>
>
> https://github.com/rordenlab/dcm2niix/blob/v1.0.20220720/console/nii_dicom_batch.cpp#L4334-L4344
>
Agreed that using maps would add more readability, but they also consume
more space.  Right now the minimum header size in Blosc2 is around 120
bytes, which, with some metainfo, could go to around 200 bytes.  Having
more than that would be an unnecessary waste when you are storing small
data or highly compressible data.
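
To put rough numbers on it (the field names here are purely illustrative,
not the actual Blosc2 header layout):

    import msgpack  # pip install msgpack

    fields = {"magic": "b2frame", "flags": 0, "header_len": 120,
              "frame_len": 1024}
    as_map = msgpack.packb(fields)                   # keys stored with values
    as_array = msgpack.packb(list(fields.values()))  # positional, no keys
    print(len(as_map), len(as_array))  # the map is noticeably larger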


> For example:
>
> $ msgpack2json -di eye5_blosc2_blosclz.b2frame
> [
> ...
> ]
>
> And, as there are msgpack libraries for almost all of the currently used
> languages, I think that formats based on it are as open and transparent as
> we can get.
>
>
>>
>> again, I applaud the wonderful work from the blosc2 team and have no
>> doubt it has many advantages to offer for sharing array data; on the
>> other hand, I do want to advocate for considering the readability and
>> portability of the data files. Essentially, the NeuroJSON specs
>> <http://neurojson.org/#specs> (JData
>> <https://github.com/NeuroJSON/jdata/blob/Draft_2/JData_specification.md>,
>> BJData
>> <https://github.com/NeuroJSON/bjdata/blob/Draft_2/Binary_JData_Specification.md>,
>> etc.) are taking on the mission of building a "source-code language" for
>> scientific data storage.
>>
>
> Thanks, I concur with your work too!  It is always nice to discuss with
> people who have put a lot of thought into how to pack data efficiently,
> and as simply as possible (but not any simpler!).  Actually, we might be
> adopting some aspects of JData <https://github.com/fangq/jdata> to be
> able to store different objects (arrays, tables, graphs, trees...) in the
> same frame in a possible future extension of Blosc2.  Or maybe using
> JData as the external container for existing Blosc2 frames.  Very
> interesting discussion indeed; many possibilities are open now!
>
>
> will be absolutely happy to explore collaboration possibilities; will
> reach out offline.
>

Cool.  Let's keep in touch.

-- 
Francesc Alted
_______________________________________________
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com

Reply via email to