Hi,

Thanks for the detailed description of what you are pursuing. Please find my comments below.
On Sat, Aug 27, 2022 at 6:17 PM Qianqian Fang <fan...@gmail.com> wrote:

> hi Francesc,
>
> wonderful work on blosc2! congrats! this is exactly the direction that I would hope more data creators/data users would pay attention to.
>
> clearly blosc2 is well positioned for high performance - msgpack is one of the most widely adopted binary JSON formats out there, with many extensively optimized libraries; zstd is also a rapidly emerging compression codec with well-developed multi-threading support. this combination likely offers the best that the current toolchain can deliver in terms of performance and robustness. The added SIMD and data-chunking features further push the performance bar.
>
> I am aware that msgpack does not currently support a packed ND-array data type (see my PR to add this syntax at https://github.com/msgpack/msgpack/pull/267); I suppose blosc2 must have been using customized buffers wrapped inside an ext32 container, is that the case? or did you implement your own unofficial ext64 type?

Not exactly. What we've done is to encode the header and the trailer (i.e. where the metadata is) of the frame with msgpack. The chunks section <https://github.com/Blosc/c-blosc2/blob/main/README_CFRAME_FORMAT.rst#chunks> is where the actual data is; this section does not follow a msgpack structure as such, but is rather a sequence of data chunks plus an index (for quickly locating the chunks). You can easily access the header or trailer sections by reading from the start or the end of the frame. This way you don't need to update the chunk indexes inside msgpack, which can be expensive during data updates.

This indeed prevents the data from being dumped with typical msgpack tools, but our sense is that users should care mostly about the metainfo, and let the libraries deal with the actual data in the most efficient way.
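[For readers unfamiliar with the layout idea described above, here is a minimal, hypothetical sketch in Python. This is NOT the Blosc2 cframe format; the field names and the toy chunk index are made up purely to illustrate "msgpack-encoded metadata at both ends, raw chunk bytes in the middle".]

```python
# Toy illustration of "msgpack header + raw chunks + msgpack trailer".
# NOT the Blosc2 cframe format; all field names here are hypothetical.
import msgpack

def write_toy_frame(path, chunks, user_meta):
    offsets, payload = [], b""
    for c in chunks:                      # chunks: already-compressed byte strings
        offsets.append(len(payload))
        payload += c
    header = msgpack.packb({"magic": "toyframe", "nchunks": len(chunks)})
    trailer = msgpack.packb({"chunk_offsets": offsets, "meta": user_meta})
    with open(path, "wb") as f:
        f.write(header)                   # metadata at the start...
        f.write(payload)                  # ...raw chunk bytes in the middle...
        f.write(trailer)                  # ...and more metadata at the end,
        f.write(len(trailer).to_bytes(4, "little"))  # plus its size for seeking back

def read_toy_trailer(path):
    # Metadata is found by reading a few bytes from the end; no need to
    # re-encode or scan the (possibly huge) chunk payload in between.
    with open(path, "rb") as f:
        f.seek(-4, 2)
        tlen = int.from_bytes(f.read(4), "little")
        f.seek(-(4 + tlen), 2)
        return msgpack.unpackb(f.read(tlen))
```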
> I am not surprised to see blosc2 outperform npz/jdb in compression benchmarks, because zstd supports multi-threading; that makes a huge difference, as shown clearly in this 2017 benchmark that I found online
>
> https://community.centminmod.com/threads/compression-comparison-benchmarks-zstd-vs-brotli-vs-pigz-vs-bzip2-vs-xz-etc.12764/
>
> using the multi-threaded versions of zlib (pigz) and lzma (pxz, pixz, or plzip) would be a more apples-to-apples comparison, but I do believe zstd may still hold an edge in speed (perhaps trading away some compression ratio). I also noticed that lbzip2 gives relatively good speed and a high compression ratio. Nothing beats lzma (lzma/xz) in compression ratio, even at the highest zstd setting.

Not quite. Blosc2 does not use the multi-threaded version of zstd; it implements its own internal multi-threading engine, so all the codecs (and filters) benefit from it, and there is no need to rely on a multi-threaded codec for speed. Also, as filters execute prior to codecs, they can reuse the same internal buffers, avoiding copies (which is critical for achieving high I/O performance).

> I absolutely agree with you that the different flavors of binary JSON formats (Msgpack vs CBOR vs BSON vs UBJSON vs BJData) matter little, because they are all JSON-convertible and follow the same design principles as JSON - namely simplicity, generality and being lightweight.
>
> I did make some deliberations when deciding whether to use Msgpack or UBJSON/BJData as the main binary format for NeuroJSON; two things steered my decision:
>
> 1. there is *no official packed ND-array support* in either Msgpack or UBJSON. The ND-array is such a fundamental data structure for scientific data storage that it has to be a first-class citizen in data serialization formats - storing an ND array as nested 1D lists, as done in standard msgpack/ubjson, not only loses the dimensional regularity but also adds overhead and breaks the contiguous binary buffer. That was the main reason I had to extend UBJSON <https://groups.google.com/g/universal-binary-json/c/tgMCEbOmhes/m/s7JlCl58hvQJ> as BJData to natively support an ND-array syntax.

As said, we are not using packed ND arrays in msgpack, but rather our own schema. Blosc2 supports the concept of metalayers for adding new meaning to the stored data (see https://www.blosc.org/docs/Caterva-Blosc2-SciPy2019.pdf, slide 17). One of these layers is Caterva, where we have added support for MD arrays <https://github.com/Blosc/caterva/blob/master/CATERVA_METALAYER.rst>. Note that our implementation for supporting ND arrays uses two levels of partitioning (chunks and blocks) in order to:

1. Allow finer granularity <https://www.blosc.org/posts/caterva-slicing-perf/> when retrieving data (see the sketch below).

2. Better adapt to the memory hierarchy (i.e. main memory and the CPU cache levels) for efficiency <https://www.blosc.org/posts/breaking-memory-walls/>.
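[To make the two-level partitioning idea concrete, here is a minimal NumPy sketch. This is my own illustration, not Caterva/Blosc2 code; the chunk and block shapes below are arbitrary examples.]

```python
# Minimal illustration of two-level partitioning (chunks -> blocks).
# NOT Caterva/Blosc2 code; the shapes below are arbitrary examples.
import numpy as np

def split_2d(a, chunk=(1000, 1000), block=(250, 250)):
    """Yield (chunk_origin, block_origin, block_view) for a 2-D array."""
    for ci in range(0, a.shape[0], chunk[0]):
        for cj in range(0, a.shape[1], chunk[1]):
            c = a[ci:ci + chunk[0], cj:cj + chunk[1]]
            for bi in range(0, c.shape[0], block[0]):
                for bj in range(0, c.shape[1], block[1]):
                    yield (ci, cj), (bi, bj), c[bi:bi + block[0], bj:bj + block[1]]

a = np.eye(5000)
# Each block would be the unit of compression/decompression, so slicing a
# small region of `a` only needs a handful of blocks, not whole chunks.
nblocks = sum(1 for _ in split_2d(a))
print(nblocks)   # 5*5 chunks * 4*4 blocks per chunk = 400 blocks
```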
OTOH, I have noticed that your patch for msgpack <https://github.com/msgpack/msgpack/pull/267/files#diff-bc6661da34ecae62fbe724bb93fd69b91a7f81143f2683a81163231de7e3b545R334> only suggests using uint32 as the type for the array shape. This would prevent creating arrays where some dimension is larger than 2^32. Is that intended?

> 2. a key belief <https://pbs.twimg.com/media/FCD_JNtWQAgLq6N?format=png&name=4096x4096> of the NeuroJSON project is that "human readability" is the single most important factor deciding the longevity of both code and data. The human-readability of code has been well addressed and reinforced by open-source/free/libre software licenses (specifically, Freedom 1 <https://www.gnu.org/philosophy/free-sw.en.html#make-changes>), but not many people have been paying attention to the "readability" of data. Admittedly, it is a harder problem: storing data in text files results in much larger sizes and slower speeds, so storing binary data in application-defined binary files, just like npy, is extremely common. However, these binary files are in most cases not directly readable; they depend on a matching parser, which carries the format spec/schema separately from the data themselves, to correctly read/write. Because the data files are not self-contained, and usually not self-documenting, their utility depends heavily on the parser writers - when a parser phases out an older format, or does not implement the format rigorously, the data ultimately can no longer be opened and becomes useless.
>
> One feature that really drew my attention to UBJSON/BJData is that they are "quasi-human-readable <https://github.com/NeuroJSON/bjdata/blob/Draft_2/Binary_JData_Specification.md#:~:text=quasi%2Dhuman%2Dreadability>". This is rather *unique* among binary formats, because the "semantic" elements (data type markers, field names and strings) in UBJSON/BJData are all human-readable. Essentially one can open such a binary file with a text editor and figure out what's inside - if the data file is well self-documented (which the format permits), then such data can be quickly understood without depending on a parser.
>
> you can try this command on the lzma.jdb file
>
> $ strings -n2 eye5chunk_bjd_lzma.jdb | astyle | sed '/_ArrayZipData_/q'
> [ {U
> _ArrayType_SU
> doubleU
> _ArraySize_[U
> ]U
> _ArrayZipType_SU
> lzmaU
> _ArrayZipSize_[U
> m@
> ]U
> _ArrayZipData_[$U#uE
>
> as you can see, the subfields of the data (_ArraySize_, _ArrayType_, ...), as well as the data markers ([, {, U, S, ...) and string values ("double", "lzma", ...) are all directly readable. There is garbled text in the binary stream that may also get printed and make it harder to read, but its readability is still way better than most other binary files, where the meaning/format of the data fields is completely delegated to the parser or the semantic markers are not human-readable (as in msgpack).

I see your point, and your intent is really appreciated. It is just that in the domain of tens of GB and up I see BJData a bit lacking, in that text handling tools (strings, sed, not to mention editors, where you can run out of memory very soon) can become unnecessarily slow for retrieving the metainfo. We really feel that such metainfo should go either at the beginning or at the end of the frame, where it can be found and processed much more efficiently.

OTOH, I agree that msgpack is not directly human-readable, but the format is becoming so ubiquitous that you can find standard tools for introspecting the metadata quite easily:

$ msgpack2json -di eye5_blosc2_blosclz.b2frame
[
  "b2frame\u0000",
  97,
  1012063,
  "\u0012\u0000P\u0000",
  800000000,
  1011729,
  8,
  0,
  16000000,
  8,
  1,
  false,
  <ext type:6 size:16 0000000000010000...>,
  [
    7,
    {},
    []
  ]
]

And, as there are msgpack libraries for almost all of the currently used languages, I think that formats based on it are as open and transparent as we can get.

> again, I applaud the wonderful work from the blosc2 team and have no doubt it has many advantages to offer for sharing array data; on the other hand, I do want to advocate for considering readability and portability of the data files. Essentially the NeuroJSON specs <http://neurojson.org/#specs> (JData <https://github.com/NeuroJSON/jdata/blob/Draft_2/JData_specification.md>, BJData <https://github.com/NeuroJSON/bjdata/blob/Draft_2/Binary_JData_Specification.md>, etc.) are taking on the mission of building a "source-code language" for scientific data storage.

Thanks, I concur with your work too! It is always nice to discuss with people who have put a lot of thought into how to pack data efficiently, and as simply as possible (but not any simpler!). Actually, we might adopt some aspects of JData <https://github.com/fangq/jdata> to be able to store different objects (arrays, tables, graphs, trees...) in the same frame in a possible future extension of Blosc2. Or maybe use JData as the external container for existing Blosc2 frames. Very interesting discussion indeed; many possibilities are open now!

Cheers,
Francesc
> On 8/27/22 04:32, Francesc Alted wrote:
>
> Hi Qianqian,
>
> Your work on bjdata is very interesting. Our team (Blosc) has been working on something along these lines, and I was curious how the different approaches compare. In particular, Blosc2 uses the msgpack format to store binary data in a flexible way, but in my experience, using binary JSON or msgpack is not that important; the real thing is to be able to compress data in chunks that fit in CPU caches, and then trust fast codecs and filters for speed.
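[As a rough, library-agnostic illustration of that chunking point, here is a sketch using only the standard-library zlib, not Blosc2 itself; the chunk size is an arbitrary example. Keeping each chunk small keeps the working set cache-friendly and lets chunks be decoded independently.]

```python
# Library-agnostic sketch of chunked compression (this is not Blosc2 code;
# Blosc2 additionally applies filters, SIMD and its own multi-threading).
import zlib
import numpy as np

a = np.eye(5000)                       # ~200 MB of float64, highly compressible
chunk_nbytes = 4 * 2**20               # 4 MB chunks: small enough to stay cache-friendly
buf = a.tobytes()

chunks = [zlib.compress(buf[i:i + chunk_nbytes], 1)
          for i in range(0, len(buf), chunk_nbytes)]

ratio = len(buf) / sum(len(c) for c in chunks)
print(f"{len(chunks)} chunks, compression ratio ~{ratio:.0f}x")

# Decompression also works chunk by chunk, so a reader can decode only the
# chunks that intersect the region of interest instead of the whole buffer.
restored = np.frombuffer(b"".join(zlib.decompress(c) for c in chunks),
                         dtype=a.dtype).reshape(a.shape)
assert np.array_equal(restored, a)
```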
> I have set up a small benchmark (https://gist.github.com/FrancescAlted/e4d186404f4c87d9620cb6f89a03ba0d) based on your setup, and here are my numbers (using an AMD 5950X processor and a fast SSD):
>
> (python-blosc2) faltet2@ryzen16:~/blosc/python-blosc2/bench$ PYTHONPATH=.. python read-binary-data.py save
> time for creating big array (and splits): 0.009s (86.5 GB/s)
>
> ** Saving data **
> time for saving with npy: 0.450s (1.65 GB/s)
> time for saving with np.memmap: 0.689s (1.08 GB/s)
> time for saving with npz: 1.021s (0.73 GB/s)
> time for saving with jdb (zlib): 4.614s (0.161 GB/s)
> time for saving with jdb (lzma): 11.294s (0.066 GB/s)
> time for saving with blosc2 (blosclz): 0.020s (37.8 GB/s)
> time for saving with blosc2 (zstd): 0.153s (4.87 GB/s)
>
> ** Load and operate **
> time for reducing with plain numpy (memory): 0.016s (47.4 GB/s)
> time for reducing with npy (np.load, no mmap): 0.144s (5.18 GB/s)
> time for reducing with np.memmap: 0.055s (13.6 GB/s)
> time for reducing with npz: 1.808s (0.412 GB/s)
> time for reducing with jdb (zlib): 1.624s (0.459 GB/s)
> time for reducing with jdb (lzma): 0.255s (2.92 GB/s)
> time for reducing with blosc2 (blosclz): 0.042s (17.7 GB/s)
> time for reducing with blosc2 (zstd): 0.070s (10.7 GB/s)
> Total sum: 10000.0
>
> So, it is evident that in this scenario compression can accelerate things a lot, especially when saving. Here are the sizes:
>
> (python-blosc2) faltet2@ryzen16:~/blosc/python-blosc2/bench$ ll -h eye5*
> -rw-rw-r-- 1 faltet2 faltet2 989K ago 27 09:51 eye5_blosc2_blosclz.b2frame
> -rw-rw-r-- 1 faltet2 faltet2 188K ago 27 09:51 eye5_blosc2_zstd.b2frame
> -rw-rw-r-- 1 faltet2 faltet2 121K ago 27 09:51 eye5chunk_bjd_lzma.jdb
> -rw-rw-r-- 1 faltet2 faltet2 795K ago 27 09:51 eye5chunk_bjd_zlib.jdb
> -rw-rw-r-- 1 faltet2 faltet2 763M ago 27 09:51 eye5chunk-memmap.npy
> -rw-rw-r-- 1 faltet2 faltet2 763M ago 27 09:51 eye5chunk.npy
> -rw-rw-r-- 1 faltet2 faltet2 785K ago 27 09:51 eye5chunk.npz
>
> Regarding decompression, I am quite pleased with how jdb+lzma performs (especially the compression ratio). But in order to get a better idea of the actual read performance, it is better to evict the files from the OS cache. Also, the benchmark performs some operation on the data (in this case a reduction) to make sure that all of it is actually processed.
>
> So, let's evict the files:
>
> (python-blosc2) faltet2@ryzen16:~/blosc/python-blosc2/bench$ vmtouch -ev eye5*
> Evicting eye5_blosc2_blosclz.b2frame
> Evicting eye5_blosc2_zstd.b2frame
> Evicting eye5chunk_bjd_lzma.jdb
> Evicting eye5chunk_bjd_zlib.jdb
> Evicting eye5chunk-memmap.npy
> Evicting eye5chunk.npy
> Evicting eye5chunk.npz
>
>   Files: 7
>   Directories: 0
>   Evicted Pages: 391348 (1G)
>   Elapsed: 0.084441 seconds
>
> And then re-run the benchmark (without re-creating the files, of course):
>
> (python-blosc2) faltet2@ryzen16:~/blosc/python-blosc2/bench$ PYTHONPATH=.. python read-binary-data.py
> time for creating big array (and splits): 0.009s (80.4 GB/s)
>
> ** Load and operate **
> time for reducing with plain numpy (memory): 0.065s (11.5 GB/s)
> time for reducing with npy (np.load, no mmap): 0.413s (1.81 GB/s)
> time for reducing with np.memmap: 0.547s (1.36 GB/s)
> time for reducing with npz: 1.881s (0.396 GB/s)
> time for reducing with jdb (zlib): 1.845s (0.404 GB/s)
> time for reducing with jdb (lzma): 0.204s (3.66 GB/s)
> time for reducing with blosc2 (blosclz): 0.043s (17.2 GB/s)
> time for reducing with blosc2 (zstd): 0.072s (10.4 GB/s)
> Total sum: 10000.0
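[For reference, the measurement procedure described above (evict the OS page cache, then time a full read plus a reduction) can be sketched in a few lines of Python. This is a minimal sketch, not the actual read-binary-data.py script from the gist; the file name is one of those listed above, and vmtouch must be installed.]

```python
# Minimal sketch of "evict from page cache, then time a full read + reduction".
# Not the actual benchmark script from the gist.
import subprocess
import time
import numpy as np

fname = "eye5chunk.npy"

# Evict the file from the OS page cache so we measure real disk reads.
subprocess.run(["vmtouch", "-ev", fname], check=True)

t0 = time.perf_counter()
a = np.load(fname)          # load the array from disk
total = a.sum()             # the reduction forces every value to be touched
dt = time.perf_counter() - t0

gb = a.nbytes / 2**30
print(f"time for reducing with npy: {dt:.3f}s ({gb / dt:.2f} GB/s)")
print("Total sum:", total)
```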
> In this case we can notice that the combination of blosc2+blosclz achieves speeds that are faster than using a plain numpy array. Having disk I/O go faster than memory is strange enough, but if we take into account that these arrays compress extremely well (more than 1000x in this case), then the I/O overhead is really low compared with the cost of computation (all the decompression takes place in CPU cache, not main memory), so in the end this is not that surprising.
>
> Cheers!
>
>
> On Fri, Aug 26, 2022 at 4:26 AM Qianqian Fang <fan...@gmail.com> wrote:
>
>> On 8/25/22 18:33, Neal Becker wrote:
>>
>>> the loading time (from an nvme drive, Ubuntu 18.04, python 3.6.9, numpy 1.19.5) for each file is listed below:
>>>
>>> 0.179s  eye1e4.npy (mmap_mode=None)
>>> 0.001s  eye1e4.npy (mmap_mode=r)
>>> 0.718s  eye1e4_bjd_raw_ndsyntax.jdb
>>> 1.474s  eye1e4_bjd_zlib.jdb
>>> 0.635s  eye1e4_bjd_lzma.jdb
>>>
>>> clearly, mmapped loading is the fastest option, without a surprise; it is true that the raw bjdata file is about 5x slower to load than npy, but given that the main chunk of the data is stored identically (as a contiguous buffer), I suppose that with some optimization of the decoder the gap between the two can be substantially shortened. The longer loading times for zlib/lzma (and similarly the saving times) reflect a trade-off between smaller file sizes and the time spent on compression/decompression/disk I/O.
>>>
>>> I think the load time for mmap may be deceptive, it isn't actually loading anything, just mapping to memory. Maybe a better benchmark is to actually process the data, e.g., find the mean, which would require reading the values.
>>
>> yes, that is correct, I meant to mention it wasn't an apples-to-apples comparison.
>>
>> the loading times for fully loading the data and printing the mean, by running the line below
>>
>> t=time.time(); newy=jd.load('eye1e4_bjd_raw_ndsyntax.jdb'); print(np.mean(newy)); t1=time.time() - t; print(t1)
>>
>> are summarized below (I also added an lz4-compressed BJData/.jdb file via jd.save(..., {'compression':'lz4'}))
>>
>> 0.236s  eye1e4.npy (mmap_mode=None) - size: 800000128 bytes
>> 0.120s  eye1e4.npy (mmap_mode=r)
>> 0.764s  eye1e4_bjd_raw_ndsyntax.jdb (with C extension _bjdata in sys.path) - size: 800000014 bytes
>> 0.599s  eye1e4_bjd_raw_ndsyntax.jdb (without C extension _bjdata)
>> 1.533s  eye1e4_bjd_zlib.jdb (without C extension _bjdata) - size: 813721 bytes
>> 0.697s  eye1e4_bjd_lzma.jdb (without C extension _bjdata) - size: 113067 bytes
>> 0.918s  eye1e4_bjd_lz4.jdb (without C extension _bjdata) - size: 3371487 bytes
>>
>> mmapped loading remains the fastest, but the run time is more realistic. I thought lz4 compression would offer much faster decompression, but for this particular workload it isn't the case.
>>
>> It is also interesting to see that bjdata's C extension <https://github.com/NeuroJSON/pybj/tree/master/src> did not help when parsing a single large array compared to the native Python parser, suggesting room for further optimization.
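[For readers who want to reproduce this kind of measurement, below is a minimal round-trip sketch built from the calls quoted above (jd.load and the jd.save(..., {'compression': ...}) form). The file name and the assumption that jd.save takes the target file name plus an options dict are taken from the quoted snippet; treat them as illustrative rather than authoritative.]

```python
# Minimal sketch of the save / load-and-reduce timing quoted above,
# using the jdata module (imported as jd) and numpy.
import time
import numpy as np
import jdata as jd

x = np.eye(10000)                                            # ~800 MB float64 array

jd.save(x, 'eye1e4_bjd_lzma.jdb', {'compression': 'lzma'})   # write a compressed .jdb

t = time.time()
newy = jd.load('eye1e4_bjd_lzma.jdb')                        # decode (and decompress) the array
print(np.mean(newy))                                         # touch every value
print(time.time() - t)                                       # wall-clock load + reduce time
```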
>>
>> Qianqian
>
> --
> Francesc Alted

--
Francesc Alted
_______________________________________________
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: arch...@mail-archive.com