hi Francesc,

wonderful work on blosc2! congrats! this is exactly the direction that I hope more data creators and data users will pay attention to.

clearly blosc2 is well positioned for high performance - msgpack is one of the most widespread binary JSON formats out there, with many extensively optimized libraries; zstd is also a rapidly emerging compression codec with well-developed multi-threading support. this combination is likely the best the current toolchain can offer in terms of performance and robustness. The added SIMD and data-chunking features push the performance bar even further.

I am aware that msgpack does not currently support a packed ND-array data type (see my PR to add this syntax at https://github.com/msgpack/msgpack/pull/267), so I suppose blosc2 must be using customized buffers wrapped inside an ext32 container - is that the case? or did you implement your own unofficial ext64 type?
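
for instance, here is a minimal sketch (my own guess, not necessarily blosc2's actual layout) of how an ND array could be wrapped inside a msgpack ext record; the ext type code and the header layout below are made up purely for illustration:

    import struct
    import msgpack              # pip install msgpack
    import numpy as np

    ND_EXT_CODE = 42            # hypothetical ext type code, not blosc2's

    def pack_ndarray(arr):
        # serialize dtype/shape metadata, then append the raw contiguous buffer;
        # note ext32 caps the payload at 2**32-1 bytes, hence my ext64 question
        meta = msgpack.packb([arr.dtype.str, list(arr.shape)])
        payload = struct.pack('<I', len(meta)) + meta + arr.tobytes()
        return msgpack.packb(msgpack.ExtType(ND_EXT_CODE, payload))

    def unpack_ndarray(buf):
        ext = msgpack.unpackb(buf)   # unregistered ext codes come back as ExtType
        (n,) = struct.unpack_from('<I', ext.data, 0)
        dtype, shape = msgpack.unpackb(ext.data[4:4 + n])
        return np.frombuffer(ext.data, dtype=dtype, offset=4 + n).reshape(shape)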

I am not surprised to see blosc2 outperform npz/jdb in the compression benchmarks, because zstd supports multi-threading; that makes a huge difference, as shown clearly in this 2017 benchmark that I found online:

https://community.centminmod.com/threads/compression-comparison-benchmarks-zstd-vs-brotli-vs-pigz-vs-bzip2-vs-xz-etc.12764/

using the multi-threaded versions of zlib (pigz) and lzma (pxz, pixz, or plzip) would be a more apples-to-apples comparison, but I do believe zstd would still hold an edge in speed (possibly trading away some compression ratio). I also noticed that lbzip2 gives relatively good speed and a high compression ratio. Nothing beats lzma (lzma/zip/xz) in compression ratio, even at zstd's highest setting.
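
for a quick sanity check of the multi-threading effect from Python, something like the following can be used (this relies on the python-zstandard package, which the benchmarks in this thread did not use; the level and thread counts are arbitrary):

    import time
    import numpy as np
    import zstandard as zstd    # pip install zstandard

    data = np.eye(10000).tobytes()      # a similarly redundant payload
    for threads in (1, 4, 16):
        cctx = zstd.ZstdCompressor(level=9, threads=threads)
        t0 = time.time()
        out = cctx.compress(data)
        print(threads, len(out), '%.3fs' % (time.time() - t0))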

I absolutely agree with you that the different flavors of binary JSON (Msgpack vs CBOR vs BSON vs UBJSON vs BJData) matter little, because they are all JSON-convertible and follow the same design principles as JSON - namely simplicity, generality and being lightweight.

I did deliberate over whether to use Msgpack or UBJSON/BJData as the main binary format for NeuroJSON; two things steered my decision:

1. there is *no official packed ND-array support* in either Msgpack or UBJSON. the ND-array is such a fundamental data structure for scientific data storage that it has to be a first-class citizen in data serialization formats - storing an ND array as nested 1D lists, as done in standard msgpack/ubjson, not only loses the dimensional regularity but also adds overhead and breaks the contiguous binary buffer (see the small sketch after this list). That was the main reason I had to extend UBJSON <https://groups.google.com/g/universal-binary-json/c/tgMCEbOmhes/m/s7JlCl58hvQJ> as BJData to natively support an ND-array syntax

2. a key belief <https://pbs.twimg.com/media/FCD_JNtWQAgLq6N?format=png&name=4096x4096> of the NeuroJSON project is that "human readability" is the single most important factor in deciding the longevity of both code and data. The human readability of code has been well addressed and reinforced by open-source/free/libre software licenses (specifically, Freedom 1 <https://www.gnu.org/philosophy/free-sw.en.html#make-changes>), but not many people have been paying attention to the "readability" of data. Admittedly, it is a harder problem. storing data in text files results in much larger sizes and slower speeds, so storing binary data in application-defined binary files, as npy does, is extremely common. However, these binary files are in most cases not directly readable; they depend on a matching parser, which carries the format spec/schema separately from the data themselves, to be read/written correctly. Because the data files are not self-contained, and usually not self-documenting, their utility depends heavily on the parser writers - when a parser phases out an older format, or does not implement the format rigorously, the data can ultimately no longer be opened and become useless.
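
to illustrate point 1, here is a small sketch contrasting the two storage strategies (plain msgpack is used only to show the nested-list form; the "packed" layout is schematic, not the exact BJData byte stream):

    import msgpack
    import numpy as np

    a = np.arange(6, dtype=np.float64).reshape(2, 3)

    # standard msgpack/ubjson route: nested 1D lists with a type marker per
    # element; the dimensional regularity and the contiguous buffer are lost
    nested = msgpack.packb(a.tolist())

    # packed ND-array route (what BJData adds): explicit type/shape metadata
    # followed by one contiguous binary buffer
    packed = msgpack.packb([a.dtype.str, list(a.shape)]) + a.tobytes()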

One feature that really drew my attention to UBJSON/BJData is that they are "quasi-human-readable <https://github.com/NeuroJSON/bjdata/blob/Draft_2/Binary_JData_Specification.md#:~:text=quasi%2Dhuman%2Dreadability>". This is rather *unique* among binary formats, because the "semantic" elements (data type markers, field names and strings) in UBJSON/BJData are all human-readable. Essentially, one can open such a binary file with a text editor and figure out what's inside - if the data file is well self-documented (which the format permits), the data can be quickly understood without depending on a parser.

you can try this command on the lzma-compressed .jdb file:

$ strings -n2 eye5chunk_bjd_lzma.jdb | astyle | sed '/_ArrayZipData_/q'
[ {U
   _ArrayType_SU
   doubleU
   _ArraySize_[U
              ]U
   _ArrayZipType_SU
   lzmaU
   _ArrayZipSize_[U
                  m@
                 ]U
   _ArrayZipData_[$U#uE

as you can see, the subfields of the data (_ArrayType_, _ArraySize_, ...), as well as the data markers ([, {, U, S, ...) and the string values ("double", "lzma", ...) are all directly readable. Some garbled text from the binary stream may also be printed, making it harder to read, but the readability is still far better than in most other binary files, where the meaning/format of the data fields is entirely delegated to the parser, or where the semantic markers are not human-readable (as in msgpack).
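
in Python terms, the decoded record is simply a dict following the JData annotated-array convention, roughly like this (the values shown here are schematic):

    record = {
        '_ArrayType_': 'double',
        '_ArraySize_': [10000, 10000],
        '_ArrayZipType_': 'lzma',
        '_ArrayZipSize_': [1, 100000000],  # pre-compression dimensions of the buffer
        '_ArrayZipData_': b'...',          # compressed row-major buffer (elided)
    }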

again, I applaud the wonderful work from the blosc2 team and have no doubt it has many advantages to offer for sharing array data; on the other hand, I do want to advocate for considering the readability and portability of the data files. Essentially, the NeuroJSON specs <http://neurojson.org/#specs> (JData <https://github.com/NeuroJSON/jdata/blob/Draft_2/JData_specification.md>, BJData <https://github.com/NeuroJSON/bjdata/blob/Draft_2/Binary_JData_Specification.md>, etc.) take on the mission of building a "source-code language" for scientific data storage.


Qianqian


On 8/27/22 04:32, Francesc Alted wrote:
Hi Qianqian,

Your work on bjdata is very interesting.  Our team (Blosc) has been working on something along these lines, and I was curious about how the different approaches compare.  In particular, Blosc2 uses the msgpack format to store binary data in a flexible way, but in my experience, the choice of binary JSON or msgpack is not that important; the real thing is to be able to compress data in chunks that fit in CPU caches, and then trust fast codecs and filters for speed.
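
(For illustration, a minimal sketch of the chunking idea, with the stdlib zlib standing in for Blosc2's actual codecs and filters:)

    import zlib
    import numpy as np

    arr = np.eye(10000)              # the benchmark's highly compressible array
    chunk = 256 * 1024               # pick a chunk size that fits in L2 cache
    buf = memoryview(arr).cast('B')  # raw bytes, no copy
    frames = [zlib.compress(buf[i:i + chunk], 1)
              for i in range(0, len(buf), chunk)]
    # each frame decompresses independently, so reads can stay inside the
    # cache and be parallelized across threads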

I have set up a small benchmark (https://gist.github.com/FrancescAlted/e4d186404f4c87d9620cb6f89a03ba0d) based on your setup, and here are my numbers (using an AMD 5950X processor and a fast SSD):

(python-blosc2) faltet2@ryzen16:~/blosc/python-blosc2/bench$ PYTHONPATH=.. python read-binary-data.py save
time for creating big array (and splits): 0.009s (86.5 GB/s)

** Saving data **
time for saving with npy: 0.450s (1.65 GB/s)
time for saving with np.memmap: 0.689s (1.08 GB/s)
time for saving with npz: 1.021s (0.73 GB/s)
time for saving with jdb (zlib): 4.614s (0.161 GB/s)
time for saving with jdb (lzma): 11.294s (0.066 GB/s)
time for saving with blosc2 (blosclz): 0.020s (37.8 GB/s)
time for saving with blosc2 (zstd): 0.153s (4.87 GB/s)

** Load and operate **
time for reducing with plain numpy (memory): 0.016s (47.4 GB/s)
time for reducing with npy (np.load, no mmap): 0.144s (5.18 GB/s)
time for reducing with np.memmap: 0.055s (13.6 GB/s)
time for reducing with npz: 1.808s (0.412 GB/s)
time for reducing with jdb (zlib): 1.624s (0.459 GB/s)
time for reducing with jdb (lzma): 0.255s (2.92 GB/s)
time for reducing with blosc2 (blosclz): 0.042s (17.7 GB/s)
time for reducing with blosc2 (zstd): 0.070s (10.7 GB/s)
Total sum: 10000.0

So, it is evident that in this scenario compression can accelerate things a lot, especially during compression (saving).  Here are the sizes:

(python-blosc2) faltet2@ryzen16:~/blosc/python-blosc2/bench$ ll -h eye5*
-rw-rw-r-- 1 faltet2 faltet2 989K ago 27 09:51 eye5_blosc2_blosclz.b2frame
-rw-rw-r-- 1 faltet2 faltet2 188K ago 27 09:51 eye5_blosc2_zstd.b2frame
-rw-rw-r-- 1 faltet2 faltet2 121K ago 27 09:51 eye5chunk_bjd_lzma.jdb
-rw-rw-r-- 1 faltet2 faltet2 795K ago 27 09:51 eye5chunk_bjd_zlib.jdb
-rw-rw-r-- 1 faltet2 faltet2 763M ago 27 09:51 eye5chunk-memmap.npy
-rw-rw-r-- 1 faltet2 faltet2 763M ago 27 09:51 eye5chunk.npy
-rw-rw-r-- 1 faltet2 faltet2 785K ago 27 09:51 eye5chunk.npz

Regarding decompression, I am quite pleased with how jdb+lzma performs (especially the compression ratio).  But in order to get a better idea of the actual read performance, it is better to evict the files from the OS cache.  Also, the benchmark performs some operation on the data (in this case a reduction) to make sure that all the data is processed.

So, let's evict the files:

(python-blosc2) faltet2@ryzen16:~/blosc/python-blosc2/bench$ vmtouch -ev eye5*
Evicting eye5_blosc2_blosclz.b2frame
Evicting eye5_blosc2_zstd.b2frame
Evicting eye5chunk_bjd_lzma.jdb
Evicting eye5chunk_bjd_zlib.jdb
Evicting eye5chunk-memmap.npy
Evicting eye5chunk.npy
Evicting eye5chunk.npz

           Files: 7
     Directories: 0
   Evicted Pages: 391348 (1G)
         Elapsed: 0.084441 seconds

And then re-run the benchmark (without re-creating the files):

(python-blosc2) faltet2@ryzen16:~/blosc/python-blosc2/bench$ PYTHONPATH=.. python read-binary-data.py
time for creating big array (and splits): 0.009s (80.4 GB/s)

** Load and operate **
time for reducing with plain numpy (memory): 0.065s (11.5 GB/s)
time for reducing with npy (np.load, no mmap): 0.413s (1.81 GB/s)
time for reducing with np.memmap: 0.547s (1.36 GB/s)
time for reducing with npz: 1.881s (0.396 GB/s)
time for reducing with jdb (zlib): 1.845s (0.404 GB/s)
time for reducing with jdb (lzma): 0.204s (3.66 GB/s)
time for reducing with blosc2 (blosclz): 0.043s (17.2 GB/s)
time for reducing with blosc2 (zstd): 0.072s (10.4 GB/s)
Total sum: 10000.0

In this case we can see that the combination of blosc2+blosclz achieves speeds that are faster than using a plain numpy array.  Having disk I/O go faster than memory is strange enough, but if we take into account that these arrays compress extremely well (more than 1000x in this case), then the I/O overhead is really low compared with the cost of computation (all the decompression takes place in CPU cache, not memory), so in the end, this is not that surprising.
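
(For reference, here are the compression ratios implied by the listing above:)

    raw = 763 * 2**20               # ~800 MB of float64 from np.eye(10000)
    print(raw / (989 * 2**10))      # blosclz frame: ~790x
    print(raw / (188 * 2**10))      # zstd frame: ~4156x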

Cheers!


On Fri, Aug 26, 2022 at 4:26 AM Qianqian Fang <fan...@gmail.com> wrote:

    On 8/25/22 18:33, Neal Becker wrote:


        the loading time (from an nvme drive, Ubuntu 18.04, python
        3.6.9, numpy 1.19.5) for each file is listed below:

        0.179s  eye1e4.npy (mmap_mode=None)
        0.001s  eye1e4.npy (mmap_mode=r)
        0.718s  eye1e4_bjd_raw_ndsyntax.jdb
        1.474s  eye1e4_bjd_zlib.jdb
        0.635s  eye1e4_bjd_lzma.jdb


        clearly, mmapped loading is the fastest option, without
        surprise; it is true that the raw bjdata file is about 5x
        slower to load than npy, but given that the main chunk of the
        data is stored identically (as a contiguous buffer), I
        suppose that with some optimization of the decoder, the gap
        between the two can be substantially shortened. The longer
        loading times of zlib/lzma (and similarly the saving times)
        reflect a trade-off between smaller file sizes and the time
        needed for compression/decompression/disk-IO.

        I think the load time for mmap may be deceptive; it isn't
        actually loading anything, just mapping to memory.  Maybe a
        better benchmark is to actually process the data, e.g., find
        the mean, which would require reading the values.


    yes, that is correct, I meant to mention that it wasn't an
    apples-to-apples comparison.

    the loading times for fully loading the data and printing the
    mean, obtained by running the line below,

    t=time.time(); newy=jd.load('eye1e4_bjd_raw_ndsyntax.jdb'); print(np.mean(newy)); t1=time.time() - t; print(t1)

    are summarized below (I also added an lz4-compressed BJData/.jdb
    file via jd.save(..., {'compression':'lz4'})):

    0.236s  eye1e4.npy (mmap_mode=None) - size: 800000128 bytes
    0.120s  eye1e4.npy (mmap_mode=r)
    0.764s  eye1e4_bjd_raw_ndsyntax.jdb (with C extension _bjdata in sys.path) - size: 800000014 bytes
    0.599s  eye1e4_bjd_raw_ndsyntax.jdb (without C extension _bjdata)
    1.533s  eye1e4_bjd_zlib.jdb (without C extension _bjdata) - size: 813721 bytes
    0.697s  eye1e4_bjd_lzma.jdb (without C extension _bjdata) - size: 113067 bytes
    0.918s  eye1e4_bjd_lz4.jdb (without C extension _bjdata) - size: 3371487 bytes

    the mmapped loading remains the fastest, but the run-times are now
    more realistic. I thought the lz4 compression would offer much
    faster decompression, but for this particular workload, it isn't the case.

    It is also interesting to see that bjdata's C extension
    <https://github.com/NeuroJSON/pybj/tree/master/src> did not help
    when parsing a single large array compared to the native python
    parser, suggesting room for further optimization.
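
    (for instance, since the raw .jdb file is only 14 bytes larger than the
    800,000,000-byte array payload, a decoder could in principle hand the
    buffer straight to numpy; a hypothetical sketch, assuming the 14-byte
    header sits entirely in front of the payload:)

        import numpy as np

        with open('eye1e4_bjd_raw_ndsyntax.jdb', 'rb') as f:
            buf = f.read()
        # 800000014 total - 800000000 payload = 14 assumed header bytes
        arr = np.frombuffer(buf, dtype=np.float64, offset=14).reshape(10000, 10000)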

    Qianqian



--
Francesc Alted

