On 8/25/22 18:33, Neal Becker wrote:


    The loading time (from an NVMe drive, Ubuntu 18.04, Python 3.6.9,
    NumPy 1.19.5) for each file is listed below:

    0.179s  eye1e4.npy (mmap_mode=None)
    0.001s  eye1e4.npy (mmap_mode=r)
    0.718s  eye1e4_bjd_raw_ndsyntax.jdb
    1.474s  eye1e4_bjd_zlib.jdb
    0.635s  eye1e4_bjd_lzma.jdb


    Clearly, mmapped loading is unsurprisingly the fastest option. It is
    true that the raw BJData file is about 5x slower to load than the
    .npy file, but given that the main chunk of the data is stored
    identically (as a contiguous buffer) in both formats, I suppose that
    with some optimization of the decoder the gap between the two can be
    substantially narrowed. The longer loading times for zlib/lzma (and
    similarly the saving times) reflect a trade-off between smaller file
    sizes and the time spent on compression/decompression/disk I/O.

    I think the load time for mmap may be deceptive: it isn't actually
    loading anything, just mapping the file into memory. Maybe a better
    benchmark is to actually process the data, e.g., to compute the
    mean, which would require reading all the values.


yes, that is correct; I meant to mention that it wasn't an apples-to-apples comparison.
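The effect Neal describes can be seen directly with `np.load` alone. Below is a minimal sketch (using a small stand-in array rather than the ~800 MB `eye1e4.npy` from the benchmark) that separates the cost of merely mapping the file from the cost of actually reading the pages:

```python
import os
import tempfile
import time

import numpy as np

# Small stand-in for eye1e4.npy (the real benchmark file is ~800 MB).
path = os.path.join(tempfile.mkdtemp(), 'eye.npy')
np.save(path, np.eye(1000))

# Timing only np.load(mmap_mode='r') measures the mapping, not the I/O.
t0 = time.time()
a = np.load(path, mmap_mode='r')
t_map = time.time() - t0          # near-instant: no data has been read yet

# Forcing a reduction touches every element, so the pages actually load.
t0 = time.time()
m = a.mean()
t_read = time.time() - t0

print(f"map only: {t_map:.4f}s, map + mean: {t_map + t_read:.4f}s, mean={m}")
```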

The times for fully loading the data and printing the mean, obtained by running the line below,

t = time.time(); newy = jd.load('eye1e4_bjd_raw_ndsyntax.jdb'); print(np.mean(newy)); t1 = time.time() - t; print(t1)

are summarized below (I also added an lz4-compressed BJData/.jdb file via jd.save(..., {'compression':'lz4'})):

0.236s  eye1e4.npy (mmap_mode=None)                                      - size: 800000128 bytes
0.120s  eye1e4.npy (mmap_mode=r)
0.764s  eye1e4_bjd_raw_ndsyntax.jdb (with C extension _bjdata in sys.path) - size: 800000014 bytes
0.599s  eye1e4_bjd_raw_ndsyntax.jdb (without C extension _bjdata)
1.533s  eye1e4_bjd_zlib.jdb        (without C extension _bjdata)           - size: 813721 bytes
0.697s  eye1e4_bjd_lzma.jdb        (without C extension _bjdata)           - size: 113067 bytes
0.918s  eye1e4_bjd_lz4.jdb         (without C extension _bjdata)           - size: 3371487 bytes

The mmapped loading remains the fastest, but the run-time is more realistic. I thought the lz4 compression would offer much faster decompression, but for this particular workload that isn't the case.
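The size-vs-time trade-off behind these numbers can be reproduced with the standard library alone. The sketch below is not the jdata/BJData codepath (the file names and sizes above come from my runs with jd.save/jd.load); it simply compresses the raw buffer of a small identity matrix with zlib and lzma, a stand-in for the highly compressible 800 MB array:

```python
import time
import zlib
import lzma

import numpy as np

# Raw buffer of a small identity matrix (~8 MB), mostly zeros and thus
# highly compressible, like the eye(1e4) array in the benchmark.
raw = np.eye(1000).tobytes()

results = {}
for name, comp, decomp in [('zlib', zlib.compress, zlib.decompress),
                           ('lzma', lzma.compress, lzma.decompress)]:
    t0 = time.time()
    blob = comp(raw)
    t_c = time.time() - t0
    t0 = time.time()
    back = decomp(blob)
    t_d = time.time() - t0
    assert back == raw            # round-trip check
    results[name] = len(blob)
    print(f"{name}: {len(raw)} -> {len(blob)} bytes, "
          f"compress {t_c:.3f}s, decompress {t_d:.3f}s")
```

As in the table above, lzma should produce a much smaller blob than zlib at the cost of a slower compression pass.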

It is also interesting to see that bjdata's C extension <https://github.com/NeuroJSON/pybj/tree/master/src> did not help when parsing a single large array compared to the native Python parser, suggesting room for further optimization.
Qianqian
_______________________________________________
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-le...@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/