Hi Qianqian,

Your work on bjdata is very interesting.  Our team (Blosc) has been
working on something along these lines, and I was curious how the
different approaches compare.  In particular, Blosc2 uses the msgpack
format to store binary data in a flexible way, but in my experience,
whether you use binary JSON or msgpack is not that important; what really
matters is being able to compress data in chunks that fit in CPU caches,
and then relying on fast codecs and filters for speed.
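
To make this concrete, here is a minimal sketch of round-tripping a
numpy array through blosc2 (my own example, not the benchmark code; the
gist linked below has the exact calls used there).  Blosc splits the
buffer into cache-sized blocks internally, and the codec is selectable
per call:

  import blosc2
  import numpy as np

  a = np.eye(10_000)  # highly compressible, like the benchmark data

  # typesize tells the shuffle filter the item width so it can reorder
  # bytes for better compression; pick BLOSCLZ for speed, ZSTD for ratio
  c = blosc2.compress(a, typesize=a.itemsize, codec=blosc2.Codec.ZSTD)
  with open("eye_zstd.bl2", "wb") as f:
      f.write(c)

  with open("eye_zstd.bl2", "rb") as f:
      b = np.frombuffer(blosc2.decompress(f.read()), dtype=a.dtype)
  assert (a == b.reshape(a.shape)).all()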

I have set up a small benchmark (
https://gist.github.com/FrancescAlted/e4d186404f4c87d9620cb6f89a03ba0d)
based on your setup; here are my numbers (using an AMD 5950X processor
and a fast SSD):

(python-blosc2) faltet2@ryzen16:~/blosc/python-blosc2/bench$ PYTHONPATH=..
python read-binary-data.py save
time for creating big array (and splits): 0.009s (86.5 GB/s)

** Saving data **
time for saving with npy: 0.450s (1.65 GB/s)
time for saving with np.memmap: 0.689s (1.08 GB/s)
time for saving with npz: 1.021s (0.73 GB/s)
time for saving with jdb (zlib): 4.614s (0.161 GB/s)
time for saving with jdb (lzma): 11.294s (0.066 GB/s)
time for saving with blosc2 (blosclz): 0.020s (37.8 GB/s)
time for saving with blosc2 (zstd): 0.153s (4.87 GB/s)

** Load and operate **
time for reducing with plain numpy (memory): 0.016s (47.4 GB/s)
time for reducing with npy (np.load, no mmap): 0.144s (5.18 GB/s)
time for reducing with np.memmap: 0.055s (13.6 GB/s)
time for reducing with npz: 1.808s (0.412 GB/s)
time for reducing with jdb (zlib): 1.624s (0.459 GB/s)
time for reducing with jdb (lzma): 0.255s (2.92 GB/s)
time for reducing with blosc2 (blosclz): 0.042s (17.7 GB/s)
time for reducing with blosc2 (zstd): 0.070s (10.7 GB/s)
Total sum: 10000.0

So, it is evident that in this scenario compression can accelerate things
a lot, especially for saving data.  Here are the sizes:

(python-blosc2) faltet2@ryzen16:~/blosc/python-blosc2/bench$ ll -h eye5*
-rw-rw-r-- 1 faltet2 faltet2 989K ago 27 09:51 eye5_blosc2_blosclz.b2frame
-rw-rw-r-- 1 faltet2 faltet2 188K ago 27 09:51 eye5_blosc2_zstd.b2frame
-rw-rw-r-- 1 faltet2 faltet2 121K ago 27 09:51 eye5chunk_bjd_lzma.jdb
-rw-rw-r-- 1 faltet2 faltet2 795K ago 27 09:51 eye5chunk_bjd_zlib.jdb
-rw-rw-r-- 1 faltet2 faltet2 763M ago 27 09:51 eye5chunk-memmap.npy
-rw-rw-r-- 1 faltet2 faltet2 763M ago 27 09:51 eye5chunk.npy
-rw-rw-r-- 1 faltet2 faltet2 785K ago 27 09:51 eye5chunk.npz

Regarding decompression, I am quite pleased with how jdb+lzma performs
(especially its compression ratio).  But to get a better idea of the
actual read performance, it is better to evict the files from the OS
cache first.  The benchmark also performs an operation on the data (in
this case a reduction) to make sure that all the data is processed.
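
In code, the measurement pattern for each format is roughly this (a
minimal sketch with names of my own choosing; the gist has the real code):

  import os
  import time

  import numpy as np

  fname = "eye5chunk.npy"
  os.system(f"vmtouch -e {fname}")  # evict from the OS page cache

  t0 = time.time()
  a = np.load(fname)  # actually reads the bytes from disk
  total = a.sum()     # the reduction pushes every value through the CPU
  dt = time.time() - t0
  print(f"{dt:.3f}s ({a.nbytes / dt / 2**30:.1f} GB/s), sum={total}")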

So, let's evict the files:

(python-blosc2) faltet2@ryzen16:~/blosc/python-blosc2/bench$ vmtouch -ev
eye5*
Evicting eye5_blosc2_blosclz.b2frame
Evicting eye5_blosc2_zstd.b2frame
Evicting eye5chunk_bjd_lzma.jdb
Evicting eye5chunk_bjd_zlib.jdb
Evicting eye5chunk-memmap.npy
Evicting eye5chunk.npy
Evicting eye5chunk.npz

           Files: 7
     Directories: 0
   Evicted Pages: 391348 (1G)
         Elapsed: 0.084441 seconds

And then re-run the benchmark (without re-creating the files this time):

(python-blosc2) faltet2@ryzen16:~/blosc/python-blosc2/bench$ PYTHONPATH=..
python read-binary-data.py
time for creating big array (and splits): 0.009s (80.4 GB/s)

** Load and operate **
time for reducing with plain numpy (memory): 0.065s (11.5 GB/s)
time for reducing with npy (np.load, no mmap): 0.413s (1.81 GB/s)
time for reducing with np.memmap: 0.547s (1.36 GB/s)
time for reducing with npz: 1.881s (0.396 GB/s)
time for reducing with jdb (zlib): 1.845s (0.404 GB/s)
time for reducing with jdb (lzma): 0.204s (3.66 GB/s)
time for reducing with blosc2 (blosclz): 0.043s (17.2 GB/s)
time for reducing with blosc2 (zstd): 0.072s (10.4 GB/s)
Total sum: 10000.0

In this case we can see that the blosc2+blosclz combination achieves
speeds that are faster than using a plain numpy array in memory.  Having
disk I/O run faster than memory looks strange at first, but these arrays
compress extremely well (more than 1000x in this case), so the I/O
overhead is really low compared with the cost of computation; all the
decompression takes place in CPU cache, not in main memory.  In the end,
this is not that surprising.
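
A quick back-of-the-envelope check with this run's numbers (my own
arithmetic, not part of the benchmark output):

  uncompressed = 8e8   # 10000 x 10000 float64 = 800 MB (the .npy size)
  t_mem = 0.065        # plain-numpy in-memory reduction time above
  print(uncompressed / t_mem / 2**30)  # ~11.5 GB/s: the memory floor

  compressed = 989e3   # eye5_blosc2_blosclz.b2frame on disk
  print(compressed / 0.5e9)  # ~2 ms to read, even from a 0.5 GB/s disk

So nearly all of the 0.042s measured for blosc2+blosclz goes to in-cache
decompression plus the reduction itself, and almost none to I/O.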

Cheers!


On Fri, Aug 26, 2022 at 4:26 AM Qianqian Fang <fan...@gmail.com> wrote:

> On 8/25/22 18:33, Neal Becker wrote:
>
>
>
>> the loading time (from an nvme drive, Ubuntu 18.04, python 3.6.9, numpy
>> 1.19.5) for each file is listed below:
>>
>> 0.179s  eye1e4.npy (mmap_mode=None)
>> 0.001s  eye1e4.npy (mmap_mode=r)
>> 0.718s  eye1e4_bjd_raw_ndsyntax.jdb
>> 1.474s  eye1e4_bjd_zlib.jdb
>> 0.635s  eye1e4_bjd_lzma.jdb
>>
>>
>> clearly, mmapped loading is the fastest option, which is no surprise;
>> it is true that the raw bjdata file is about 5x slower than npy loading,
>> but given that the main chunk of the data is stored identically (as a
>> contiguous buffer), I suppose that with some optimization of the
>> decoder, the gap between the two can be substantially narrowed. The
>> longer loading times for zlib/lzma (and similarly the saving times)
>> reflect a trade-off between smaller file sizes and time for
>> compression/decompression/disk-IO.
>>
>> I think the load time for mmap may be deceptive; it isn't actually
>> loading anything, just mapping into memory.  Maybe a better benchmark is
>> to actually process the data, e.g., find the mean, which would require
>> reading the values.
>
>
> yes, that is correct, I meant to mention it wasn't an apples-to-apples
> comparison.
>
> the loading times for fully loading the data and printing the mean,
> obtained by running the line below,
>
> t = time.time(); newy = jd.load('eye1e4_bjd_raw_ndsyntax.jdb');
> print(np.mean(newy)); t1 = time.time() - t; print(t1)
>
> are summarized below (I also added an lz4-compressed BJData/.jdb file
> via jd.save(..., {'compression':'lz4'})):
>
> 0.236s  eye1e4.npy (mmap_mode=None) - size: 800000128 bytes
> 0.120s  eye1e4.npy (mmap_mode=r)
> 0.764s  eye1e4_bjd_raw_ndsyntax.jdb (with C extension _bjdata in
> sys.path) - size: 800000014 bytes
> 0.599s  eye1e4_bjd_raw_ndsyntax.jdb (without C extension _bjdata)
> 1.533s  eye1e4_bjd_zlib.jdb (without C extension _bjdata) - size:
> 813721 bytes
> 0.697s  eye1e4_bjd_lzma.jdb (without C extension _bjdata) - size:
> 113067 bytes
> 0.918s  eye1e4_bjd_lz4.jdb (without C extension _bjdata)  - size:
> 3371487 bytes
>
> the mmapped loading remains the fastest, but the run-time is now more
> realistic. I thought lz4 compression would offer much faster
> decompression, but for this particular workload, that isn't the case.
>
> It is also interesting to see that bjdata's C extension
> <https://github.com/NeuroJSON/pybj/tree/master/src> did not help when
> parsing a single large array compared to the native python parser,
> suggesting room for further optimization.
>
>
> Qianqian


-- 
Francesc Alted