Re: [Numpy-discussion] Fast Access to Container of Numpy Arrays on Disk?

Edison Gustavo Muenz Thu, 14 Jan 2016 06:17:05 -0800

From what I know, this is the use case that Dask seems to solve.

I think this blog post can help:
https://www.continuum.io/content/xray-dask-out-core-labeled-arrays-python


Note that I haven't used any of these projects myself.

On Thu, Jan 14, 2016 at 11:48 AM, Francesc Alted <fal...@gmail.com> wrote:

> Well, maybe something like a simple class emulating a dictionary that
> stores key-value pairs on disk would be more than enough.  Then you can use
> whatever persistence layer you want (even HDF5, but not necessarily).
>
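A dictionary-like class of this kind can be quite small. The following is a minimal sketch, not Francesc's implementation: the class name `NpzKeyStore`, the one-file-per-key layout, and the `ints`/`floats` field names are assumptions made for illustration. Only `np.savez`, `np.savez_compressed`, and `np.load` are actual NumPy calls.

```python
import os
import tempfile

import numpy as np


class NpzKeyStore:
    """Hypothetical dict-like store: one .npz file per integer key."""

    def __init__(self, rootdir, compress=False):
        self.rootdir = rootdir
        self.compress = compress
        os.makedirs(rootdir, exist_ok=True)

    def _path(self, key):
        # Assumed layout: <rootdir>/<key>.npz
        return os.path.join(self.rootdir, "%d.npz" % key)

    def __setitem__(self, key, value):
        ints, floats = value
        save = np.savez_compressed if self.compress else np.savez
        save(self._path(key), ints=ints, floats=floats)

    def __getitem__(self, key):
        path = self._path(key)
        if not os.path.exists(path):
            raise KeyError(key)
        # Each array is read from disk when it is indexed by name.
        with np.load(path) as data:
            return [data["ints"], data["floats"]]

    def __contains__(self, key):
        return os.path.exists(self._path(key))


# Usage with the two example entries from the original question.
store = NpzKeyStore(tempfile.mkdtemp(), compress=True)
store[0] = [np.array([1, 8, 15]), np.array([0.1, 0.1, 0.1])]
store[1] = [np.array([5, 6]), np.array([0.5, 0.5])]
ints, floats = store[1]
```

Since every key lives in its own file, several processes can read the store at the same time without any coordination, as long as nobody is writing.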
> As a demonstration I did a quick and dirty implementation of such a
> persistent key-store thing (
> https://gist.github.com/FrancescAlted/8e87c8762a49cf5fc897).  In it, the
> KeyStore class (less than 40 lines long) is responsible for storing the
> value (2 arrays) under a key (a directory).  As I am quite a big fan of
> compression, I implemented a couple of serialization flavors: one using the
> .npz format (so no dependencies other than NumPy are needed) and the other
> using the ctable object from the bcolz package (bcolz.blosc.org).  Here
> are some performance numbers:
>
> $ python key-store.py -f numpy -d __test -l 0
> ########## Checking method: numpy (via .npz files) ############
> Building database.  Wait please...
> Time (            creation) --> 1.906
> Retrieving 100 keys in arbitrary order...
> Time (               query) --> 0.191
> Number of elements out of getitem: 10518976
> faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
>
> 75M     __test
>
> So, with the NPZ format we can deal with the 75 MB quite easily.  But NPZ
> can compress data as well, so let's see how it goes:
>
> $ python key-store.py -f numpy -d __test -l 9
> ########## Checking method: numpy (via .npz files) ############
> Building database.  Wait please...
> Time (            creation) --> 6.636
> Retrieving 100 keys in arbitrary order...
> Time (               query) --> 0.384
> Number of elements out of getitem: 10518976
> faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
> 28M     __test
>
> OK, in this case we get almost a 3x compression ratio, which is not
> bad.  However, the performance has degraded a lot.  Let's now use bcolz,
> first in non-compressed mode:
>
> $ python key-store.py -f bcolz -d __test -l 0
> ########## Checking method: bcolz (via ctable(clevel=0, cname='blosclz')
> ############
> Building database.  Wait please...
> Time (            creation) --> 0.479
> Retrieving 100 keys in arbitrary order...
> Time (               query) --> 0.103
> Number of elements out of getitem: 10518976
> faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
> 82M     __test
>
> Without compression, bcolz takes a bit more (~10%) space than NPZ.
> However, bcolz is actually meant to be used with compression on by default:
>
> $ python key-store.py -f bcolz -d __test -l 9
> ########## Checking method: bcolz (via ctable(clevel=9, cname='blosclz')
> ############
> Building database.  Wait please...
> Time (            creation) --> 0.487
> Retrieving 100 keys in arbitrary order...
> Time (               query) --> 0.98
> Number of elements out of getitem: 10518976
> faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
> 29M     __test
>
> So, the final disk usage is quite similar to NPZ, but bcolz can store and
> retrieve the data much faster.  Also, the data decompression speed is on
> par with using no compression.  This is because bcolz uses Blosc behind the
> scenes, which is much faster than zlib (used by NPZ), and sometimes faster
> than a memcpy().  However, even though we are doing I/O against the disk,
> this dataset is so small that it fits in the OS filesystem cache, so the
> benchmark is actually measuring I/O at memory speeds, not disk speeds.
>
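The storage side of the NPZ comparison can be tried in miniature with plain NumPy. This is only a sketch under assumptions: the array contents and sizes are invented, and `np.savez_compressed` uses its own default zlib settings, so it does not correspond exactly to the `-l 9` flag of the benchmark script.

```python
import os
import tempfile

import numpy as np

# Invented test data: sorted integer IDs and a constant float column,
# loosely modelled on the arrays shown in the original question.
rng = np.random.default_rng(0)
ints = np.sort(rng.integers(0, 16000, size=100_000))
floats = np.full(100_000, 0.1)

root = tempfile.mkdtemp()
plain_path = os.path.join(root, "plain.npz")
packed_path = os.path.join(root, "packed.npz")

np.savez(plain_path, ints=ints, floats=floats)              # stored, no compression
np.savez_compressed(packed_path, ints=ints, floats=floats)  # zlib (deflate)

plain_size = os.path.getsize(plain_path)
packed_size = os.path.getsize(packed_path)
print("uncompressed: %d bytes, compressed: %d bytes" % (plain_size, packed_size))

# Check that the compressed file round-trips to the same data.
with np.load(packed_path) as data:
    roundtrip_ok = (np.array_equal(data["ints"], ints)
                    and np.array_equal(data["floats"], floats))
```

How much the file shrinks depends entirely on the data; highly regular arrays like these compress far better than random floats would.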
> For a more realistic comparison, let's use a dataset that is
> much larger than the amount of memory in my laptop (8 GB):
>
> $ PYTHONPATH=. python key-store.py -f bcolz -m 1000000 -k 5000 -d
> /media/faltet/docker/__test -l 0
> ########## Checking method: bcolz (via ctable(clevel=0, cname='blosclz')
> ############
> Building database.  Wait please...
> Time (            creation) --> 133.650
> Retrieving 100 keys in arbitrary order...
> Time (               query) --> 2.881
> Number of elements out of getitem: 91907396
> faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh
> /media/faltet/docker/__test
>
> 39G     /media/faltet/docker/__test
>
> and now, with compression on:
>
> $ PYTHONPATH=. python key-store.py -f bcolz -m 1000000 -k 5000 -d
> /media/faltet/docker/__test -l 9
> ########## Checking method: bcolz (via ctable(clevel=9, cname='blosclz')
> ############
> Building database.  Wait please...
> Time (            creation) --> 145.633
> Retrieving 100 keys in arbitrary order...
> Time (               query) --> 1.339
> Number of elements out of getitem: 91907396
> faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh
> /media/faltet/docker/__test
>
> 12G     /media/faltet/docker/__test
>
> So, we are still seeing the 3x compression ratio.  But the interesting
> thing here is that the compressed version is roughly twice as fast as the
> uncompressed one (13 ms/query vs 29 ms/query).  The data sat on an SSD
> (hence the low query times), and since real disk I/O is involved here, the
> compression advantage is even more noticeable than in the in-memory runs
> above (as expected).
>
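For reference, the per-query figures and ratios follow directly from the two large-dataset runs. The snippet below only redoes that arithmetic with the reported numbers (100 queries, 2.881 s vs 1.339 s, 39G vs 12G); the variable names are made up for the example.

```python
# Numbers copied from the two benchmark runs on the large dataset.
n_queries = 100
t_plain, t_comp = 2.881, 1.339          # total query time in seconds
size_plain_gb, size_comp_gb = 39, 12    # disk usage reported by du

ms_plain = 1000 * t_plain / n_queries   # milliseconds per query, clevel=0
ms_comp = 1000 * t_comp / n_queries     # milliseconds per query, clevel=9
speedup = t_plain / t_comp              # how many times faster with compression
ratio = size_plain_gb / size_comp_gb    # compression ratio on disk

print("%.1f ms/query vs %.1f ms/query" % (ms_plain, ms_comp))
print("speedup: %.2fx, compression ratio: %.2fx" % (speedup, ratio))
```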
> But anyway, this is just a demonstration that you don't need heavy tools
> to achieve what you want.  And as a corollary, (fast) compressors can save
> you not only storage, but processing time too.
>
> Francesc
>
>
> 2016-01-14 11:19 GMT+01:00 Nathaniel Smith <n...@pobox.com>:
>
>> I'd try storing the data in hdf5 (probably via h5py, which is a more
>> basic interface without all the bells-and-whistles that pytables
>> adds), though any method you use is going to be limited by the need to
>> do a seek before each read. Storing the data on SSD will probably help
>> a lot if you can afford it for your data size.
>>
>> On Thu, Jan 14, 2016 at 1:15 AM, Ryan R. Rosario <r...@bytemining.com>
>> wrote:
>> > Hi,
>> >
>> > I have a very large dictionary that must be shared across processes and
>> > does not fit in RAM. I need access to this object to be fast. The key is an
>> > integer ID and the value is a list containing two elements, both of them
>> > numpy arrays (one has ints, the other has floats). The key is sequential,
>> > starts at 0, and there are no gaps, so the “outer” layer of this data
>> > structure could really just be a list with the key actually being the
>> > index. The lengths of each pair of arrays may differ across keys.
>> >
>> > For a visual:
>> >
>> > {
>> > key=0:
>> >         [
>> >                 numpy.array([1,8,15,…, 16000]),
>> >                 numpy.array([0.1,0.1,0.1,…,0.1])
>> >         ],
>> > key=1:
>> >         [
>> >                 numpy.array([5,6]),
>> >                 numpy.array([0.5,0.5])
>> >         ],
>> > …
>> > }
>> >
>> > I’ve tried:
>> > -       manager proxy objects, but the object was so big that low-level
>> >         code threw an exception due to format and monkey-patching wasn’t successful.
>> > -       Redis, which was far too slow due to setting up connections and
>> >         data conversion etc.
>> > -       Numpy rec arrays + memory mapping, but there is a restriction
>> >         that the numpy arrays in each “column” must be of fixed and same size.
>> > -       I looked at PyTables, which may be a solution, but seems to
>> >         have a very steep learning curve.
>> > -       I haven’t tried SQLite3, but I am worried about the time it
>> >         takes to query the DB for a sequential ID, and then translate byte arrays.
>> >
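On the memory-mapping item: the fixed-size restriction can be sidestepped because the keys are sequential. The sketch below is an assumption on my part, not something proposed in this thread: it concatenates all arrays into two flat files plus an offsets array, so that key `k` maps to the slice `offsets[k]:offsets[k + 1]`. The file names and the `get_pair` helper are hypothetical, and a dataset that does not fit in RAM would have to be written incrementally rather than with one `np.concatenate`.

```python
import os
import tempfile

import numpy as np

# Toy data in the shape described above: one (ints, floats) pair per key,
# with lengths that differ across keys.
pairs = [
    (np.array([1, 8, 15], dtype=np.int64), np.array([0.1, 0.1, 0.1])),
    (np.array([5, 6], dtype=np.int64), np.array([0.5, 0.5])),
]

root = tempfile.mkdtemp()

# Writer: flat arrays plus offsets marking where each key starts and ends.
lengths = [len(ints) for ints, _ in pairs]
offsets = np.concatenate(([0], np.cumsum(lengths))).astype(np.int64)
np.save(os.path.join(root, "offsets.npy"), offsets)
np.save(os.path.join(root, "ints.npy"), np.concatenate([p[0] for p in pairs]))
np.save(os.path.join(root, "floats.npy"), np.concatenate([p[1] for p in pairs]))

# Reader: each process memory-maps the same read-only files.
offsets_r = np.load(os.path.join(root, "offsets.npy"))
ints_mm = np.load(os.path.join(root, "ints.npy"), mmap_mode="r")
floats_mm = np.load(os.path.join(root, "floats.npy"), mmap_mode="r")


def get_pair(key):
    """Return views on the (ints, floats) arrays stored for an integer key."""
    start, stop = offsets_r[key], offsets_r[key + 1]
    return ints_mm[start:stop], floats_mm[start:stop]


ints1, floats1 = get_pair(1)
```

Only the offsets array is loaded fully into memory; the two data files are paged in by the OS on demand, and the pages are shared between processes that map the same files.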
>> > Any ideas? I greatly appreciate any guidance you can provide.
>> >
>> > Thanks,
>> > Ryan
>> > _______________________________________________
>> > NumPy-Discussion mailing list
>> > NumPy-Discussion@scipy.org
>> > https://mail.scipy.org/mailman/listinfo/numpy-discussion
>>
>>
>>
>> --
>> Nathaniel J. Smith -- http://vorpus.org
>>
>
>
>
> --
> Francesc Alted
>
>
>

