A warning about HDF5: it is not a database format, so you have to be
extremely careful if the data is getting updated while it is open for
reading by anybody else. If it is strictly read-only, and nobody else is
updating it, then have at it!
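
For what it's worth, a minimal sketch of the safe read-only pattern with
h5py (file and dataset names here are hypothetical). Newer HDF5 (>= 1.10)
also offers SWMR, single-writer/multiple-reader, for the concurrent case:

import h5py

# Safe: open strictly read-only while nobody is writing.
with h5py.File("data.h5", "r") as f:
    arr = f["mydataset"][:]  # pull the dataset into memory

# If exactly one writer must stay active, SWMR mode is designed for
# one writer plus many readers (requires HDF5 >= 1.10):
f = h5py.File("data.h5", "r", libver="latest", swmr=True)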

Cheers!
Ben Root

On Thu, Jan 14, 2016 at 9:16 AM, Edison Gustavo Muenz <
edisongust...@gmail.com> wrote:

> From what I know, this is the kind of use case that Dask is meant to solve.
>
> I think this blog post can help:
> https://www.continuum.io/content/xray-dask-out-core-labeled-arrays-python
>
> Notice that I haven't used any of these projects myself.
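>
> Just to illustrate the typical pattern (hypothetical file/dataset names;
> as I said, I haven't run these projects myself), dask.array can wrap an
> on-disk HDF5 dataset and compute on it chunk by chunk, out of core:
>
> import h5py
> import dask.array as da
>
> f = h5py.File("store.h5", "r")                # hypothetical file
> x = da.from_array(f["ints"], chunks=1000000)  # lazy, chunked view
> print(x.sum().compute())                      # streams over the chunks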
>
> On Thu, Jan 14, 2016 at 11:48 AM, Francesc Alted <fal...@gmail.com> wrote:
>
>> Well, maybe something like a simple class emulating a dictionary that
>> stores key-value pairs on disk would be more than enough.  Then you can
>> use whatever persistence layer you want (even HDF5, but not necessarily).
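>>
>> Something along these lines would do (a minimal, hypothetical sketch
>> using one .npz file per key; the gist below is a more complete take):
>>
>> import os
>> import numpy as np
>>
>> class KeyStore(object):
>>     """Dict-like store: key -> [int array, float array], one file per key."""
>>     def __init__(self, path):
>>         self.path = path
>>         if not os.path.isdir(path):
>>             os.makedirs(path)
>>
>>     def __setitem__(self, key, value):
>>         ints, floats = value
>>         np.savez(os.path.join(self.path, "%d.npz" % key),
>>                  ints=ints, floats=floats)
>>
>>     def __getitem__(self, key):
>>         data = np.load(os.path.join(self.path, "%d.npz" % key))
>>         return [data["ints"], data["floats"]]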
>>
>> As a demonstration I did a quick and dirty implementation of such a
>> persistent key-store thing (
>> https://gist.github.com/FrancescAlted/8e87c8762a49cf5fc897).  In it, the
>> KeyStore class (less than 40 lines long) is responsible for storing the
>> value (2 arrays) under a key (a directory).  As I am quite a big fan of
>> compression, I implemented a couple of serialization flavors: one using the
>> .npz format (so no dependencies other than NumPy are needed) and the other
>> using the ctable object from the bcolz package (bcolz.blosc.org).  Here
>> are some performance numbers:
>>
>> $ python key-store.py -f numpy -d __test -l 0
>> ########## Checking method: numpy (via .npz files) ############
>> Building database.  Wait please...
>> Time (            creation) --> 1.906
>> Retrieving 100 keys in arbitrary order...
>> Time (               query) --> 0.191
>> Number of elements out of getitem: 10518976
>> faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
>>
>> 75M     __test
>>
>> So, with the NPZ format we can deal with the 75 MB quite easily.  But NPZ
>> can compress data as well, so let's see how it goes:
>>
>> $ python key-store.py -f numpy -d __test -l 9
>> ########## Checking method: numpy (via .npz files) ############
>> Building database.  Wait please...
>> Time (            creation) --> 6.636
>> Retrieving 100 keys in arbitrary order...
>> Time (               query) --> 0.384
>> Number of elements out of getitem: 10518976
>> faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
>> 28M     __test
>>
>> Ok, in this case we get almost a 3x compression ratio, which is not
>> bad.  However, the performance has degraded a lot.  Now let's try bcolz,
>> first in non-compressed mode:
>>
>> $ python key-store.py -f bcolz -d __test -l 0
>> ########## Checking method: bcolz (via ctable(clevel=0, cname='blosclz')
>> ############
>> Building database.  Wait please...
>> Time (            creation) --> 0.479
>> Retrieving 100 keys in arbitrary order...
>> Time (               query) --> 0.103
>> Number of elements out of getitem: 10518976
>> faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
>> 82M     __test
>>
>> Without compression, bcolz takes a bit more (~10%) space than NPZ.
>> However, bcolz is actually meant to be used with compression on by default:
>>
>> $ python key-store.py -f bcolz -d __test -l 9
>> ########## Checking method: bcolz (via ctable(clevel=9, cname='blosclz')
>> ############
>> Building database.  Wait please...
>> Time (            creation) --> 0.487
>> Retrieving 100 keys in arbitrary order...
>> Time (               query) --> 0.098
>> Number of elements out of getitem: 10518976
>> faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
>> 29M     __test
>>
>> So, the final disk usage is quite similar to NPZ, but it can store and
>> retrieve data much faster.  Also, the decompression speed is on par with
>> using no compression at all.  This is because bcolz uses Blosc behind the
>> scenes, which is much faster than zlib (used by NPZ), and sometimes even
>> faster than a memcpy().  However, even though we are doing I/O against the
>> disk, this dataset is so small that it fits in the OS filesystem cache, so
>> the benchmark is actually checking I/O at memory speeds, not disk speeds.
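>>
>> For reference, the bcolz flavor boils down to something like this (a
>> sketch with made-up sample data and directory name; ctable/cparams are
>> the actual bcolz API):
>>
>> import numpy as np
>> import bcolz
>>
>> ints = np.arange(10, dtype=np.int64)   # stand-ins for one key's arrays
>> floats = np.full(10, 0.1)
>> keydir = "key-0.bcolz"                 # one directory per key
>>
>> # Write both columns compressed with Blosc (clevel=9, blosclz codec):
>> ct = bcolz.ctable([ints, floats], names=["ints", "floats"],
>>                   rootdir=keydir, mode="w",
>>                   cparams=bcolz.cparams(clevel=9, cname="blosclz"))
>>
>> # Reading back:
>> ct = bcolz.open(keydir)
>> ints, floats = ct["ints"][:], ct["floats"][:]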
>>
>> In order to do a more real-life comparison, let's use a dataset that is
>> much larger than the amount of memory in my laptop (8 GB):
>>
>> $ PYTHONPATH=. python key-store.py -f bcolz -m 1000000 -k 5000 -d
>> /media/faltet/docker/__test -l 0
>> ########## Checking method: bcolz (via ctable(clevel=0, cname='blosclz')
>> ############
>> Building database.  Wait please...
>> Time (            creation) --> 133.650
>> Retrieving 100 keys in arbitrary order...
>> Time (               query) --> 2.881
>> Number of elements out of getitem: 91907396
>> faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh
>> /media/faltet/docker/__test
>>
>> 39G     /media/faltet/docker/__test
>>
>> and now, with compression on:
>>
>> $ PYTHONPATH=. python key-store.py -f bcolz -m 1000000 -k 5000 -d
>> /media/faltet/docker/__test -l 9
>> ########## Checking method: bcolz (via ctable(clevel=9, cname='blosclz')
>> ############
>> Building database.  Wait please...
>> Time (            creation) --> 145.633
>> Retrieving 100 keys in arbitrary order...
>> Time (               query) --> 1.339
>> Number of elements out of getitem: 91907396
>> faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh
>> /media/faltet/docker/__test
>>
>> 12G     /media/faltet/docker/__test
>>
>> So, we are still seeing the 3x compression ratio.  But the interesting
>> thing here is that the compressed version runs more than twice as fast as
>> the uncompressed one (13 ms/query vs 29 ms/query).  In this case I was
>> using an SSD (hence the low query times), so the compression advantage is
>> even more noticeable than when working from memory as above (as expected).
>>
>> But anyway, this is just a demonstration that you don't need heavy tools
>> to achieve what you want.  And as a corollary, (fast) compressors can save
>> you not only storage, but processing time too.
>>
>> Francesc
>>
>>
>> 2016-01-14 11:19 GMT+01:00 Nathaniel Smith <n...@pobox.com>:
>>
>>> I'd try storing the data in hdf5 (probably via h5py, which is a more
>>> basic interface without all the bells-and-whistles that pytables
>>> adds), though any method you use is going to be limited by the need to
>>> do a seek before each read. Storing the data on SSD will probably help
>>> a lot if you can afford it for your data size.
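>>>
>>> For instance, one possible layout (hypothetical; I haven't benchmarked
>>> it): since your keys are dense integers, concatenate all the ragged
>>> arrays into two big 1-d datasets plus an offsets index, so each lookup
>>> is one seek and one contiguous read per column:
>>>
>>> import h5py
>>>
>>> f = h5py.File("store.h5", "r")   # hypothetical layout
>>> offsets = f["offsets"][:]        # small index; keep it in RAM
>>>
>>> def get(key):
>>>     lo, hi = offsets[key], offsets[key + 1]
>>>     return [f["ints"][lo:hi], f["floats"][lo:hi]]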
>>>
>>> On Thu, Jan 14, 2016 at 1:15 AM, Ryan R. Rosario <r...@bytemining.com>
>>> wrote:
>>> > Hi,
>>> >
>>> > I have a very large dictionary that must be shared across processes
>>> and does not fit in RAM. I need access to this object to be fast. The key
>>> is an integer ID and the value is a list containing two elements, both of
>>> them numpy arrays (one has ints, the other has floats). The key is
>>> sequential, starts at 0, and there are no gaps, so the “outer” layer of
>>> this data structure could really just be a list with the key actually being
>>> the index. The lengths of each pair of arrays may differ across keys.
>>> >
>>> > For a visual:
>>> >
>>> > {
>>> > key=0:
>>> >         [
>>> >                 numpy.array([1,8,15,…, 16000]),
>>> >                 numpy.array([0.1,0.1,0.1,…,0.1])
>>> >         ],
>>> > key=1:
>>> >         [
>>> >                 numpy.array([5,6]),
>>> >                 numpy.array([0.5,0.5])
>>> >         ],
>>> > …
>>> > }
>>> >
>>> > I’ve tried:
>>> > -       manager proxy objects, but the object was so big that
>>> low-level code threw an exception due to format and monkey-patching wasn’t
>>> successful.
>>> > -       Redis, which was far too slow due to setting up connections
>>> and data conversion etc.
>>> > -       Numpy rec arrays + memory mapping, but there is a restriction
>>> that the numpy arrays in each “column” must be of fixed and same size.
>>> > -       I looked at PyTables, which may be a solution, but seems to
>>> have a very steep learning curve.
>>> > -       I haven’t tried SQLite3, but I am worried about the time it
>>> takes to query the DB for a sequential ID, and then translate byte arrays.
>>> >
>>> > Any ideas? I greatly appreciate any guidance you can provide.
>>> >
>>> > Thanks,
>>> > Ryan
>>>
>>> --
>>> Nathaniel J. Smith -- http://vorpus.org
>>
>> --
>> Francesc Alted
>>