A warning about HDF5: it is not a database format, so you have to be extremely careful if the data is being updated while anyone else has it open for reading. If it is strictly read-only, and nobody else is updating it, then have at it!
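A minimal sketch of the safe pattern in h5py (the file and dataset names are made up; the SWMR reader mode needs HDF5 1.10+, a reasonably recent h5py, and a file written with libver="latest"):

import h5py

# Strictly read-only access: safe as long as nobody is writing.
with h5py.File("data.h5", "r") as f:        # "data.h5" is a hypothetical file
    arr = f["some_dataset"][:]              # pull one dataset into memory

# If a single writer must update the file while readers poll it,
# HDF5 1.10+ offers SWMR (single-writer/multiple-reader) mode:
f = h5py.File("data.h5", "r", libver="latest", swmr=True)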
Cheers!
Ben Root

On Thu, Jan 14, 2016 at 9:16 AM, Edison Gustavo Muenz <edisongust...@gmail.com> wrote:

> From what I know, this is the use case that Dask aims to solve.
>
> I think this blog post can help:
> https://www.continuum.io/content/xray-dask-out-core-labeled-arrays-python
>
> Note that I haven't used any of these projects myself.
>
> On Thu, Jan 14, 2016 at 11:48 AM, Francesc Alted <fal...@gmail.com> wrote:
>
>> Well, maybe something like a simple class emulating a dictionary that
>> stores each key-value pair on disk would be more than enough. Then you
>> can use whatever persistence layer you want (even HDF5, but not
>> necessarily).
>>
>> As a demonstration, I did a quick-and-dirty implementation of such a
>> persistent key-store (
>> https://gist.github.com/FrancescAlted/8e87c8762a49cf5fc897). In it, the
>> KeyStore class (less than 40 lines long) is responsible for storing a
>> value (2 arrays) under a key (a directory). As I am quite a big fan of
>> compression, I implemented two serialization flavors: one using the
>> .npz format (so no dependencies beyond NumPy are needed) and the other
>> using the ctable object from the bcolz package (bcolz.blosc.org). Here
>> are some performance numbers:
>>
>> $ python key-store.py -f numpy -d __test -l 0
>> ########## Checking method: numpy (via .npz files) ############
>> Building database. Wait please...
>> Time ( creation) --> 1.906
>> Retrieving 100 keys in arbitrary order...
>> Time ( query) --> 0.191
>> Number of elements out of getitem: 10518976
>> faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
>> 75M __test
>>
>> So, with the NPZ format we can deal with the 75 MB dataset quite easily.
>> But NPZ can compress data as well, so let's see how that goes:
>>
>> $ python key-store.py -f numpy -d __test -l 9
>> ########## Checking method: numpy (via .npz files) ############
>> Building database. Wait please...
>> Time ( creation) --> 6.636
>> Retrieving 100 keys in arbitrary order...
>> Time ( query) --> 0.384
>> Number of elements out of getitem: 10518976
>> faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
>> 28M __test
>>
>> OK, in this case we get almost a 3x compression ratio, which is not bad.
>> However, performance has degraded a lot. Let's try bcolz now, first in
>> non-compressed mode:
>>
>> $ python key-store.py -f bcolz -d __test -l 0
>> ########## Checking method: bcolz (via ctable(clevel=0, cname='blosclz') ############
>> Building database. Wait please...
>> Time ( creation) --> 0.479
>> Retrieving 100 keys in arbitrary order...
>> Time ( query) --> 0.103
>> Number of elements out of getitem: 10518976
>> faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
>> 82M __test
>>
>> Without compression, bcolz takes a bit more (~10%) space than NPZ.
>> However, bcolz is actually meant to be used with compression on by
>> default:
>>
>> $ python key-store.py -f bcolz -d __test -l 9
>> ########## Checking method: bcolz (via ctable(clevel=9, cname='blosclz') ############
>> Building database. Wait please...
>> Time ( creation) --> 0.487
>> Retrieving 100 keys in arbitrary order...
>> Time ( query) --> 0.098
>> Number of elements out of getitem: 10518976
>> faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
>> 29M __test
>>
>> So, the final disk usage is quite similar to NPZ, but bcolz can store
>> and retrieve much faster. Also, decompressing costs about the same as
>> not compressing at all.
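For illustration, here is a minimal sketch of what the .npz flavor of such a key-store could look like (a reconstruction under stated assumptions, not the code from the gist; the class name, argument names, and array labels are invented, and .npz only exposes compression on/off rather than a 0-9 level):

import os
import numpy as np

class NPZKeyStore:
    """Dict-like store: one .npz file per integer key inside a root directory."""

    def __init__(self, rootdir, compress=False):
        # np.savez_compressed uses zlib; plain np.savez stores raw arrays.
        self._save = np.savez_compressed if compress else np.savez
        self.rootdir = rootdir
        os.makedirs(rootdir, exist_ok=True)

    def _path(self, key):
        return os.path.join(self.rootdir, "%d.npz" % key)

    def __setitem__(self, key, arrays):
        ids, vals = arrays                   # the (int array, float array) pair
        self._save(self._path(key), ids=ids, vals=vals)

    def __getitem__(self, key):
        with np.load(self._path(key)) as f:
            return [f["ids"], f["vals"]]

Usage would mirror a dict: store[0] = [ids, vals] to write, store[0] to read back.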
>> This is because bcolz uses Blosc behind the scenes, which is much
>> faster than zlib (used by NPZ), and sometimes faster than a memcpy().
>> However, even though we are doing I/O against the disk, this dataset is
>> so small that it fits in the OS filesystem cache, so the benchmark is
>> actually measuring I/O at memory speed, not disk speed.
>>
>> To make a more realistic comparison, let's use a dataset that is much
>> larger than the amount of memory in my laptop (8 GB):
>>
>> $ PYTHONPATH=. python key-store.py -f bcolz -m 1000000 -k 5000 -d /media/faltet/docker/__test -l 0
>> ########## Checking method: bcolz (via ctable(clevel=0, cname='blosclz') ############
>> Building database. Wait please...
>> Time ( creation) --> 133.650
>> Retrieving 100 keys in arbitrary order...
>> Time ( query) --> 2.881
>> Number of elements out of getitem: 91907396
>> faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh /media/faltet/docker/__test
>> 39G /media/faltet/docker/__test
>>
>> And now, with compression on:
>>
>> $ PYTHONPATH=. python key-store.py -f bcolz -m 1000000 -k 5000 -d /media/faltet/docker/__test -l 9
>> ########## Checking method: bcolz (via ctable(clevel=9, cname='blosclz') ############
>> Building database. Wait please...
>> Time ( creation) --> 145.633
>> Retrieving 100 keys in arbitrary order...
>> Time ( query) --> 1.339
>> Number of elements out of getitem: 91907396
>> faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh /media/faltet/docker/__test
>> 12G /media/faltet/docker/__test
>>
>> So, we still see the 3x compression ratio. But the interesting thing
>> here is that the compressed version answers queries more than twice as
>> fast as the uncompressed one (13 ms/query vs 29 ms/query). In this case
>> I was using an SSD (hence the low query times), so the advantage of
>> compression is even more noticeable than when running from memory as
>> above (as expected).
>>
>> Anyway, this is just a demonstration that you don't need heavy tools to
>> achieve what you want. And as a corollary, (fast) compressors can save
>> you not only storage, but processing time too.
>>
>> Francesc
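And a correspondingly hedged sketch of the bcolz flavor, storing one key's pair of arrays as a compressed, disk-based ctable (again invented names and sizes, not the gist's code):

import bcolz
import numpy as np

# Toy data standing in for one key's (ids, vals) pair.
ids  = np.arange(1000000, dtype=np.int64)
vals = np.full(1000000, 0.1)

# One directory per key; Blosc compression parameters as in the benchmark.
ct = bcolz.ctable(columns=[ids, vals], names=["ids", "vals"],
                  rootdir="__test/0", mode="w",
                  cparams=bcolz.cparams(clevel=9, cname="blosclz"))
ct.flush()

# Reading back: bcolz.open() maps the on-disk ctable and Blosc decompresses
# chunks on access.
ct2 = bcolz.open("__test/0")
ids_back, vals_back = ct2["ids"][:], ct2["vals"][:]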
>> 2016-01-14 11:19 GMT+01:00 Nathaniel Smith <n...@pobox.com>:
>>
>>> I'd try storing the data in HDF5 (probably via h5py, which is a more
>>> basic interface without all the bells and whistles that PyTables
>>> adds), though any method you use is going to be limited by the need to
>>> do a seek before each read. Storing the data on an SSD will probably
>>> help a lot if you can afford it for your data size.
>>>
>>> On Thu, Jan 14, 2016 at 1:15 AM, Ryan R. Rosario <r...@bytemining.com> wrote:
>>> > Hi,
>>> >
>>> > I have a very large dictionary that must be shared across processes
>>> > and does not fit in RAM. I need access to this object to be fast.
>>> > The key is an integer ID and the value is a list containing two
>>> > elements, both of them numpy arrays (one has ints, the other has
>>> > floats). The keys are sequential, start at 0, and have no gaps, so
>>> > the "outer" layer of this data structure could really just be a list
>>> > with the key being the index. The lengths of each pair of arrays may
>>> > differ across keys.
>>> >
>>> > For a visual:
>>> >
>>> > {
>>> >   key=0: [
>>> >     numpy.array([1, 8, 15, ..., 16000]),
>>> >     numpy.array([0.1, 0.1, 0.1, ..., 0.1])
>>> >   ],
>>> >   key=1: [
>>> >     numpy.array([5, 6]),
>>> >     numpy.array([0.5, 0.5])
>>> >   ],
>>> >   ...
>>> > }
>>> >
>>> > I've tried:
>>> > - manager proxy objects, but the object was so big that low-level
>>> >   code threw an exception due to its format, and monkey-patching
>>> >   wasn't successful.
>>> > - Redis, which was far too slow due to connection setup, data
>>> >   conversion, etc.
>>> > - NumPy rec arrays + memory mapping, but there is a restriction that
>>> >   the numpy arrays in each "column" must be of a fixed, identical
>>> >   size (see the memmap sketch at the end of this thread for a
>>> >   workaround).
>>> > - I looked at PyTables, which may be a solution, but it seems to
>>> >   have a very steep learning curve.
>>> > - I haven't tried SQLite3, but I am worried about the time it takes
>>> >   to query the DB for a sequential ID and then translate byte arrays.
>>> >
>>> > Any ideas? I greatly appreciate any guidance you can provide.
>>> >
>>> > Thanks,
>>> > Ryan
>>>
>>> --
>>> Nathaniel J. Smith -- http://vorpus.org
>>
>> --
>> Francesc Alted
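Connecting Nathaniel's h5py suggestion to Ryan's structure, one hypothetical layout gives each integer key its own group holding the two variable-length arrays (the file, group, and dataset names are invented):

import h5py
import numpy as np

# One group per integer key; each group holds the pair of arrays.
# Lengths can differ freely across keys, unlike with rec arrays.
with h5py.File("bigdict.h5", "w") as f:
    f.create_group("0")
    f["0"].create_dataset("ids",  data=np.array([1, 8, 15, 16000]))
    f["0"].create_dataset("vals", data=np.full(4, 0.1))
    f.create_group("1")
    f["1"].create_dataset("ids",  data=np.array([5, 6]))
    f["1"].create_dataset("vals", data=np.array([0.5, 0.5]))

# Random access from any process: one seek + read per key, as Nathaniel notes.
with h5py.File("bigdict.h5", "r") as f:
    value = [f["1/ids"][:], f["1/vals"][:]]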
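As a closing note, the usual workaround for the fixed-size restriction Ryan hit with rec arrays + memory mapping is to concatenate all per-key arrays into two flat memmapped files plus an offsets index. A sketch, assuming the data files already exist and that the two arrays of a key share a length (as in Ryan's visual; all file names and dtypes are assumptions):

import numpy as np

# offsets[k]:offsets[k+1] is the slice belonging to key k, so variable
# lengths across keys are fine even though the backing files are flat.
ids     = np.memmap("ids.dat",  dtype=np.int64,   mode="r")
vals    = np.memmap("vals.dat", dtype=np.float64, mode="r")
offsets = np.load("offsets.npy")

def lookup(key):
    lo, hi = offsets[key], offsets[key + 1]
    return [ids[lo:hi], vals[lo:hi]]   # zero-copy views into the memmaps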