From what I know, this sounds like the use case that Dask is designed to solve. I think this blog post can help: https://www.continuum.io/content/xray-dask-out-core-labeled-arrays-python
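(Editorial aside: the core idea behind Dask's out-of-core arrays is chunked computation over data that does not fit in RAM. That pattern can be sketched with a plain NumPy memmap; this is just an illustrative toy of the idea, not how Dask is actually implemented, and Dask additionally parallelizes and schedules the chunks for you.)

```python
import os
import tempfile
import numpy as np

# Create a "large" on-disk array (stand-in for data that exceeds RAM).
path = os.path.join(tempfile.mkdtemp(), "big.dat")
n = 1_000_000
arr = np.memmap(path, dtype=np.float64, mode="w+", shape=(n,))
arr[:] = np.arange(n, dtype=np.float64)
arr.flush()

# Out-of-core reduction: stream over fixed-size chunks so that only one
# chunk needs to be resident in memory at a time.
chunk = 100_000
data = np.memmap(path, dtype=np.float64, mode="r", shape=(n,))
total = 0.0
for start in range(0, n, chunk):
    total += data[start:start + chunk].sum()

print(total == n * (n - 1) / 2)  # prints True: sum of 0..n-1
```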
Note that I haven't used any of these projects myself.

On Thu, Jan 14, 2016 at 11:48 AM, Francesc Alted <fal...@gmail.com> wrote:
> Well, maybe something like a simple class emulating a dictionary that
> stores a key-value pair on disk would be more than enough. Then you can
> use whatever persistence layer you want (even HDF5, but not
> necessarily).
>
> As a demonstration I did a quick and dirty implementation of such a
> persistent key-store thing
> (https://gist.github.com/FrancescAlted/8e87c8762a49cf5fc897). In it,
> the KeyStore class (less than 40 lines long) is responsible for storing
> the value (2 arrays) under a key (a directory). As I am quite a big fan
> of compression, I implemented a couple of serialization flavors: one
> using the .npz format (so no dependencies other than NumPy are needed)
> and the other using the ctable object from the bcolz package
> (bcolz.blosc.org). Here are some performance numbers:
>
> $ python key-store.py -f numpy -d __test -l 0
> ########## Checking method: numpy (via .npz files) ############
> Building database.  Wait please...
> Time (creation) --> 1.906
> Retrieving 100 keys in arbitrary order...
> Time (query) --> 0.191
> Number of elements out of getitem: 10518976
> faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
> 75M     __test
>
> So, with the NPZ format we can deal with the 75 MB quite easily. But
> NPZ can compress data as well, so let's see how it goes:
>
> $ python key-store.py -f numpy -d __test -l 9
> ########## Checking method: numpy (via .npz files) ############
> Building database.  Wait please...
> Time (creation) --> 6.636
> Retrieving 100 keys in arbitrary order...
> Time (query) --> 0.384
> Number of elements out of getitem: 10518976
> faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
> 28M     __test
>
> OK, in this case we get almost a 3x compression ratio, which is not
> bad. However, the performance has degraded a lot. Let's use bcolz now.
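(Editorial aside: a minimal dictionary-like store along the lines Francesc describes, with one .npz file per key, might look like the following. This is a hypothetical simplification for illustration, not the actual code from the gist; the class and method names are made up.)

```python
import os
import tempfile
import numpy as np


class KeyStore:
    """Toy persistent key -> [int array, float array] store, one .npz per key."""

    def __init__(self, path, compress=False):
        self.path = path
        self.compress = compress
        os.makedirs(path, exist_ok=True)

    def _fname(self, key):
        return os.path.join(self.path, "%d.npz" % key)

    def __setitem__(self, key, value):
        ints, floats = value
        # np.savez_compressed uses zlib, mirroring the -l 9 flavor above.
        save = np.savez_compressed if self.compress else np.savez
        save(self._fname(key), ints=ints, floats=floats)

    def __getitem__(self, key):
        with np.load(self._fname(key)) as f:
            return [f["ints"], f["floats"]]


# Usage: round-trip one key.
store = KeyStore(tempfile.mkdtemp(), compress=True)
store[0] = [np.array([1, 8, 15]), np.array([0.1, 0.1, 0.1])]
ints, floats = store[0]
```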
> First in non-compressed mode:
>
> $ python key-store.py -f bcolz -d __test -l 0
> ########## Checking method: bcolz (via ctable(clevel=0,
> cname='blosclz')) ############
> Building database.  Wait please...
> Time (creation) --> 0.479
> Retrieving 100 keys in arbitrary order...
> Time (query) --> 0.103
> Number of elements out of getitem: 10518976
> faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
> 82M     __test
>
> Without compression, bcolz takes a bit more (~10%) space than NPZ.
> However, bcolz is actually meant to be used with compression on by
> default:
>
> $ python key-store.py -f bcolz -d __test -l 9
> ########## Checking method: bcolz (via ctable(clevel=9,
> cname='blosclz')) ############
> Building database.  Wait please...
> Time (creation) --> 0.487
> Retrieving 100 keys in arbitrary order...
> Time (query) --> 0.098
> Number of elements out of getitem: 10518976
> faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh __test
> 29M     __test
>
> So, the final disk usage is quite similar to NPZ, but it can store and
> retrieve much faster. Also, the data decompression speed is on par with
> using no compression at all. This is because bcolz uses Blosc behind
> the scenes, which is much faster than zlib (used by NPZ) -- and
> sometimes faster than a memcpy(). However, even though we are doing I/O
> against the disk, this dataset is so small that it fits in the OS
> filesystem cache, so the benchmark is actually checking I/O at memory
> speeds, not disk speeds.
>
> In order to do a more realistic comparison, let's use a dataset that is
> much larger than the amount of memory in my laptop (8 GB):
>
> $ PYTHONPATH=. python key-store.py -f bcolz -m 1000000 -k 5000 -d /media/faltet/docker/__test -l 0
> ########## Checking method: bcolz (via ctable(clevel=0,
> cname='blosclz')) ############
> Building database.  Wait please...
> Time (creation) --> 133.650
> Retrieving 100 keys in arbitrary order...
> Time (query) --> 2.881
> Number of elements out of getitem: 91907396
> faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh /media/faltet/docker/__test
> 39G     /media/faltet/docker/__test
>
> And now, with compression on:
>
> $ PYTHONPATH=. python key-store.py -f bcolz -m 1000000 -k 5000 -d /media/faltet/docker/__test -l 9
> ########## Checking method: bcolz (via ctable(clevel=9,
> cname='blosclz')) ############
> Building database.  Wait please...
> Time (creation) --> 145.633
> Retrieving 100 keys in arbitrary order...
> Time (query) --> 1.339
> Number of elements out of getitem: 91907396
> faltet@faltet-Latitude-E6430:~/blosc/bcolz$ du -sh /media/faltet/docker/__test
> 12G     /media/faltet/docker/__test
>
> So, we are still seeing the 3x compression ratio. But the interesting
> thing here is that the compressed version runs more than twice as fast
> as the uncompressed one (13 ms/query vs 29 ms/query). In this case I
> was using an SSD (hence the low query times), so the compression
> advantage is even more noticeable than when reading from memory as
> above (as expected).
>
> Anyway, this is just a demonstration that you don't need heavy tools to
> achieve what you want. And as a corollary, (fast) compressors can save
> you not only storage, but processing time too.
>
> Francesc
>
>
> 2016-01-14 11:19 GMT+01:00 Nathaniel Smith <n...@pobox.com>:
>> I'd try storing the data in HDF5 (probably via h5py, which is a more
>> basic interface without all the bells and whistles that PyTables
>> adds), though any method you use is going to be limited by the need to
>> do a seek before each read. Storing the data on an SSD will probably
>> help a lot if you can afford it for your data size.
>>
>> On Thu, Jan 14, 2016 at 1:15 AM, Ryan R. Rosario <r...@bytemining.com> wrote:
>>> Hi,
>>>
>>> I have a very large dictionary that must be shared across processes
>>> and does not fit in RAM. I need access to this object to be fast.
>>> The key is an integer ID and the value is a list containing two
>>> elements, both of them numpy arrays (one has ints, the other has
>>> floats). The keys are sequential, start at 0, and have no gaps, so
>>> the "outer" layer of this data structure could really just be a list
>>> with the key being the index. The lengths of each pair of arrays may
>>> differ across keys.
>>>
>>> For a visual:
>>>
>>> {
>>>   key=0: [
>>>     numpy.array([1, 8, 15, ..., 16000]),
>>>     numpy.array([0.1, 0.1, 0.1, ..., 0.1])
>>>   ],
>>>   key=1: [
>>>     numpy.array([5, 6]),
>>>     numpy.array([0.5, 0.5])
>>>   ],
>>>   ...
>>> }
>>>
>>> I've tried:
>>> - manager proxy objects, but the object was so big that low-level
>>>   code threw an exception due to format, and monkey-patching wasn't
>>>   successful.
>>> - Redis, which was far too slow due to setting up connections, data
>>>   conversion, etc.
>>> - numpy rec arrays + memory mapping, but there is a restriction that
>>>   the numpy arrays in each "column" must be of fixed and equal size.
>>> - I looked at PyTables, which may be a solution, but it seems to have
>>>   a very steep learning curve.
>>> - I haven't tried SQLite3, but I am worried about the time it takes
>>>   to query the DB for a sequential ID, and then to translate byte
>>>   arrays.
>>>
>>> Any ideas? I greatly appreciate any guidance you can provide.
>>>
>>> Thanks,
>>> Ryan
>>> _______________________________________________
>>> NumPy-Discussion mailing list
>>> NumPy-Discussion@scipy.org
>>> https://mail.scipy.org/mailman/listinfo/numpy-discussion
>>
>> --
>> Nathaniel J. Smith -- http://vorpus.org
>
> --
> Francesc Alted
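(Editorial aside: the fixed-size restriction Ryan ran into with memory mapping can be worked around by concatenating all the variable-length arrays into one flat on-disk array per dtype, plus a separate offsets index; since the keys are sequential integers starting at 0, lookup is then just two slices. Below is a hedged, NumPy-only sketch; the names `build_store` and `RaggedStore` are made up for illustration, and it assumes, as in Ryan's visual, that the two arrays of a pair share a length -- otherwise keep two offset indexes.)

```python
import os
import tempfile
import numpy as np


def build_store(dirname, values):
    """Write ragged (ints, floats) pairs as flat binary files + offsets index."""
    os.makedirs(dirname, exist_ok=True)
    # offsets[i]:offsets[i+1] delimits the slice belonging to key i.
    offsets = np.zeros(len(values) + 1, dtype=np.int64)
    for i, (ints, _) in enumerate(values):
        offsets[i + 1] = offsets[i] + len(ints)
    np.concatenate([v[0] for v in values]).astype(np.int64).tofile(
        os.path.join(dirname, "ints.bin"))
    np.concatenate([v[1] for v in values]).astype(np.float64).tofile(
        os.path.join(dirname, "floats.bin"))
    np.save(os.path.join(dirname, "offsets.npy"), offsets)


class RaggedStore:
    """Read-only view: key i -> [int array, float array], zero-copy via memmap."""

    def __init__(self, dirname):
        self.offsets = np.load(os.path.join(dirname, "offsets.npy"))
        self.ints = np.memmap(os.path.join(dirname, "ints.bin"),
                              dtype=np.int64, mode="r")
        self.floats = np.memmap(os.path.join(dirname, "floats.bin"),
                                dtype=np.float64, mode="r")

    def __getitem__(self, i):
        lo, hi = self.offsets[i], self.offsets[i + 1]
        return [self.ints[lo:hi], self.floats[lo:hi]]


# Usage, mirroring Ryan's visual:
d = tempfile.mkdtemp()
build_store(d, [(np.array([1, 8, 15]), np.array([0.1, 0.1, 0.1])),
                (np.array([5, 6]), np.array([0.5, 0.5]))])
store = RaggedStore(d)
```

Because the memmaps are opened read-only and backed by the OS page cache, several processes can open the same files and share the physical pages, which addresses the cross-process sharing requirement without running a server.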