Hi Ryan,

Did you consider packing the arrays into one (or two) giant arrays stored with mmap?
That way you only need to store the start and end offsets, and there is no need to use a dictionary. It may allow you to simplify some numerical operations as well.

To be more specific:

    start : numpy.intp
    end   : numpy.intp
    data1 : numpy.int32
    data2 : numpy.float64

Then your original dictionary accesses can be rewritten as

    data1[start[key]:end[key]]
    data2[start[key]:end[key]]

Whether to wrap this as a dictionary-like object is just a matter of taste -- depending on whether you like it raw or fine. (A rough sketch of such a wrapper is below, after the quoted message.)

If you need to apply some global transformation to the data, then something like

    data2[...] *= 10

would work. ufunc.reduceat(data1, ...) can be very useful as well (with some tricks on start/end).

I was facing a similar issue a few years ago, and you may want to look at this code (it wasn't very well written, I have to admit):

https://github.com/rainwoodman/gaepsi/blob/master/gaepsi/tools/__init__.py#L362

Best,

- Yu

On Thu, Jan 14, 2016 at 1:15 AM, Ryan R. Rosario <r...@bytemining.com> wrote:
> Hi,
>
> I have a very large dictionary that must be shared across processes and
> does not fit in RAM. I need access to this object to be fast. The key is
> an integer ID and the value is a list containing two elements, both of
> them numpy arrays (one has ints, the other has floats). The keys are
> sequential, start at 0, and there are no gaps, so the “outer” layer of
> this data structure could really just be a list with the key actually
> being the index. The lengths of each pair of arrays may differ across
> keys.
>
> For a visual:
>
> {
>     key=0:
>         [
>             numpy.array([1, 8, 15, …, 16000]),
>             numpy.array([0.1, 0.1, 0.1, …, 0.1])
>         ],
>     key=1:
>         [
>             numpy.array([5, 6]),
>             numpy.array([0.5, 0.5])
>         ],
>     …
> }
>
> I’ve tried:
> - manager proxy objects, but the object was so big that low-level code
>   threw an exception due to its format, and monkey-patching wasn’t
>   successful.
> - Redis, which was far too slow due to setting up connections, data
>   conversion, etc.
> - Numpy rec arrays + memory mapping, but there is a restriction that the
>   numpy arrays in each “column” must be of a fixed and identical size.
> - I looked at PyTables, which may be a solution, but it seems to have a
>   very steep learning curve.
> - I haven’t tried SQLite3, but I am worried about the time it takes to
>   query the DB for a sequential ID and then translate the byte arrays.
>
> Any ideas? I greatly appreciate any guidance you can provide.
>
> Thanks,
> Ryan
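To make the above concrete, here is a rough, untested sketch of the whole scheme. The PackedArrays class and the data1.bin / data2.bin file names are just placeholders I made up for illustration; adjust the dtypes and paths to your data:

    import numpy as np

    class PackedArrays(object):
        """Read-only, dict-like view over two packed, memory-mapped arrays."""
        def __init__(self, path1, path2, start, end):
            # Only the pages you actually touch are read from disk, so the
            # full data set never has to fit in RAM.
            self.data1 = np.memmap(path1, dtype=np.int32, mode='r')
            self.data2 = np.memmap(path2, dtype=np.float64, mode='r')
            self.start = start
            self.end = end

        def __len__(self):
            return len(self.start)

        def __getitem__(self, key):
            # Same result as the old d[key]: [int array, float array].
            sl = slice(self.start[key], self.end[key])
            return [self.data1[sl], self.data2[sl]]

    # One-time conversion from the original dictionary d (keys 0..n-1):
    n = len(d)
    lengths = np.array([len(d[k][0]) for k in range(n)], dtype=np.intp)
    end = np.cumsum(lengths)
    start = end - lengths
    np.concatenate([d[k][0] for k in range(n)]).astype(np.int32).tofile('data1.bin')
    np.concatenate([d[k][1] for k in range(n)]).astype(np.float64).tofile('data2.bin')

    store = PackedArrays('data1.bin', 'data2.bin', start, end)
    ints, floats = store[0]  # the same two arrays as d[0]

    # The reduceat trick: per-key sums in one vectorized call
    # (assumes no key has empty arrays).
    per_key_sums = np.add.reduceat(store.data2, start)

Since the memmaps are opened read-only, multiple processes can open the same files and share the pages through the OS page cache, which is exactly the kind of sharing you are after.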