Hi,

I have a very large dictionary that must be shared across processes and does 
not fit in RAM, and I need access to it to be fast. Each key is an integer 
ID and each value is a two-element list of numpy arrays (one of ints, the 
other of floats). The keys are sequential, start at 0, and have no gaps, so 
the “outer” layer of this data structure could really just be a list, with 
the key serving as the index. Within a key the two arrays have the same 
length, but that length varies from key to key.

For a visual:

{
    0: [
        numpy.array([1, 8, 15, …, 16000]),
        numpy.array([0.1, 0.1, 0.1, …, 0.1])
    ],
    1: [
        numpy.array([5, 6]),
        numpy.array([0.5, 0.5])
    ],
    …
}
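
For concreteness, here is a tiny runnable version of the same structure with 
the outer layer as a plain list, since the sequential key can just be the 
index (toy data, obviously nothing like my real sizes):

import numpy as np

# Sequential, gap-free keys mean the dict can collapse into a list:
# the ID is simply the position.
data = [
    [np.array([1, 8, 15, 16000]), np.array([0.1, 0.1, 0.1, 0.1])],  # ID 0
    [np.array([5, 6]), np.array([0.5, 0.5])],                       # ID 1
]

ints, floats = data[1]  # O(1) lookup by ID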

I’ve tried or considered the following (rough sketches of each follow the 
list):

- Manager proxy objects, but the object was so big that low-level 
serialization code threw a format-related exception, and monkey-patching 
around it wasn’t successful.
- Redis, which was far too slow, largely due to connection setup and 
data-conversion overhead on every access.
- Numpy rec arrays plus memory mapping, but those require the arrays in 
each “column” to have the same fixed size, which rules out my 
variable-length rows.
- PyTables, which may be a solution, but it seems to have a very steep 
learning curve, so I’ve only looked at it so far.
- SQLite3, which I haven’t tried yet: I’m worried about the time it takes 
to query the DB for each sequential ID and then translate the byte arrays 
back into numpy arrays.
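
Roughly what the manager attempt looked like (a minimal sketch with toy 
data; the real dict is orders of magnitude larger, which is where it fell 
over):

import numpy as np
from multiprocessing import Manager, Process

def worker(shared):
    # Every access goes through the proxy, i.e. pickling plus IPC.
    print(shared[0])

if __name__ == "__main__":
    manager = Manager()
    shared = manager.dict()
    shared[0] = [np.array([1, 8, 15]), np.array([0.1, 0.1, 0.1])]
    p = Process(target=worker, args=(shared,))
    p.start()
    p.join()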
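
The Redis attempt boiled down to round-tripping every array through raw 
bytes, roughly like this (assumes a local server and the redis-py package; 
the key-naming scheme here is just illustrative):

import numpy as np
import redis

r = redis.Redis()  # assumes a server on localhost:6379

# Store: one Redis key per (ID, column), array serialized to raw bytes.
ints = np.array([5, 6], dtype=np.int64)
r.set("row:1:ints", ints.tobytes())

# Load: bytes back to an array; the dtype has to be known out of band.
# This per-access conversion overhead is what made it too slow for me.
restored = np.frombuffer(r.get("row:1:ints"), dtype=np.int64)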
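
The rec array + memmap restriction, in case it isn’t clear: a structured 
dtype fixes every field’s shape up front, identically for all rows, so 
variable-length rows don’t fit. Minimal sketch:

import numpy as np

# Each field gets a fixed shape, the same for every row -- fine for
# rectangular data, but my row lengths vary from ID to ID.
dt = np.dtype([("ints", np.int64, (4,)), ("floats", np.float64, (4,))])
mm = np.memmap("rows.dat", dtype=dt, mode="w+", shape=(2,))
mm[0]["ints"] = [1, 8, 15, 16000]
mm[0]["floats"] = [0.1, 0.1, 0.1, 0.1]
mm.flush()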
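
For PyTables, from skimming the docs the variable-length array (VLArray) 
seems closest to my layout; this is what I think it would look like, though 
I haven’t actually tried it:

import numpy as np
import tables

# One VLArray per "column"; rows may have different lengths.
with tables.open_file("rows.h5", mode="w") as f:
    vl_ints = f.create_vlarray(f.root, "ints", tables.Int64Atom())
    vl_ints.append(np.array([1, 8, 15, 16000]))  # ID 0
    vl_ints.append(np.array([5, 6]))             # ID 1

with tables.open_file("rows.h5", mode="r") as f:
    row1 = f.root.ints[1]  # -> array([5, 6]), read from disk on demand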
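
And for SQLite3, what I have in mind (but haven’t benchmarked) would look 
roughly like the following, with each array stored as a BLOB; the per-lookup 
query plus the frombuffer conversion is exactly the cost I’m worried about:

import sqlite3
import numpy as np

conn = sqlite3.connect("rows.db")
conn.execute("CREATE TABLE IF NOT EXISTS rows "
             "(id INTEGER PRIMARY KEY, ints BLOB, floats BLOB)")

a = np.array([5, 6], dtype=np.int64)
b = np.array([0.5, 0.5], dtype=np.float64)
conn.execute("INSERT OR REPLACE INTO rows VALUES (?, ?, ?)",
             (1, a.tobytes(), b.tobytes()))
conn.commit()

# One query plus two byte-to-array conversions per lookup.
row = conn.execute("SELECT ints, floats FROM rows WHERE id = ?",
                   (1,)).fetchone()
ints = np.frombuffer(row[0], dtype=np.int64)
floats = np.frombuffer(row[1], dtype=np.float64)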

Any ideas? I greatly appreciate any guidance you can provide.

Thanks,
Ryan