Hi Ryan,

Did you consider packing the arrays into one (or rather, two) giant arrays stored with mmap?

That way you only need to store the start & end offsets, and there is
no need to use a dictionary.
It may allow you to simplify some numerical operations as well.
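
A rough sketch of the packing step, in case it helps -- the pack()
helper and the tiny example data are just mine for illustration, and
the names match the layout spelled out below:

import numpy

def pack(pairs):
    # pairs[key] == (int_array, float_array); keys are 0..N-1, no gaps.
    lengths = numpy.array([len(ints) for ints, _ in pairs],
                          dtype=numpy.intp)
    end = numpy.cumsum(lengths)
    start = end - lengths
    data1 = numpy.concatenate([ints for ints, _ in pairs]).astype(numpy.int32)
    data2 = numpy.concatenate([floats for _, floats in pairs]).astype(numpy.float64)
    return start, end, data1, data2

# Two keys, mirroring the visual in your mail.
pairs = [(numpy.array([1, 8, 15]), numpy.array([0.1, 0.1, 0.1])),
         (numpy.array([5, 6]), numpy.array([0.5, 0.5]))]
start, end, data1, data2 = pack(pairs)

# Write once; afterwards the two big arrays can be memory-mapped.
numpy.save('start.npy', start)
numpy.save('end.npy', end)
data1.tofile('data1.bin')
data2.tofile('data2.bin')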

To be more specific,

start : numpy.intp array     (offset where each key's slice begins)
end   : numpy.intp array     (offset where each key's slice ends)

data1 : numpy.int32 array    (all the int arrays, concatenated)
data2 : numpy.float64 array  (all the float arrays, concatenated)

Then your original access to the dictionary can be rewritten as

data1[start[key]:end[key]]
data2[start[key]:end[key]]
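
i.e., assuming the packed arrays were written out as in the sketch
above (the file names are made up), everything runs straight off
numpy.memmap:

import numpy

# The offsets are small, so keep them in memory; the two big arrays
# are memory-mapped, and nothing is read until a slice is touched.
start = numpy.load('start.npy')
end = numpy.load('end.npy')
data1 = numpy.memmap('data1.bin', dtype=numpy.int32, mode='r')
data2 = numpy.memmap('data2.bin', dtype=numpy.float64, mode='r')

key = 0
ints = data1[start[key]:end[key]]     # the int array for this key
floats = data2[start[key]:end[key]]   # the float array for this key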

Whether to wrap this up as a dictionary-like object is just a matter of
taste -- it depends on whether you like it raw or refined.
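
If you do want the sugar, a thin wrapper (just an illustration, nothing
official) takes only a few lines:

class PackedDict(object):
    """Read-only, dict-like view over the packed arrays."""

    def __init__(self, start, end, data1, data2):
        self.start, self.end = start, end
        self.data1, self.data2 = data1, data2

    def __len__(self):
        return len(self.start)

    def __getitem__(self, key):
        s, e = self.start[key], self.end[key]
        # Same shape of value as in the original dict:
        # [int array, float array].
        return [self.data1[s:e], self.data2[s:e]]

# d = PackedDict(start, end, data1, data2)
# d[0] -> [array([ 1,  8, 15]), array([ 0.1,  0.1,  0.1])]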

If you need to apply some global transformation to the data, then
something like data2[...] *= 10 would work.
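
One detail, assuming the file-backed setup above: the memmap has to be
opened writable (mode='r+') for the in-place update to stick, e.g.

import numpy

data2 = numpy.memmap('data2.bin', dtype=numpy.float64, mode='r+')
data2[...] *= 10   # modifies the data on disk, in place
data2.flush()      # make sure the change reaches the file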

ufunc.reduceat(data1, ...) can be very useful as well (with some
tricks on the start/end offsets).
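
One such trick, for instance: when the slices sit back to back
(end[k] == start[k+1]) and none of them is empty, the start offsets
alone delimit every segment, so a single call gives a per-key
reduction:

import numpy

per_key_sum = numpy.add.reduceat(data2, start)       # sum of each float slice
per_key_min = numpy.minimum.reduceat(data1, start)   # min of each int slice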

I was facing a similar issue a few years ago; you may want to look
at this code (it wasn't very well written, I have to admit):

https://github.com/rainwoodman/gaepsi/blob/master/gaepsi/tools/__init__.py#L362

Best,

- Yu

On Thu, Jan 14, 2016 at 1:15 AM, Ryan R. Rosario <r...@bytemining.com> wrote:
> Hi,
>
> I have a very large dictionary that must be shared across processes and does 
> not fit in RAM. I need access to this object to be fast. The key is an 
> integer ID and the value is a list containing two elements, both of them 
> numpy arrays (one has ints, the other has floats). The key is sequential, 
> starts at 0, and there are no gaps, so the “outer” layer of this data 
> structure could really just be a list with the key actually being the index. 
> The lengths of each pair of arrays may differ across keys.
>
> For a visual:
>
> {
> key=0:
>         [
>                 numpy.array([1,8,15,…, 16000]),
>                 numpy.array([0.1,0.1,0.1,…,0.1])
>         ],
> key=1:
>         [
>                 numpy.array([5,6]),
>                 numpy.array([0.5,0.5])
>         ],
> …
> }
>
> I’ve tried:
> -       manager proxy objects, but the object was so big that low-level code 
> threw an exception due to format and monkey-patching wasn’t successful.
> -       Redis, which was far too slow due to setting up connections and data 
> conversion etc.
> -       Numpy rec arrays + memory mapping, but there is a restriction that 
> the numpy arrays in each “column” must be of fixed and same size.
> -       I looked at PyTables, which may be a solution, but seems to have a 
> very steep learning curve.
> -       I haven’t tried SQLite3, but I am worried about the time it takes to 
> query the DB for a sequential ID, and then translate byte arrays.
>
> Any ideas? I greatly appreciate any guidance you can provide.
>
> Thanks,
> Ryan
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion