Hi Ken,

On Tuesday 21 July 2009 23:07:28, Kenneth Arnold wrote:
> (Re-raising an issue that was brought up last year: [1])
>
> Since the Enthought webinar on memmap-ing numpy arrays[2] suggested
> PyTables for creating new files (see slide 30 at [3]), I assumed by
> association that PyTables mem-mapped the data also. I switched an
> algorithm that kept data in memory over to use PyTables, and sure
> enough memory usage dropped dramatically, but now coming back to it, I
> find that performance took a big hit. Upon closer investigation, no,
> PyTables doesn't mmap. Oops.
>
> (Use case: we have a read-only matrix that's an array of vectors.
> Given a probe vector, we want to find the top n vectors closest to it,
> measured by dot product. numpy's dot function does exactly what we
> want. But this runs in a multiprocess server, and these matrices are
> largeish, so I thought memmap would be a good way to let the OS handle
> sharing the matrix between the processes.)
>
> (Array _columns_ are stored contiguously, right?)

No.  PyTables (like HDF5, NumPy, and the C world in general) stores arrays 
in a row-wise manner, not column-wise (unlike, for example, Matlab, which is 
based on Fortran).
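
For instance, here is a quick NumPy check of what the row-major layout 
implies:

  import numpy as np

  a = np.arange(12).reshape(3, 4)
  print(a.flags['C_CONTIGUOUS'])        # True: rows are contiguous
  print(a[:, 0].flags['C_CONTIGUOUS'])  # False: a column is strided

So reading whole rows is the fast access pattern, while reading a column 
touches memory (or disk) with a stride.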

>
> Since PyTables doesn't currently do what I thought it did, we'll
> probably move to using memmapped ndarrays directly, as the webinar
> describes. But the natural question is, could PyTables possibly do
> what I thought it could? It might be very hard to handle compressed
> data, but uncompressed data seems possible; if the data is contiguous
> in the HDF5 file, all we really need is a way to get that data in
> memory, or at least its offset into the file. Poking around the HDF5
> api[4], I don't see an obvious way to do that, but I do wonder if
> anyone has given it any thought.

Well, if done properly, I/O in PyTables should not take much longer than 
numpy.memmap (in fact, it can be faster on many occasions).  You just need to 
read/write arrays along the contiguous direction, i.e. the rightmost 
dimension among those orthogonal to the 'main' dimension (in PyTables 
jargon).
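
For your dot-product use case, something along these lines should keep the 
I/O sequential (an untested sketch; the file name 'data.h5' and the node 
'/vectors' are made up for illustration):

  import numpy as np
  import tables as tb

  # Scan a large on-disk array block-by-block along the main (first)
  # dimension, so that each read fetches a set of contiguous rows.
  f = tb.open_file('data.h5', 'r')      # openFile() in the 2.x API
  vectors = f.root.vectors              # shape (nvec, ndim)
  probe = np.random.rand(vectors.shape[1])
  scores = np.empty(vectors.shape[0])
  nblock = 10000                        # rows per I/O block
  for start in range(0, vectors.shape[0], nblock):
      block = vectors[start:start + nblock]    # contiguous row read
      scores[start:start + len(block)] = np.dot(block, probe)
  top10 = np.argsort(scores)[-10:][::-1]       # best matches first
  f.close()

This way HDF5 reads whole blocks of rows at a time, and the dot products are 
done in-memory on each block.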

For PyTables 2.2 I have an expression evaluator (tables.Expr) in the works 
that will be able to compute element-wise expressions over arbitrarily large 
arrays.  It cannot deal with linear algebra yet, but with some patience, this 
could be done efficiently too.
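
The idea is that you will be able to write things like the following (just a 
sketch of the intended usage; the final API details may change before 2.2 is 
out):

  import numpy as np
  import tables as tb

  # Evaluate an element-wise expression block-by-block, without
  # materializing temporaries for the whole operands at once.
  a = np.arange(1e6)
  b = np.arange(1e6)
  expr = tb.Expr('2*a + b**2', uservars={'a': a, 'b': b})
  result = expr.eval()   # a NumPy array with the result

and the operands could just as well be PyTables arrays living on disk.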

See the post at [1] and my answers there for how to efficiently operate on 
large PyTables arrays on-disk.  Also, I'm working on a new compressor [2] 
that should accelerate computations with respect to Zlib or LZO.  Hopefully, 
it will be out with the forthcoming PyTables 2.2.
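
Meanwhile, compression with the existing codecs is just a matter of passing a 
Filters instance when creating the array, e.g. (hypothetical file and node 
names):

  import numpy as np
  import tables as tb

  # A chunked, compressed array; complib can be 'zlib' or 'lzo' today.
  filters = tb.Filters(complevel=5, complib='zlib')
  f = tb.open_file('compressed.h5', 'w')   # openFile() in the 2.x API
  ca = f.create_carray(f.root, 'vectors', tb.Float64Atom(),
                       shape=(100000, 64), filters=filters)
  ca[:1000] = np.random.rand(1000, 64)     # writes are compressed chunkwise
  f.close()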

[1] http://www.mail-archive.com/numpy-discuss...@scipy.org/msg18863.html
[2] http://www.euroscipy.org/presentations/abstracts/abstract_alted.html

Hope that helps,

-- 
Francesc Alted
