Hi Ken,

On Tuesday 21 July 2009 23:07:28, Kenneth Arnold wrote:
> (Re-raising an issue that was brought up last year: [1])
>
> Since the Enthought webinar on memmap-ing numpy arrays [2] suggested
> PyTables for creating new files (see slide 30 at [3]), I assumed by
> association that PyTables mem-mapped the data also. I switched an
> algorithm that kept data in memory over to use PyTables, and sure
> enough memory usage dropped dramatically, but now, coming back to it, I
> find that performance took a big hit. Upon closer investigation, no,
> PyTables doesn't mmap. Oops.
>
> (Use case: we have a read-only matrix that's an array of vectors.
> Given a probe vector, we want to find the top n vectors closest to it,
> measured by dot product. numpy's dot function does exactly what we
> want. But this runs in a multiprocess server, and these matrices are
> largeish, so I thought memmap would be a good way to let the OS handle
> sharing the matrix between the processes.)
>
> (Array _columns_ are stored contiguously, right?)
No. PyTables (like HDF5, NumPy, and the C world in general) stores arrays in a row-wise manner, not column-wise (unlike, for example, Matlab, which is Fortran-based).

> Since PyTables doesn't currently do what I thought it did, we'll
> probably move to using memmapped ndarrays directly, as the webinar
> describes. But the natural question is, could PyTables possibly do
> what I thought it could? It might be very hard to handle compressed
> data, but uncompressed data seems possible; if the data is contiguous
> in the HDF5 file, all we really need is a way to get that data in
> memory, or at least its offset into the file. Poking around the HDF5
> api [4], I don't see an obvious way to do that, but I do wonder if
> anyone has given it any thought.

Well, if done properly, I/O in PyTables should not take much more time than numpy.memmap (in fact, it can be faster on many occasions). You just need to read/write arrays following the contiguous direction, i.e. the rightmost dimension among those orthogonal to the 'main' dimension (in PyTables jargon).

For PyTables 2.2 I have an expression evaluator (tables.Expr) in the works that will be able to compute element-wise expressions over arbitrarily large arrays. It cannot deal with linear algebra yet, but with some patience this could be done efficiently too. See the post at [1] and my answers there for how to efficiently operate with large PyTables arrays on-disk.

Also, I'm working on a new compressor [2] that should accelerate computations with respect to Zlib or LZO. Hopefully it will be out in the forthcoming PyTables 2.2.
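To make the "read in the contiguous direction" advice concrete, here is a minimal sketch of the top-n dot-product use case using a memory-mapped ndarray scanned in row blocks, so the OS streams contiguous data from disk. The function name, file name, dtype, and shapes below are illustrative assumptions, not an API from PyTables or the webinar:

```python
import numpy as np

def top_n_dots(matrix, probe, n, block_rows=1024):
    """Return indices of the n rows of `matrix` with the largest dot
    product against `probe`.  The matrix is read in blocks of whole
    rows, i.e. along its contiguous (row-wise) direction, so a
    memory-mapped or on-disk array is scanned sequentially."""
    scores = np.empty(matrix.shape[0])
    for start in range(0, matrix.shape[0], block_rows):
        block = matrix[start:start + block_rows]  # contiguous row slab
        scores[start:start + len(block)] = block.dot(probe)
    # Sort descending by score and keep the n best row indices.
    return np.argsort(scores)[::-1][:n]

# Hypothetical usage with a memmapped file (names/shapes are made up):
# mat = np.memmap('vectors.dat', dtype='float64', mode='r',
#                 shape=(100000, 300))
# best = top_n_dots(mat, probe_vector, n=10)
```

The same blocked loop works against a PyTables array (e.g. an uncompressed CArray), since slicing `array[start:stop]` reads whole rows there too; only the source of `matrix` changes.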
[1] http://www.mail-archive.com/numpy-discuss...@scipy.org/msg18863.html
[2] http://www.euroscipy.org/presentations/abstracts/abstract_alted.html

Hope that helps,

--
Francesc Alted