On Wed, Jul 22, 2009 at 5:11 AM, Francesc Alted <fal...@pytables.org> wrote:
> On Tuesday 21 July 2009 23:07:28, Kenneth Arnold wrote:
>> (Use case: we have a read-only matrix that's an array of vectors.
>> Given a probe vector, we want to find the top n vectors closest to it,
>> measured by dot product. numpy's dot function does exactly what we
>> want. But this runs in a multiprocess server, and these matrices are
>> largeish, so I thought memmap would be a good way to let the OS handle
>> sharing the matrix between the processes.)
>>
>> (Array _columns_ are stored contiguously, right?)
>
> No. PyTables (as HDF5, NumPy, and the C world in general) stores
> arrays in a row-wise manner, not column-wise (as, for example, Matlab,
> which is based on Fortran, does).
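(To make the use case concrete, the lookup boils down to something like
the sketch below. The names `matrix`, `probe`, and `top_n_by_dot` are
just stand-ins for illustration, not our actual code:)

    import numpy as np

    def top_n_by_dot(matrix, probe, n):
        # One matrix-vector product scores every stored row against the probe.
        scores = np.dot(matrix, probe)
        # Sort descending and keep the indices of the n best-scoring rows.
        best = np.argsort(scores)[::-1][:n]
        return best, scores[best]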
On the column question: I meant that within a row, entries from different
columns are stored contiguously. So we vehemently agree, though your
terminology is probably better.

> Well, if done properly, I/O in PyTables should not take much more than
> numpy.memmap (in fact, it can be faster on many occasions). You just need
> to read/write arrays following the contiguous direction, i.e. the rightmost
> dimension among those orthogonal to the 'main' dimension (in PyTables jargon).

I think we are reading in the contiguous direction. The slow part of our
code is a single line of Python: `numpy.dot(matrix, vector)`. `matrix` is a
tall (thousands by 50) PyTables Array or CArray. `vector` is a row sliced
out of it (1 by 50). PyTables performs extremely poorly in this case
(70 seconds, compared to 0.04 seconds for NumPy ndarrays).

Here is the profiler output (%prun) under Python 2.5 and NumPy 1.3:

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    285592   25.140    0.000   25.140    0.000 {method '_g_readSlice' of 'tables.hdf5Extension.Array' objects}
    285592   15.211    0.000   27.501    0.000 array.py:407(_interpret_indexing)
    285592    3.439    0.000   66.873    0.000 array.py:486(__getitem__)
    285592    3.164    0.000    8.295    0.000 leaf.py:425(_processRange)
    856776    2.902    0.000    4.487    0.000 utils.py:66(idx2long)
         1    2.876    2.876   69.749   69.749 {numpy.core.multiarray.dot}
    285592    2.719    0.000   27.859    0.000 array.py:565(_readSlice)
   1142368    2.257    0.000    2.257    0.000 utils.py:44(is_idx)
   1427963    1.782    0.000    1.782    0.000 {len}
    285592    1.636    0.000    4.492    0.000 flavor.py:344(conv_to_numpy)
    285592    1.261    0.000    1.261    0.000 numeric.py:180(asarray)
    285592    1.232    0.000    5.724    0.000 flavor.py:110(array_of_flavor2)
    571186    1.130    0.000    1.130    0.000 {isinstance}
    285592    1.069    0.000    2.329    0.000 flavor.py:353(_conv_numpy_to_numpy)
    285592    0.920    0.000    6.644    0.000 flavor.py:130(flavor_to_flavor)
    285592    0.896    0.000    7.540    0.000 flavor.py:150(internal_to_flavor)
    285592    0.644    0.000    0.644    0.000 {tables.utilsExtension.getIndices}
    285592    0.535    0.000    0.535    0.000 leaf.py:243(<lambda>)
    285593    0.527    0.000    0.527    0.000 {hasattr}
    285592    0.411    0.000    0.411    0.000 {method 'append' of 'list' objects}
         1    0.000    0.000   69.750   69.750 labeled_view.py:242(dot)
         5    0.000    0.000    0.000    0.000 array.py:123(_getnrows)
         1    0.000    0.000    0.000    0.000 labeled_view.py:93(__init__)

(labeled_view.py is the only part that's our code, and it only calls
numpy.core.multiarray.dot once.)

-Ken
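P.S. For completeness, the workaround we can apply on our side is to pull
the leaf into an in-memory ndarray before calling dot, which avoids the
per-element `__getitem__` calls that dominate the profile above. A rough
sketch, with an invented file name and node (and using the PyTables 2.x
`openFile` spelling):

    import numpy as np
    import tables

    h5 = tables.openFile('vectors.h5')   # hypothetical file
    matrix = h5.root.matrix              # the tall (thousands x 50) Array/CArray
    vector = matrix[0]                   # slicing one row already yields an ndarray

    # Slow path: numpy.dot indexes the tables.Array element by element
    # (hence the 285592 __getitem__ / _g_readSlice calls in the profile).
    # scores_slow = np.dot(matrix, vector)

    # Faster path: read the whole leaf into memory once, then let numpy do the dot.
    in_memory = matrix.read()
    scores = np.dot(in_memory, vector)

    h5.close()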