On Wed, Jul 22, 2009 at 5:11 AM, Francesc Alted<fal...@pytables.org> wrote:
> On Tuesday 21 July 2009 23:07:28 Kenneth Arnold wrote:
>> (Use case: we have a read-only matrix that's an array of vectors.
>> Given a probe vector, we want to find the top n vectors closest to it,
>> measured by dot product. numpy's dot function does exactly what we
>> want. But this runs in a multiprocess server, and these matrices are
>> largeish, so I thought memmap would be a good way to let the OS handle
>> sharing the matrix between the processes.)
>>
>> (Array _columns_ are stored contiguously, right?)
>
> No.  PyTables (as HDF5, NumPy and all the C world in general) stores the
> arrays in a row-wise manner, not column-wise (like for example Matlab which is
> based on Fortran).

I meant that within a row, entries from different columns are stored
contiguously. So we vehemently agree, though your terminology is
probably better.
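
As a small illustration of what I mean by row-wise storage, here is a
sketch using a plain NumPy array. The shape and the float64 dtype are
just assumptions for the example, not our real data.

```python
import numpy as np

# Assumed toy shape: 4 rows of 50 float64 entries each.
a = np.zeros((4, 50), dtype=np.float64)

# C (row-major) order: stepping along a row moves one item (8 bytes),
# stepping to the next row moves a whole row (50 * 8 bytes).
row_major = a.flags['C_CONTIGUOUS']
strides = a.strides

# A single row is itself one contiguous block of memory.
row_is_contiguous = a[0].flags['C_CONTIGUOUS']

print(row_major, strides, row_is_contiguous)
```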

> Well, if done properly, I/O in PyTables should not take much more than
> numpy.memmap (in fact, it can be faster in many occasions).  You just need to
> read/write arrays following the contiguous direction, i.e. the most to the
> right among those orthogonal to the 'main' dimension (in PyTables jargon).

I think we are already reading along the contiguous direction.

The slow part of our code is a single line of Python:
`numpy.dot(matrix, vector)`. `matrix` is a tall (thousands by 50)
PyTables Array or CArray. `vector` is a row sliced out of it (1 by
50). PyTables performs extremely poorly in this case (70 seconds
compared to 0.04 seconds for NumPy ndarrays). Here is the profiler
output (%prun), Python 2.5, NumPy 1.3:

  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  285592   25.140    0.000   25.140    0.000 {method '_g_readSlice' of 'tables.hdf5Extension.Array' objects}
  285592   15.211    0.000   27.501    0.000 array.py:407(_interpret_indexing)
  285592    3.439    0.000   66.873    0.000 array.py:486(__getitem__)
  285592    3.164    0.000    8.295    0.000 leaf.py:425(_processRange)
  856776    2.902    0.000    4.487    0.000 utils.py:66(idx2long)
       1    2.876    2.876   69.749   69.749 {numpy.core.multiarray.dot}
  285592    2.719    0.000   27.859    0.000 array.py:565(_readSlice)
 1142368    2.257    0.000    2.257    0.000 utils.py:44(is_idx)
 1427963    1.782    0.000    1.782    0.000 {len}
  285592    1.636    0.000    4.492    0.000 flavor.py:344(conv_to_numpy)
  285592    1.261    0.000    1.261    0.000 numeric.py:180(asarray)
  285592    1.232    0.000    5.724    0.000 flavor.py:110(array_of_flavor2)
  571186    1.130    0.000    1.130    0.000 {isinstance}
  285592    1.069    0.000    2.329    0.000 flavor.py:353(_conv_numpy_to_numpy)
  285592    0.920    0.000    6.644    0.000 flavor.py:130(flavor_to_flavor)
  285592    0.896    0.000    7.540    0.000 flavor.py:150(internal_to_flavor)
  285592    0.644    0.000    0.644    0.000 {tables.utilsExtension.getIndices}
  285592    0.535    0.000    0.535    0.000 leaf.py:243(<lambda>)
  285593    0.527    0.000    0.527    0.000 {hasattr}
  285592    0.411    0.000    0.411    0.000 {method 'append' of 'list' objects}
       1    0.000    0.000   69.750   69.750 labeled_view.py:242(dot)
       5    0.000    0.000    0.000    0.000 array.py:123(_getnrows)
       1    0.000    0.000    0.000    0.000 labeled_view.py:93(__init__)

(labeled_view.py is the only part that's our code, and it only calls
numpy.core.multiarray.dot once.)
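
To make the profile easier to read: the 285592 calls to `__getitem__`
under a single `dot` call suggest that NumPy is treating the PyTables
object as a generic Python sequence and pulling it in one row at a
time. The sketch below reproduces that pattern with NumPy alone.
`RowAtATime` is a hypothetical stand-in made up for illustration, not a
PyTables class, and the sizes are arbitrary.

```python
import numpy as np


class RowAtATime(object):
    """Hypothetical stand-in for an on-disk array (not a PyTables class).

    It only supports len() and indexing, and counts every index call.
    """

    def __init__(self, data):
        self._data = data
        self.getitem_calls = 0

    def __len__(self):
        return len(self._data)

    def __getitem__(self, key):
        self.getitem_calls += 1
        # Out-of-range integer keys raise IndexError, as a sequence should.
        return self._data[key].copy()


n_rows, n_cols = 1000, 50            # assumed sizes, much smaller than ours
data = np.random.rand(n_rows, n_cols)
vector = data[0]                     # probe vector: one row of the matrix

# Slow path: dot() has to convert the wrapper, one row per __getitem__ call.
slow = RowAtATime(data)
slow_result = np.dot(slow, vector)

# Fast path: fetch everything with one slice, then multiply in memory.
fast = RowAtATime(data)
fast_result = np.dot(fast[:], vector)

print(slow.getitem_calls, fast.getitem_calls)
```

If the same thing is happening with PyTables, reading the data into an
ndarray first (for example with `matrix[:]` or `matrix.read()`) should
avoid the per-row overhead, at the cost of holding a full copy in each
process.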

-Ken
