On Nov 8, 2007 10:13 PM, Hans Meine <[EMAIL PROTECTED]> wrote:
> On Thursday, 08 November 2007 13:44:59, David Cournapeau wrote:
> > Hans Meine wrote:
> > > I wonder why simple elementwise operations like "a * 2" or "a + 1" are
> > > not performed in order of increasing memory addresses in order to exploit
> > > CPU caches etc. - as it is now, their speed drops by a factor of around 3
> > > simply by transpose()ing.
> >
> > Because it is not trivial to do so in all cases, I guess. It is a
> > problem which comes back from time to time on the ML, but AFAIK, nobody has
> > a fix for it. Fundamentally, for many element-wise operations, either you
> > implement the thing for every possible case, or you abstract it
> > through an iterator, which gives you a decrease in performance in some
> > cases.
>
> OK, so far I have not looked at the NumPy internals, but I would expect
> something like the following:
>
> [elementwise operation on array 'a']
> 1) ii = [i for i, s in sorted(enumerate(a.strides), key=lambda t: -t[1])]
> 2) aprime = a.transpose(ii)  # aprime will be "as C-contiguous as it gets"
> 3) bprime = perform_operation(aprime)  # fast elementwise operation
> 4) jj = [ii.index(i) for i in range(bprime.ndim)]  # inverse of ii
> 5) b = bprime.transpose(jj)  # let result have the same order as the input

The problem is not F vs. C storage: for an element-wise operation, the storage order does not matter at all; you just apply the same function (perform_operation) over and over to every element of the array.
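[The five quoted steps can be turned into a runnable sketch; `perform_operation` stands in for any elementwise callable, and the function name here is illustrative, not a NumPy API:]

```python
import numpy as np

def elementwise_in_memory_order(a, op):
    """Apply elementwise `op` after permuting axes into decreasing-stride
    order, then restore the original axis order on the result."""
    # 1) sort axes by decreasing stride: "as C-contiguous as it gets"
    ii = [i for i, s in sorted(enumerate(a.strides), key=lambda t: -t[1])]
    # 2) a view with that axis order (no copy)
    aprime = a.transpose(ii)
    # 3) perform the elementwise operation on the reordered view
    bprime = op(aprime)
    # 4) invert the permutation ii
    jj = [ii.index(i) for i in range(bprime.ndim)]
    # 5) give the result the same axis order as the input
    return bprime.transpose(jj)

# A transposed (Fortran-ordered) view exercises the reordering:
a = np.arange(12).reshape(3, 4).transpose()
b = elementwise_in_memory_order(a, lambda x: x * 2)
assert np.array_equal(b, a * 2)
```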
But what if you have segmented buffers, non-aligned arrays, etc.? All of this has to be taken care of, and this means handling reference counts and other things which are always delicate to handle well... Or you just use the current situation, which lets Python handle it (through PyObject_Call and a callable, at the C level).

I agree that in some cases it would be useful to have better performance: in particular, having fast code paths for the common cases (aligned and single-segment) would be nice. One example is the PyArray_Clip function (in multiarray.c), whose speed was too slow for my taste in my use cases: the general idea is to get to the point where you have a single C buffer, at which point you can call a trivial C function (fast_clip). As you can see, this is not trivial (~150 lines of C code): since the point is to go faster, you really want to avoid copying things, which means you have to be pretty careful. Speed-wise, it is definitely worth it, though: I don't remember the exact numbers, but in some cases (which are very common) it was several times faster.

> Apart from the fact that this is more or less pseudo-code and that the last
> step brings a behaviour change (which is intended, but probably not very
> welcome), am I overlooking a problem?
>
> > > Similarly (but even less logical), copy() and even the
> > > constructor are affected (yes, I understand that copy() creates
> > > contiguous arrays, but shouldn't it respect/retain the order
> > > nevertheless?):
> >
> > I don't see why it is illogical: when you do a copy, you don't preserve
> > memory layout, and so a simple memcpy of the whole buffer is not possible.
>
> As I wrote above, I don't think this is good. A Fortran-order contiguous
> array is still contiguous, and not inferior in any way to C-order arrays.
> So I actually expect copy() to return an array of unchanged order.
> Maybe this behaviour was kept for compatibility with numarray?
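[The fast-path idea described above for PyArray_Clip can be sketched in Python; `clip_fast_path` and its structure are illustrative, not NumPy's actual C implementation:]

```python
import numpy as np

def clip_fast_path(a, lo, hi):
    """If the input is a single aligned, C-contiguous buffer, operate on
    the flat memory directly; otherwise fall back to the general path
    that handles arbitrary strides."""
    out = np.empty_like(a)
    if a.flags['C_CONTIGUOUS'] and a.flags['ALIGNED']:
        # common case: one contiguous buffer -> a trivial flat loop,
        # analogous to handing a single C buffer to fast_clip
        flat_in = a.ravel()      # a view, not a copy, for contiguous input
        flat_out = out.ravel()
        np.minimum(np.maximum(flat_in, lo), hi, out=flat_out)
    else:
        # general case: let NumPy's iterator machinery handle the strides
        np.minimum(np.maximum(a, lo), hi, out=out)
    return out
```

The point of the C version is exactly this branch: reaching the first case without copying is what takes the ~150 lines of careful code.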
If you look at the docstring, it says that copy may not return the same order as its input. It kind of makes sense to me: C order is the default of many numpy operations.

David

_______________________________________________
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion
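[The copy() behaviour under discussion can be checked directly; note that the `order='K'` option shown for contrast is a later NumPy addition and was not available at the time of this thread:]

```python
import numpy as np

a = np.asfortranarray(np.ones((3, 4)))  # Fortran-ordered input
assert a.flags['F_CONTIGUOUS']

# ndarray.copy() defaults to C order, so the memory layout changes,
# which is the behaviour Hans objects to:
b = a.copy()
assert b.flags['C_CONTIGUOUS']

# order='K' keeps the layout as close as possible to the source:
c = a.copy(order='K')
assert c.flags['F_CONTIGUOUS']
```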