Hans Meine wrote:
> Hi!
>
> I wonder why simple elementwise operations like "a * 2" or "a + 1" are not
> performed in order of increasing memory addresses in order to exploit CPU
> caches etc. - as it is now, their speed drops by a factor of around 3 simply
> by transpose()ing.

Because it is not trivial to do so in all cases, I guess. It is a problem that
comes up from time to time on the mailing list, but AFAIK nobody has a fix for
it. Fundamentally, for many element-wise operations, you either have to
implement the operation separately for every possible memory layout, or you
abstract it behind an iterator, which costs performance in some cases. There
are also cases where the current implementation is far from optimal, for lack
of manpower I guess: a look at PyArray_Mean, for example, shows that it uses
PyArray_GenericReduceFunction, which is really slow compared to a straight C
implementation.

> Similarly (but even less logically), copy() and even the
> constructor are affected (yes, I understand that copy() creates contiguous
> arrays, but shouldn't it respect/retain the order nevertheless?):

I don't see why it is illogical: copy() does not preserve the memory layout (by
default it returns a C-ordered array), so for a Fortran-ordered source a simple
memcpy of the whole buffer is not possible.
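For illustration, here is a small sketch of what is going on with the memory
layout (plain NumPy; the array names are made up and the shapes just mirror the
benchmarks below):

    import numpy

    a = numpy.zeros((256, 256, 256))      # C-contiguous: the last axis is adjacent in memory
    print(a.flags['C_CONTIGUOUS'])        # True

    # transpose() only swaps the strides; no data is moved, but the result is
    # no longer C-contiguous, so a loop written in C memory order now jumps
    # through the buffer and loses cache locality.
    at = a.transpose()
    print(at.flags['C_CONTIGUOUS'])       # False
    print(at.flags['F_CONTIGUOUS'])       # True

    # copy() returns a C-ordered array by default, so copying a Fortran-ordered
    # array is an element-by-element strided gather, not a single memcpy.
    af = numpy.asfortranarray(a)
    c = af.copy()
    print(c.flags['C_CONTIGUOUS'])        # True, even though af was Fortran-ordered

The same strided traversal is what the generic iterator has to do for any
non-contiguous input, which is where the slowdown in the numbers below comes from.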
cheers,

David

> ### constructor ###
> In [89]: %timeit -r 10 -n 1000000 numpy.ndarray((3,3,3))
> 1000000 loops, best of 10: 1.19 µs per loop
>
> In [90]: %timeit -r 10 -n 1000000 numpy.ndarray((3,3,3), order="f")
> 1000000 loops, best of 10: 2.19 µs per loop
>
> ### copy 3x3x3 array ###
> In [85]: a = numpy.ndarray((3,3,3))
>
> In [86]: %timeit -r 10 a.copy()
> 1000000 loops, best of 10: 1.14 µs per loop
>
> In [87]: a = numpy.ndarray((3,3,3), order="f")
>
> In [88]: %timeit -r 10 -n 1000000 a.copy()
> 1000000 loops, best of 10: 3.39 µs per loop
>
> ### copy 256x256x256 array ###
> In [74]: a = numpy.ndarray((256,256,256))
>
> In [75]: %timeit -r 10 a.copy()
> 10 loops, best of 10: 119 ms per loop
>
> In [76]: a = numpy.ndarray((256,256,256), order="f")
>
> In [77]: %timeit -r 10 a.copy()
> 10 loops, best of 10: 274 ms per loop
>
> ### fill ###
> In [79]: a = numpy.ndarray((256,256,256))
>
> In [80]: %timeit -r 10 a.fill(0)
> 10 loops, best of 10: 60.2 ms per loop
>
> In [81]: a = numpy.ndarray((256,256,256), order="f")
>
> In [82]: %timeit -r 10 a.fill(0)
> 10 loops, best of 10: 60.2 ms per loop
>
> ### power ###
> In [151]: a = numpy.ndarray((256,256,256))
>
> In [152]: %timeit -r 10 a ** 2
> 10 loops, best of 10: 124 ms per loop
>
> In [153]: a = numpy.asfortranarray(a)
>
> In [154]: %timeit -r 10 a ** 2
> 10 loops, best of 10: 458 ms per loop
>
> ### addition ###
> In [160]: a = numpy.ndarray((256,256,256))
>
> In [161]: %timeit -r 10 a + 1
> 10 loops, best of 10: 139 ms per loop
>
> In [162]: a = numpy.asfortranarray(a)
>
> In [163]: %timeit -r 10 a + 1
> 10 loops, best of 10: 465 ms per loop
>
> ### fft ###
> In [146]: %timeit -r 10 numpy.fft.fft(vol, axis=0)
> 10 loops, best of 10: 1.16 s per loop
>
> In [148]: %timeit -r 10 numpy.fft.fft(vol0, axis=2)
> 10 loops, best of 10: 1.16 s per loop
>
> In [149]: vol.flags
> Out[149]:
>   C_CONTIGUOUS : True
>   F_CONTIGUOUS : False
>   OWNDATA : True
>   WRITEABLE : True
>   ALIGNED : True
>   UPDATEIFCOPY : False
>
> In [150]: vol0.flags
> Out[150]:
>   C_CONTIGUOUS : False
>   F_CONTIGUOUS : True
>   OWNDATA : False
>   WRITEABLE : True
>   ALIGNED : True
>   UPDATEIFCOPY : False
>
> In [9]: %timeit -r 10 numpy.fft.fft(vol0, axis=0)
> 10 loops, best of 10: 939 ms per loop
>
> ### mean ###
> In [173]: %timeit -r 10 vol.mean()
> 10 loops, best of 10: 272 ms per loop
>
> In [174]: %timeit -r 10 vol0.mean()
> 10 loops, best of 10: 683 ms per loop
>
> ### max ###
> In [175]: %timeit -r 10 vol.max()
> 10 loops, best of 10: 63.8 ms per loop
>
> In [176]: %timeit -r 10 vol0.max()
> 10 loops, best of 10: 475 ms per loop
>
> ### min ###
> In [177]: %timeit -r 10 vol.min()
> 10 loops, best of 10: 63.8 ms per loop
>
> In [178]: %timeit -r 10 vol0.min()
> 10 loops, best of 10: 476 ms per loop
>
> ### rot90 ###
> In [10]: %timeit -r 10 numpy.rot90(vol)
> 100000 loops, best of 10: 6.97 µs per loop
>
> In [12]: %timeit -r 10 numpy.rot90(vol0)
> 100000 loops, best of 10: 6.92 µs per loop

_______________________________________________
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion