Hi, I posted a pull request with a minor change that instructs GCC to unroll the strided copy loops (GCC will almost never do that on its own, not even at -O3).
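For anyone unfamiliar with the mechanism, here is a minimal sketch of what such a per-function hint can look like, assuming GCC's `optimize` function attribute. The macro name `HINT_UNROLL_LOOPS` and the function `strided_copy_f64` are made up for illustration and are not taken from the pull request; the actual change may be structured differently.

```c
#include <stddef.h>

/* Hypothetical macro name (not from the pull request): expands to GCC's
 * per-function optimize attribute, and to nothing on other compilers. */
#if defined(__GNUC__) && !defined(__clang__)
#define HINT_UNROLL_LOOPS __attribute__((optimize("unroll-loops")))
#else
#define HINT_UNROLL_LOOPS
#endif

/* Illustrative strided copy: copies n doubles from src to dst.
 * Strides are given in elements, not bytes. */
HINT_UNROLL_LOOPS
void strided_copy_f64(double *dst, ptrdiff_t dst_stride,
                      const double *src, ptrdiff_t src_stride,
                      size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[(ptrdiff_t)i * dst_stride] = src[(ptrdiff_t)i * src_stride];
    }
}
```

The attribute only asks GCC to apply `-funroll-loops` to this one function; whether that helps still depends on the CPU and on how much data is copied, which is why measurements are needed.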
https://github.com/numpy/numpy/pull/3429

It improves the performance of these copies by 20%-50%, depending on the size of the data copied (once the data no longer fits in any CPU cache you don't gain anything), on a couple of machines (AMD Phenom X4, Intel Core 2 Duo, Xeon 7xxx/5xxx).

As overriding the compiler's decision is always dodgy, I would like some numbers from a few more CPU types to decide whether it's really a good idea. So if you have the time, please try the pull request and the benchmark in its first comment, and report in the PR the difference in performance between the pull request and the unchanged numpy git head. Please include your CPU, GCC version and architecture (32 bit or 64 bit).

The benchmark can be run with ipython:

irunner --ipython bench.py

Cheers,
Julian