Hi,
I posted a pull request with a minor change that instructs GCC to
unroll the strided copy loops (GCC will almost never do that on its own,
not even at -O3).

https://github.com/numpy/numpy/pull/3429
It improves the performance of these copies by 20%-50%, depending on the
size of the data copied (once the data no longer fits in any CPU cache
you don't gain anything), on a couple of machines (AMD Phenom X4, Intel
Core 2 Duo, Xeon 7xxx/5xxx).
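
For anyone unfamiliar with the technique, here is a minimal sketch of
asking GCC to unroll the loops of a single function. It is only an
illustration and not the code from the PR: the macro name UNROLL_LOOPS
and the function strided_copy_8bytes are made up for this example. It
assumes GCC's optimize function attribute, which other compilers do not
understand, so the macro expands to nothing elsewhere.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical macro name, chosen for this sketch only.
 * GCC accepts optimize("unroll-loops") as a per-function request;
 * for every other compiler the macro is empty. */
#if defined(__GNUC__) && !defined(__clang__)
#define UNROLL_LOOPS __attribute__((optimize("unroll-loops")))
#else
#define UNROLL_LOOPS
#endif

/* Copy n elements of 8 bytes each. Strides are given in bytes, so the
 * source and destination may both be non-contiguous. */
UNROLL_LOOPS
static void
strided_copy_8bytes(char *dst, ptrdiff_t dst_stride,
                    const char *src, ptrdiff_t src_stride, size_t n)
{
    while (n > 0) {
        /* Fixed-size memcpy avoids alignment and aliasing problems. */
        memcpy(dst, src, 8);
        dst += dst_stride;
        src += src_stride;
        --n;
    }
}
```

Calling it with a source stride of 2 * sizeof(double) and a destination
stride of sizeof(double) copies every second double of the source into a
contiguous buffer. Whether the unrolled version is actually faster is the
open question, which is why measurements on different CPUs matter.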

As overriding the compiler's decision is always dodgy, I would like some
numbers on a couple of CPU types to decide if it's really a good idea.
So if you have the time, please try the pull and the benchmark in its
first comment, and report in the PR the difference in performance
between the pull and the unchanged numpy git head.
Please include your CPU, GCC version and architecture (32 bit or 64 bit).
The benchmark can be run with ipython:
irunner --ipython bench.py


Cheers,
Julian
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
