Eero Tamminen wrote:

That makes the comparison with memcpy somewhat unfair, since you are not actually providing replacement functions, so this would only make a difference for -O3 type optimisation (where you trade size for speed); it would be interesting to see what the performance difference is if you add the C prologue and epilogue.

One should also remember that inlining functions increases the code size. On trivially sized test programs this is not an issue, but in real programs it is, especially with the RAM and cache sizes that ARM has.

Sometimes inlining makes sense, sometimes it does not. In my case
(blitting code for the Allegro game programming library) it does. Just
quoting myself:
Also, just improving glibc might not give the best results. Imagine
code for blitting 16bpp bitmaps. It contains a tight loop that copies
pixels one line at a time. If we need the best possible performance,
especially for small bitmaps only a few pixels wide, the extra overhead
of a memcpy function call plus an extra check for alignment (which is
known to be 16-bit in this case) might make a noticeable difference. So
directly inlining the code from that 'memcpy16' macro will be better in
this case.
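
To make the blitting case concrete, here is a minimal sketch in plain C.
It is not the 'memcpy16' macro itself: copy_pixels16 and blit16 are
hypothetical names used only for illustration, and the sketch assumes
2-byte aligned, non-overlapping buffers with pitches given in pixels. It
only shows the shape of the inner loop when the copy is inlined, so no
function call or alignment check happens per line.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an inlined 16-bit copy: copies n pixels.
 * Assumes dst and src are 2-byte aligned and do not overlap, so no
 * runtime alignment check is needed. */
static inline void copy_pixels16(uint16_t *dst, const uint16_t *src, size_t n)
{
    while (n--)
        *dst++ = *src++;
}

/* Blit a w x h block of 16bpp pixels, one line at a time.
 * Pitches are given in pixels, not bytes (an assumption of this sketch). */
static void blit16(uint16_t *dst, size_t dst_pitch,
                   const uint16_t *src, size_t src_pitch,
                   size_t w, size_t h)
{
    for (size_t y = 0; y < h; y++)
        copy_pixels16(dst + y * dst_pitch, src + y * src_pitch, w);
}
```

For a bitmap only a few pixels wide, the per-line cost here is just the
loop itself, which is where the call overhead of a generic memcpy would
otherwise show up.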


By the way, I searched for asm-optimized versions of memcpy for ARM
platforms. I had not done that before because I mistakenly assumed the
glibc memcpy/memset implementations were already optimized as much as
possible.

It appears that there is a fast memcpy implementation in uclibc, and
there are also many other implementations around. Seems like I tried to
reinvent the wheel. Too bad if it turns out that spending two whole days
of the weekend was a waste of time :( Well, at least I did not try
to steal someone else's code and 'copyright' it.

As I said before, my observations show that it is better to align
writes on 16-byte boundaries, at least on the Nokia 770. The code I have
posted is proof-of-concept code, and it shows that it is faster than the
default memset/memcpy on the device. I'm going to compare my code with
the uclibc implementation. If uclibc is in fact faster or has the same
performance, I'll have to apologize for causing this mess and go away
ashamed.
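
For anyone who wants to see what aligning writes on 16-byte boundaries
means in practice, here is a rough sketch in portable C. It is not the
code I posted: fill_aligned16 is a hypothetical name, and the 16-byte
body is written with a fixed-size memcpy where an optimized version
would presumably use assembly. It only illustrates the structure: an
unaligned head, a body of 16-byte chunks that each start on a 16-byte
boundary, and a tail.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fill n bytes at dst with value c so that the
 * bulk of the writes are 16-byte chunks starting on a 16-byte boundary. */
static void fill_aligned16(void *dst, unsigned char c, size_t n)
{
    unsigned char *p = dst;
    unsigned char pattern[16];

    memset(pattern, c, sizeof pattern);

    /* Head: single-byte writes until p reaches a 16-byte boundary. */
    while (n > 0 && ((uintptr_t)p & 15) != 0) {
        *p++ = c;
        n--;
    }

    /* Body: one aligned 16-byte chunk per iteration. The fixed-size
     * memcpy keeps this portable; it stands in for a burst write. */
    while (n >= 16) {
        memcpy(p, pattern, 16);
        p += 16;
        n -= 16;
    }

    /* Tail: the remaining 0..15 bytes. */
    while (n > 0) {
        *p++ = c;
        n--;
    }
}
```

Whether this layout actually wins on a given device has to be measured
there. The sketch makes no performance claim of its own.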

In any case, the performance of memcpy/memset on the default Nokia 770
image is far from optimal. And considering that the device is certainly
not overpowered, improvements in this area would probably help. I just
checked the GTK sources: memcpy is used in a lot of places, though I
don't know whether it affects performance much. Is it something worth
investigating by Nokia developers?


_______________________________________________
maemo-developers mailing list
maemo-developers@maemo.org
https://maemo.org/mailman/listinfo/maemo-developers
