On 4/30/07, Daniel Stone <[EMAIL PROTECTED]> wrote:

> > There are two important optimizations in this code:
> > 1. Cache prefetch with PLD instruction (added in '_armv5' version) which
> > boosts performance to 70 megapixels per second. Inner loop is unrolled
> > to process 32 pixels per iteration (cache line size is 32 bytes on ARM, so
> > such unrolling is convenient). This is the most important improvement.
> > You can try using __builtin_prefetch() from C code to do the same
> > optimization.
>
> Ah, sounds useful.  From what Dan Amelang's been saying on xorg@, gcc
> should coalesce four 32-bit reads into one 128-bit read, but this sounds
> promising as well.

To expand on this: I was referring to the fact that gcc is pretty smart
about using ldmia/stmia instructions to cluster sequential
reads/writes. I see that Siarhei is already using this technique in
his assembler code, so nothing new here.
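
For illustration only, here is a minimal C sketch of the two ideas discussed
above: a prefetch hint via __builtin_prefetch() and an inner loop that moves
one 32-byte cache line per iteration using sequential word accesses. The
function name copy_words_prefetch and the one-line prefetch distance are
assumptions made for this example, not taken from Siarhei's code. Whether
gcc actually emits PLD or ldmia/stmia for it depends on the compiler
version, target, and flags, so it is worth checking the output of "gcc -S".

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the original patch): copies `words`
 * 32-bit words from src to dst, 8 words (32 bytes) per iteration. */
void copy_words_prefetch(uint32_t *dst, const uint32_t *src, size_t words)
{
    while (words >= 8) {
        /* Hint that the next 32-byte line will be read soon. The distance
         * of one line ahead is an assumption and would need tuning. */
        __builtin_prefetch(src + 8, 0, 0);

        /* Sequential loads followed by sequential stores: the access
         * pattern a compiler may cluster into multi-word load/store
         * instructions on ARM. */
        uint32_t a = src[0], b = src[1], c = src[2], d = src[3];
        uint32_t e = src[4], f = src[5], g = src[6], h = src[7];

        dst[0] = a; dst[1] = b; dst[2] = c; dst[3] = d;
        dst[4] = e; dst[5] = f; dst[6] = g; dst[7] = h;

        src += 8;
        dst += 8;
        words -= 8;
    }

    /* Copy the remaining tail (fewer than 8 words) one word at a time. */
    while (words > 0) {
        *dst++ = *src++;
        words--;
    }
}
```

The prefetch is only a hint, so the copy stays correct on targets where the
compiler ignores it; any speedup has to be measured on the actual hardware.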

Dan
_______________________________________________
maemo-developers mailing list
maemo-developers@maemo.org
https://maemo.org/mailman/listinfo/maemo-developers
