On 4/30/07, Daniel Stone <[EMAIL PROTECTED]> wrote:
> There are two important optimizations in this code:
> 1. Cache prefetch with PLD instruction (added in '_armv5' version) which
> boosts performance to 70 megapixels per second. Inner loop is unrolled
> to process 32 pixels per iteration (cache line size is 32 bytes on ARM, so
> such unrolling is convenient). This is the most important improvement.
> You can try using __builtin_prefetch() from C code to do the same
> optimization.

Ah, sounds useful. From what Dan Amelang's been saying on xorg@, gcc
should coalesce four 32-bit reads into one 128-bit read, but this sounds
promising as well.

To expand on this: I was referring to the fact that gcc is pretty smart
about using ldmia/stmia instructions to cluster sequential reads/writes.
I see that Siarhei is already using this technique in his assembler code,
so nothing new here.

Dan
_______________________________________________
maemo-developers mailing list
maemo-developers@maemo.org
https://maemo.org/mailman/listinfo/maemo-developers
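
For reference, here is a rough sketch of what the __builtin_prefetch() suggestion quoted above might look like from C. It is not Siarhei's code: the function name copy_pixels_prefetch, the 16-bit pixel format, and the prefetch distance of four cache lines are all assumptions made for illustration, and the right distance would have to be found by measurement. Each iteration handles one assumed 32-byte cache line, and the fixed-size memcpy() gives gcc a chance to emit clustered ldmia/stmia accesses.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: 2-byte pixels, so 16 pixels fill one 32-byte cache line. */
#define PIXELS_PER_LINE 16
/* Assumption: prefetch four cache lines ahead; tune by measurement. */
#define PREFETCH_AHEAD (4 * PIXELS_PER_LINE)

/* Hypothetical helper: copy n pixels, one cache line per iteration. */
static void copy_pixels_prefetch(uint16_t *dst, const uint16_t *src, size_t n)
{
    size_t i = 0;

    for (; i + PIXELS_PER_LINE <= n; i += PIXELS_PER_LINE) {
        /* Only form the prefetch address while it stays inside the buffer. */
        if (i + PREFETCH_AHEAD < n) {
            /* rw = 0 (read), locality = 0 (streaming data, little reuse). */
            __builtin_prefetch(src + i + PREFETCH_AHEAD, 0, 0);
        }
        /* Constant-size copy; the compiler is free to use multi-word
           loads and stores here. */
        memcpy(dst + i, src + i, PIXELS_PER_LINE * sizeof *src);
    }

    /* Copy the remaining pixels that do not fill a whole cache line. */
    for (; i < n; i++) {
        dst[i] = src[i];
    }
}
```

On targets without a prefetch instruction, gcc is documented to drop the builtin silently, so the same source should still build there.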