[maemo-developers] Optimized memory copying functions for Nokia 770 (final part)

2006-12-04 Thread Siarhei Siamashka
Hello All,

Here is an old link with some benchmarks and initial information:
http://maemo.org/pipermail/maemo-developers/2006-March/003269.html

Now, for completeness, a memcpy equivalent is also available, and the
functions exist in two flavours (either gcc inline macros or plain
assembly code). All the sources are here:
https://garage.maemo.org/plugins/scmsvn/viewcvs.php/trunk/fastmem-arm9/?root=mplayer

The easiest way to try this code is to just link 'fastmem-arm9.S' with your
code; it will override the glibc 'memcpy' and 'memset' functions with this
optimized implementation. But it will probably not affect code contained in
other shared libraries, so for example SDL will still most likely use the
functions from glibc. If you decide to try the gcc inline macros, be aware
that this may not be safe; beware of compiler bugs, more details and
testcases are here:
https://maemo.org/bugzilla/show_bug.cgi?id=733
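
For illustration, a minimal test program could look like this (the file
names and the gcc command in the comment are just assumptions, adjust them
for your own project):

#include <string.h>
#include <stdint.h>

/* Minimal sketch; build with something like
 *   gcc -O2 demo.c fastmem-arm9.S -o demo
 * Because fastmem-arm9.S defines 'memcpy' and 'memset' itself, the calls
 * below resolve to the optimized implementations instead of the glibc
 * ones; code inside other shared libraries (e.g. SDL) still uses glibc. */
int main(void)
{
    static uint8_t src[1 << 20], dst[1 << 20];

    memset(src, 0x55, sizeof(src));    /* overridden memset */
    memcpy(dst, src, sizeof(dst));     /* overridden memcpy */
    return dst[sizeof(dst) - 1] == 0x55 ? 0 : 1;
}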

Anyway, this code may be useful for various games, emulators or any
software that needs to clear/initialize or copy large memory blocks
fast. So those who are interested may scavenge something useful there :)

At least adding a variation of this code to the allegro game programming
library for its bitmap blitting/clearing functions improved the framerate
in ufo2000 quite a lot. Sure, that's partly because of the nonoptimal full
screen update method, which is not very fast or battery friendly anyway and
should be changed to update only the parts of the screen that have changed.
But sometimes you may have to update the full screen anyway, for example
when it is filled with fire and smoke animation. So having fast bitmap
blitting code, and being able to just update the full screen with no
performance problems, may be a good thing.

The technical explanation (at least my understanding of it) is the following.
The Nokia 770 cpu has a small amount of write-back cache, but it is not
write-allocate. That means that if a memory block is already cached, a write
operation is fast and the data is stored immediately to cache. But if a
memory block is not cached, it can get into the cpu data cache only after a
read operation, not a write (read-allocate cache behaviour). If the
destination buffer is not in cache, writes to it go directly to memory
through the write buffer. Transfers to memory are performed in blocks of 4,
16, or 32 bytes, and these blocks must be aligned. See '5.7 TCM write buffer'
and '6.2.2 Transfer size' in http://www.arm.com/pdfs/DDI0198D_926_TRM.pdf
So if you write to memory one byte at a time, memory bandwidth is wasted (you
get only one byte written per memory bus transfer, while you could easily get
4 bytes written instead). Here is the worst possible memcpy implementation,
for example; if you benchmark it, you will get some interesting numbers:

#include <stdint.h>

/* Copies one byte per iteration; every byte costs a separate memory
   bus transfer when the destination buffer is not in cache. */
void memcpy_trivial(uint8_t *dst, const uint8_t *src, int count)
{
    while (--count >= 0) *dst++ = *src++;
}

But the best performance is achieved when using 16-byte transfers (aligned
on a 16-byte boundary, otherwise they just get split into several 4-byte
transfers). This can't be expressed in C; the assembly STM instruction with
4 registers as operands is needed (or any number of registers that is a
multiple of 4).
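
To illustrate the idea, the inner copy loop can look roughly like the
following sketch using gcc inline assembler (my assumptions: both buffers
are 16-byte aligned and the size is a multiple of 16 bytes; the real code
in fastmem-arm9.S additionally handles the unaligned leading/trailing
parts):

#include <stdint.h>

/* Copy 'count' bytes in 16-byte LDM/STM bursts. Assumes dst and src
   are 16-byte aligned and count is a multiple of 16. */
static void copy16_aligned(uint32_t *dst, const uint32_t *src, int count)
{
    while (count > 0) {
        __asm__ volatile(
            "ldmia %1!, {r4-r7}\n\t"   /* read 16 bytes, advance src */
            "stmia %0!, {r4-r7}\n\t"   /* write 16 bytes as one aligned burst */
            : "+r" (dst), "+r" (src)
            :
            : "r4", "r5", "r6", "r7", "memory");
        count -= 16;
    }
}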
___
maemo-developers mailing list
maemo-developers@maemo.org
https://maemo.org/mailman/listinfo/maemo-developers


Re: [maemo-developers] Optimized memory copying functions for Nokia 770

2006-03-17 Thread Siarhei Siamashka

Siarhei Siamashka wrote:

 ...

It is strange that such a 16-byte alignment trick was not used in
uclibc or in glibc until now. Another possibility is that this improvement
is Nokia 770 specific and nobody else ever encountered it or had to
use it. Well, do we really care anyway? ;)

Now I just really badly want to see the benchmark results from some
other cpu, preferably intel xscale :)


Just got a report from running my test on a Sharp Zaurus SL-C760:

--- running correctness tests ---
all the correctness tests passed
--- running performance tests (memory bandwidth benchmark) ---:
memset() memory bandwidth: 80.35MB/s
memset8() memory bandwidth: 83.55MB/s
memcpy() memory bandwidth (perfectly aligned): 45.29MB/s
memcpy16() memory bandwidth (perfectly aligned): 45.20MB/s
memcpy() memory bandwidth (16-bit aligned): 43.15MB/s
memcpy16() memory bandwidth (16-bit aligned): 38.27MB/s
--- testing performance for random blocks (size 0-15 bytes) ---
memset time: 0.960
memset8 time: 0.880
--- testing performance for random blocks (size 0-511 bytes) ---
memset time: 3.840
memset8 time: 3.670

So the standard memory copying functions are already optimal on this
Zaurus, and my implementation only causes a performance degradation :)

There are two possibilities now:
1. This particular Zaurus has a much better memcpy implementation worth
   looking at
2. Optimizations for memcpy are very cpu dependent, and good code for
   the Nokia does not necessarily work well for the Zaurus and vice versa.

PS. The Nokia seems to have much faster memory than the Zaurus, by the way :)


___
maemo-developers mailing list
maemo-developers@maemo.org
https://maemo.org/mailman/listinfo/maemo-developers


Re: [maemo-developers] Optimized memory copying functions for Nokia 770

2006-03-14 Thread Tomas Frydrych
There seems to be no source for the functions in the tarball.

Tomas

Siarhei Siamashka wrote:
 Hello All,
 
 Here are the optimized memory copying functions for Nokia 770 (memset is
 more than twice as fast, and memcpy improves by about 10-40% depending on
 the relative alignment of the data blocks).
 
 http://ufo2000.sourceforge.net/files/fastmem-arm-20060312.tar.gz
 
 These functions were created as an attempt to experiment with getting
 maximum memory bandwidth on the Nokia 770 (powered by a TI OMAP1710), and
 also to learn ARM assembler in the process. Getting maximum memory bandwidth
 utilization is needed for 2D games and probably other applications which
 need to process a lot of multimedia data. I'm particularly interested in
 getting the best performance for the Allegro game programming library
 (http://alleg.sourceforge.net) on Nokia 770, and that was the motivation
 for writing this code.
 
 After a few experiments with reading/writing memory using a different data
 size for each memory access operation, it appears that writing in bigger
 chunks matters much more than reading in bigger chunks: writing 16 bits
 per memory access is usually twice as fast as writing 8 bits, and 32-bit
 memory access is again twice as fast as 16-bit access. There is no such
 significant performance degradation for reading in smaller chunks, so
 optimizing reads seems to be less important. After some more half-empirical
 experiments with writing to memory, it seems that the best memory bandwidth
 is achieved by using 16-byte burst writes, aligned on a 16-byte boundary,
 using the STM instruction. This seems to provide at least twice the memory
 bandwidth utilization of the standard 'memset' function on Nokia 770. Having
 such fantastic results, I decided to try making some optimized functions
 that can serve as a replacement for the standard memset/memcpy functions.
 The aligned 16-byte write with the STM instruction is the core part of all
 these functions; all the rest of the code deals with leading/trailing
 unaligned data chunks.
 
 It implements the following functions (see more detailed comments in the
 code):
 memset8, memset16, memset32 - replacements for memset, optimized
   for different alignments
 memcpy16, memcpy32  - replacements for memcpy, optimized
   for different alignments
 
 A testing framework is included, which makes it possible to verify that
 this code produces correct results and is also really fast. In order to
 run the tests, the file should be compiled as C source with the
 FASTMEM_ARM_TEST_FRAMEWORK macro defined.
 
 Requirements for running this code: little endian ARM v4 compatible cpu
 
 Results from my Nokia 770 are the following:
 
--- running correctness tests ---
all the correctness tests passed
--- running performance tests (memory bandwidth benchmark) ---:
memset() memory bandwidth: 121.22MB/s
memset8() memory bandwidth: 275.94MB/s
memcpy() memory bandwidth (perfectly aligned): 104.86MB/s
memcpy16() memory bandwidth (perfectly aligned): 113.98MB/s
memcpy() memory bandwidth (16-bit aligned): 70.37MB/s
memcpy16() memory bandwidth (16-bit aligned): 101.31MB/s
--- testing performance for random blocks (size 0-15 bytes) ---
memset time: 0.410
memset8 time: 0.260
--- testing performance for random blocks (size 0-511 bytes) ---
memset time: 2.360
memset8 time: 1.140
 
 TODO:
1. implement memcpy8 function (direct replacement for memcpy)
2. provide big endian support (currently the code is little endian)
3. investigate possibilities for getting the best performance
   on short buffer sizes
4. better testing in the real world and on different ARM-based devices
 
 I'm especially interested in getting feedback from running this code on
 different devices. It is quite possible that these functions are only
 optimal for the OMAP1710 and provide no benefit on other devices.
 
 Currently this code improves the Allegro game programming library's
 performance quite a lot (in my not yet finished patch), but it might
 also be used for SDL. It would be interesting to see whether using these
 functions can improve GTK performance as well. In that case we could get
 a nice improvement in user interface responsiveness.
 
 As soon as a complete replacement for memcpy (memcpy8) is done, it can
 probably also be used as a patch for glibc to improve the performance of
 all programs automagically.
 
 Waiting for feedback, suggestions and test results on other ARM devices
 (not only Nokia 770).
 
 

___
maemo-developers mailing list
maemo-developers@maemo.org
https://maemo.org/mailman/listinfo/maemo-developers


Re: [maemo-developers] Optimized memory copying functions for Nokia 770

2006-03-14 Thread Siarhei Siamashka

Tomas Frydrych wrote:


There seems to be no source for the functions in the tarball.



Siarhei Siamashka wrote:

Hello All,

Here are the optimized memory copying functions for Nokia 770 
(memset is more than twice as fast, memcpy improves by about 10-40% 
depending on the relative alignment of the data blocks).


http://ufo2000.sourceforge.net/files/fastmem-arm-20060312.tar.gz

...


Like Dirk already replied, the implementation is in macros in the .h
file. I'm sorry for not providing detailed instructions about using this
tarball. Here they are:

# wget http://ufo2000.sourceforge.net/files/fastmem-arm-20060312.tar.gz
# tar -xzf fastmem-arm-20060312.tar.gz
# cd fastmem-arm

Now compile and run the test (in scratchbox, using the sbrsh cpu
transparency method):

# gcc -O2 -o fastmem-arm-test fastmem-arm-test.c
# ./fastmem-arm-test

If you want to use this optimized code in your programs, just add the
fastmem-arm.h file to your project and the following line to your
source files:

#include "fastmem-arm.h"

And now you can use these functions (which are provided as a set of
macros using inline assembler, so they are all contained within the
fastmem-arm.h file, which is their source). The simplest one to use is
'memset8': it is a direct replacement for 'memset' and can be used
instead of it for a huge performance boost.

The functions are optimized for different alignments, for example:
uint16_t *memcpy16(uint16_t *dst, uint16_t *src, int count)

It only copies buffers of 16-bit data, but that still makes it usable for
fast copying of 16-bit pixel data (as the Nokia 770 uses a 16-bit display).
I can make a 'memcpy8' function later, but expect its source to roughly
double in size and become much more complicated (because of more complicated
handling of leading/trailing bytes and 2 more relative alignment
combinations). It will take some time.
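
As a quick illustration, a 16bpp scanline copy could look like the sketch
below (copy_scanline is just a hypothetical helper; whether 'count' is in
16-bit elements or in bytes should be checked against the comments in
fastmem-arm.h, here I assume elements):

#include <stdint.h>
#include "fastmem-arm.h"   /* provides the memcpy16 macro */

/* Copy one scanline of 16bpp pixel data from a back buffer to the
   frame buffer; 'width' is the number of pixels. */
static void copy_scanline(uint16_t *fb, uint16_t *backbuf, int width)
{
    memcpy16(fb, backbuf, width);
}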

Hope this information helps. Still waiting for feedback :)


___
maemo-developers mailing list
maemo-developers@maemo.org
https://maemo.org/mailman/listinfo/maemo-developers


Re: [maemo-developers] Optimized memory copying functions for Nokia 770

2006-03-14 Thread Tomas Frydrych

 Like Dirk already replied, the implementation is in macros in the .h
 file.

I see. That makes the comparison with memcpy somewhat unfair, since you
are not actually providing replacement functions, so this would only
make a difference for -O3 type optimisation (where you trade code size
for speed); it would be interesting to see what the performance difference
is if you add the C prologue and epilogue.

BTW, you can instruct gcc to use an inlined assembler version of its memcpy
and friends as well; I think -O3 includes this. But if I read
bits/string.h correctly in my sbox, there are no such inlined functions
on ARM, so there is certainly value in doing this.

Tomas
___
maemo-developers mailing list
maemo-developers@maemo.org
https://maemo.org/mailman/listinfo/maemo-developers


Re: [maemo-developers] Optimized memory copying functions for Nokia 770

2006-03-14 Thread Eero Tamminen
Hi,

 That makes the comparison with memcpy somewhat unfair, since you
 are not actually providing replacement functions, so this would only
 make a difference for -O3 type optimisation (where you trade code size
 for speed); it would be interesting to see what the performance difference
 is if you add the C prologue and epilogue.

One should also remember that inlining functions increases the code
size.  On trivially sized test programs this is not an issue, but
in real programs it is, especially with the RAM and cache sizes
that ARM devices have.


 BTW, you can instruct gcc to use an inlined assembler version of its memcpy
 and friends as well; I think -O3 includes this. But if I read
 bits/string.h correctly in my sbox, there are no such inlined functions
 on ARM, so there is certainly value in doing this.

AFAIK gcc will use its own inline functions if the size is constant
(it doesn't come from the C library then).
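
For example (a small sketch only; the exact behaviour depends on the gcc
version and options, and the struct/function names are just illustrative):

#include <string.h>

struct rgb565_pair { unsigned short a, b; };

/* With a compile-time constant size, gcc typically expands the memcpy
   inline instead of calling the C library; with a runtime size it
   normally emits a real call to memcpy. */
static void copy_pair(struct rgb565_pair *dst, const struct rgb565_pair *src)
{
    memcpy(dst, src, sizeof(*dst));     /* constant size: usually inlined */
}

static void copy_bytes(char *dst, const char *src, unsigned n)
{
    memcpy(dst, src, n);                /* runtime size: usually a libc call */
}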


- Eero

___
maemo-developers mailing list
maemo-developers@maemo.org
https://maemo.org/mailman/listinfo/maemo-developers


Re: [maemo-developers] Optimized memory copying functions for Nokia 770

2006-03-14 Thread Siarhei Siamashka

Tomas Frydrych wrote:


Like Dirk already replied, the implementation is in macros in the .h
file.


I see. That makes the comparison with memcpy somewhat unfair, since you
are not actually providing replacement functions, so this would only
make a difference for -O3 type optimisation (where you trade code size
for speed); it would be interesting to see what the performance difference
is if you add the C prologue and epilogue.


Memory bandwidth benchmarking is done on a 2MB memory block, so the prologue
and epilogue code does not introduce any noticeable difference.

I have not paid much attention to optimizing the prologue/epilogue code yet;
it should make a difference for smaller buffer sizes, but that is on the
TODO list.


BTW, you can instruct gcc to use an inlined assembler version of its memcpy
and friends as well; I think -O3 includes this. But if I read
bits/string.h correctly in my sbox, there are no such inlined functions
on ARM, so there is certainly value in doing this.


Well, you have the source, so you can do your own benchmarks with
-O2 or -O3 or even -O9 and post them here :)


___
maemo-developers mailing list
maemo-developers@maemo.org
https://maemo.org/mailman/listinfo/maemo-developers


Re: [maemo-developers] Optimized memory copying functions for Nokia 770

2006-03-14 Thread Siarhei Siamashka

Eero Tamminen wrote:

That makes the comparison with memcpy somewhat unfair, since you 
are not actually providing replacement functions, so this would 
only make a difference for -O3 type optimisation (where you trade 
code size for speed); it would be interesting to see what the 
performance difference is if you add the C prologue and epilogue.


One should also remember that inlining functions increases the code 
size.  On trivially sized test programs this is not an issue, but in 
real programs it is, especially with the RAM and cache sizes that ARM
devices have.


Sometimes inlining makes sense, sometimes it does not. In my case
(blitting code for the Allegro game programming library) it does; just
quoting myself:

Also, just improving glibc might not give the best results. Imagine
code for 16bpp bitmap blitting. It contains a tight loop copying
pixels one line at a time. If we need the best performance possible,
especially for small bitmaps with only a few horizontal pixels, the
extra overhead of a memcpy function call, plus the extra check for
alignment (which is known to be 16-bit in this case), might make a
noticeable difference. So directly inlining the code from the
'memcpy16' macro is better in this case.



By the way, I tried to search for asm-optimized versions of memcpy for
ARM platforms. I did not do that before, as my mistake was assuming that
the glibc memcpy/memset implementations were already optimized as much as
possible.

It appears that there is a fast memcpy implementation in uclibc, and there
are also many other implementations around. Seems like I tried to reinvent
the wheel. Too bad if it turns out that spending the whole two days of the
weekend on this was a useless waste of time :( Well, at least I did not try
to steal someone else's code and 'copyright' it.

As I said before, my observations show that it is better to align
writes on 16-byte boundaries, at least on Nokia 770. The code I have
posted is proof-of-concept code, and it shows that it is faster than the
default memset/memcpy on the device. I'm going to compare my code with the
uclibc implementation; if uclibc is in fact faster or has the same
performance, I'll have to apologize for causing this mess and go away
ashamed.

In any case, the performance of memcpy/memset on the default Nokia 770 image
is far from optimal. And considering that the device is certainly not
overpowered, improvements in this area would probably help. I just checked
the GTK sources: memcpy is used in a lot of places, though I don't know
whether it affects performance much. Is this something worth investigating
by Nokia developers?


___
maemo-developers mailing list
maemo-developers@maemo.org
https://maemo.org/mailman/listinfo/maemo-developers