[maemo-developers] Optimized memory copying functions for Nokia 770 (final part)
Hello All,

Here is an old link with some benchmarks and initial information: http://maemo.org/pipermail/maemo-developers/2006-March/003269.html

Now, for more completeness, a memcpy equivalent is also available, and the functions exist in two flavours (either gcc inline macros or plain assembly code); all the sources are here: https://garage.maemo.org/plugins/scmsvn/viewcvs.php/trunk/fastmem-arm9/?root=mplayer

The easiest way to try this code is to link 'fastmem-arm9.S' with your code; it will override the glibc 'memcpy' and 'memset' functions with this optimized implementation. It will probably not affect code contained in other shared libraries, though; for example, SDL will most likely still use the functions from glibc. If you decide to try the gcc inline macros, they may not be safe; beware of compiler bugs. More details and test cases are here: https://maemo.org/bugzilla/show_bug.cgi?id=733

Anyway, this code may be useful for games, emulators, or any software that needs to clear/initialize or copy large memory blocks fast, so those who are interested may scavenge something useful there :) At least, adding a variation of this code to the Allegro game programming library's bitmap blitting/clearing functions improved the framerate in ufo2000 quite a lot. Admittedly, that is because of a non-optimal full-screen update method, which is not very fast or battery friendly anyway and should be changed to update only the parts of the screen that actually changed. But sometimes you have to update the full screen anyway, for example when it is filled with fire and smoke animation, so having fast bitmap blitting code and being able to just update the full screen without performance problems can be a good thing.

A technical explanation (at least my understanding of it) is the following. The Nokia 770 CPU has a small amount of write-back cache, but it is not write-allocate.
That means that if a memory block is already cached, a write is fast and the data is stored immediately to cache. But if a memory block is not cached, it can get into the CPU data cache only after a read operation, not a write (read-allocate cache behaviour). If the destination buffer is not in cache, writes to it go directly to memory through the write buffer. Transfers to memory are performed in blocks of 4, 16, or 32 bytes, and these blocks must be aligned; see '5.7 TCM write buffer' and '6.2.2 Transfer size' in http://www.arm.com/pdfs/DDI0198D_926_TRM.pdf

So if you write to memory one byte at a time, memory bandwidth is wasted: you get only one byte written per memory bus transfer, while you could easily get 4 bytes written instead. Here is the worst possible memcpy implementation, for example; if you benchmark it, you will get some interesting numbers:

void memcpy_trivial(uint8_t *dst, uint8_t *src, int count)
{
    while (count--)
        *dst++ = *src++;
}

The best performance, however, is achieved with 16-byte transfers aligned on a 16-byte boundary (otherwise they are just split into 4-byte transfers). This cannot be expressed in C; it requires the assembly STM instruction with 4 registers as operands (or any number of registers that is a multiple of 4).

___
maemo-developers mailing list
maemo-developers@maemo.org
https://maemo.org/mailman/listinfo/maemo-developers
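The 16-byte transfer idea above can be approximated in portable C by copying four 32-bit words per loop iteration. This is only a sketch with a hypothetical name (`copy_aligned16`): a C compiler is not guaranteed to turn the four-word body into an actual LDM/STM burst, which is exactly why the real code drops to assembly.

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of copying 16 bytes (four 32-bit words) per iteration.
 * Assumes both buffers are 4-byte aligned; for the full benefit
 * on ARM926 they should be 16-byte aligned, and the loop body
 * should become one LDM + one STM in assembly. */
static void copy_aligned16(uint32_t *dst, const uint32_t *src, size_t nbytes)
{
    size_t nwords = nbytes / 4;

    /* main loop: 16 bytes per iteration */
    while (nwords >= 4) {
        dst[0] = src[0];
        dst[1] = src[1];
        dst[2] = src[2];
        dst[3] = src[3];
        dst += 4;
        src += 4;
        nwords -= 4;
    }

    /* trailing words */
    while (nwords--)
        *dst++ = *src++;
}
```

Even this portable version already avoids the byte-at-a-time write pattern that wastes bus transfers in memcpy_trivial.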
Re: [maemo-developers] Optimized memory copying functions for Nokia 770
Siarhei Siamashka wrote:
... It is strange that such a 16-byte alignment trick was used in neither uclibc nor glibc until now. One more option is that this improvement is Nokia 770 specific and nobody else ever encountered it or had to use it. Well, do we really care anyway? ;)

Now I just really badly want to see benchmark results from some other CPU, preferably an Intel XScale :) Just got a report from running my test on a Sharp Zaurus SL-C760:

--- running correctness tests ---
all the correctness tests passed
--- running performance tests (memory bandwidth benchmark) ---
memset() memory bandwidth: 80.35MB/s
memset8() memory bandwidth: 83.55MB/s
memcpy() memory bandwidth (perfectly aligned): 45.29MB/s
memcpy16() memory bandwidth (perfectly aligned): 45.20MB/s
memcpy() memory bandwidth (16-bit aligned): 43.15MB/s
memcpy16() memory bandwidth (16-bit aligned): 38.27MB/s
--- testing performance for random blocks (size 0-15 bytes) ---
memset time: 0.960
memset8 time: 0.880
--- testing performance for random blocks (size 0-511 bytes) ---
memset time: 3.840
memset8 time: 3.670

So the memory copying functions are already optimal on this Zaurus, and my implementation only causes a performance degradation :) There are two possibilities now:
1. This particular Zaurus has a much better memcpy implementation worth looking at.
2. memcpy optimizations are very CPU dependent, and good code for the Nokia does not necessarily work well on the Zaurus, and vice versa.

PS. The Nokia seems to have much faster memory than the Zaurus, by the way :)
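For readers who want to reproduce numbers like these on their own hardware, the measurement boils down to timing repeated writes over a large block. Here is a minimal sketch; the function name, block size, and iteration count are illustrative and not taken from the original fastmem-arm test program.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Time repeated memset() calls over a large block and report MB/s.
 * clock() measures CPU time, which is close enough for a memory
 * bandwidth test where the CPU is busy the whole time. */
static double memset_bandwidth_mb(size_t block_size, int iterations)
{
    char *buf = malloc(block_size);
    if (!buf)
        return 0.0;

    clock_t start = clock();
    for (int i = 0; i < iterations; i++)
        memset(buf, i & 0xff, block_size);
    clock_t end = clock();
    free(buf);

    double seconds = (double)(end - start) / CLOCKS_PER_SEC;
    if (seconds <= 0.0)
        return 0.0;
    return (double)block_size * iterations / (1024.0 * 1024.0) / seconds;
}
```

The same harness can be pointed at memset8() or memcpy16() to compare implementations on a new device.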
Re: [maemo-developers] Optimized memory copying functions for Nokia 770
There seems to be no source for the functions in the tarball.

Tomas

Siarhei Siamashka wrote:
Hello All,

Here are the optimized memory copying functions for Nokia 770 (memset is more than twice as fast, memcpy improves by about 10-40% depending on the relative alignment of the data blocks). http://ufo2000.sourceforge.net/files/fastmem-arm-20060312.tar.gz

These functions were created as an attempt to experiment with getting maximum memory bandwidth on the Nokia 770 (powered by a TI OMAP1710), and also to learn ARM assembler in the process. Maximum memory bandwidth utilization is needed for 2D games and probably other applications that process a lot of multimedia data. I'm particularly interested in getting the best performance out of the Allegro game programming library (http://alleg.sourceforge.net) on the Nokia 770, and that was the motivation for writing this code.

After a few experiments with reading/writing memory using different data sizes per memory access operation, it appears that writing in bigger chunks matters much more than it does for reading: writing 16 bits per memory access is usually twice as fast as writing 8 bits, and 32-bit access is in turn twice as fast as 16-bit access. There is no such significant performance degradation when reading in smaller chunks, so optimizing reads seems less important. After some more half-empirical experiments with writing to memory, it seems that the best memory bandwidth is achieved with 16-byte burst writes aligned on a 16-byte boundary using the STM instruction. This provides at least twice the memory bandwidth utilization of the standard 'memset' function on the Nokia 770.

Having such fantastic results, I decided to try making optimized functions that can serve as replacements for the standard memset/memcpy functions. An aligned 16-byte write with the STM instruction is the core of all these functions; the rest of the code deals with leading/trailing unaligned data chunks.
It implements the following functions (see more detailed comments in the code):
memset8, memset16, memset32 - replacements for memset, optimized for different alignments
memcpy16, memcpy32 - replacements for memcpy, optimized for different alignments

A testing framework is included, which makes it possible to verify that this code produces valid results and is also really fast. In order to run the tests, the file should be compiled as a C source with the FASTMEM_ARM_TEST_FRAMEWORK macro defined. Requirement for running this code: a little-endian ARMv4 compatible CPU.

Results from my Nokia 770 are the following:

--- running correctness tests ---
all the correctness tests passed
--- running performance tests (memory bandwidth benchmark) ---
memset() memory bandwidth: 121.22MB/s
memset8() memory bandwidth: 275.94MB/s
memcpy() memory bandwidth (perfectly aligned): 104.86MB/s
memcpy16() memory bandwidth (perfectly aligned): 113.98MB/s
memcpy() memory bandwidth (16-bit aligned): 70.37MB/s
memcpy16() memory bandwidth (16-bit aligned): 101.31MB/s
--- testing performance for random blocks (size 0-15 bytes) ---
memset time: 0.410
memset8 time: 0.260
--- testing performance for random blocks (size 0-511 bytes) ---
memset time: 2.360
memset8 time: 1.140

TODO:
1. implement a memcpy8 function (direct replacement for memcpy)
2. provide big endian support (currently the code is little endian only)
3. investigate possibilities for getting the best performance on short buffer sizes
4. better testing in the real world and on different ARM based devices

I'm especially interested in feedback from running this code on different devices. It is quite possible that these functions are only optimal for the OMAP1710 and do not provide any benefit on other devices. Currently this code improves Allegro game programming library performance quite a lot (in my not yet finished patch), but it might also be used for SDL. It would be interesting to know whether these functions can improve GTK performance as well.
In that case we could have a nice user interface responsiveness improvement. As soon as a complete replacement for memcpy (memcpy8) is done, it can probably also be used as a patch for glibc to improve the performance of all programs automagically.

Waiting for feedback, suggestions, and test results on other ARM devices (not only the Nokia 770).
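The leading/trailing alignment handling that surrounds the STM core can be illustrated in plain C for the memset16 case. This is only a sketch under stated assumptions: `memset16_sketch` is a hypothetical name, it stops at 4-byte word stores rather than the 16-byte assembly bursts the real code uses, and it relies on a gcc/clang `may_alias` attribute to make the 16-to-32-bit pointer cast legal.

```c
#include <stdint.h>
#include <stddef.h>

/* gcc/clang extension: allow stores through this type to alias
 * uint16_t objects without strict-aliasing trouble. */
typedef uint32_t __attribute__((may_alias)) u32a;

/* Duplicate a 16-bit value into a 32-bit word and store word-sized
 * chunks once the destination is 4-byte aligned; handle the odd
 * leading/trailing elements separately, as the real code does. */
static uint16_t *memset16_sketch(uint16_t *dst, uint16_t value, size_t count)
{
    uint16_t *ret = dst;

    /* leading element to reach 4-byte alignment */
    if (count && ((uintptr_t)dst & 2)) {
        *dst++ = value;
        count--;
    }

    u32a pair = ((uint32_t)value << 16) | value;
    u32a *dst32 = (u32a *)dst;
    while (count >= 2) {          /* two pixels per 32-bit store */
        *dst32++ = pair;
        count -= 2;
    }
    dst = (uint16_t *)dst32;

    /* trailing element */
    if (count)
        *dst = value;
    return ret;
}
```

Replacing the word-store loop with an aligned four-register STM (after further aligning to 16 bytes) is what turns this sketch into the fastmem-arm version.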
Re: [maemo-developers] Optimized memory copying functions for Nokia 770
Tomas Frydrych wrote:
> There seems to be no source for the functions in the tarball.

Siarhei Siamashka wrote:
> Hello All,
> Here are the optimized memory copying functions for Nokia 770 (memset is more than twice as fast, memcpy improves by about 10-40% depending on the relative alignment of the data blocks). http://ufo2000.sourceforge.net/files/fastmem-arm-20060312.tar.gz
> ...

Like Dirk already replied, the implementation is in macros in the .h file. I'm sorry for not providing detailed instructions for using this tarball. Here they are:

# wget http://ufo2000.sourceforge.net/files/fastmem-arm-20060312.tar.gz
# tar -xzf fastmem-arm-20060312.tar.gz
# cd fastmem-arm

Now compile and run the test (in scratchbox, using the sbrsh CPU transparency method):

# gcc -O2 -o fastmem-arm-test fastmem-arm-test.c
# ./fastmem-arm-test

If you want to use this optimized code in your programs, just add the fastmem-arm.h file to your project and the following line to your source files:

#include "fastmem-arm.h"

Now you can use these functions (which are provided as a set of macros using inline assembler, so they are all contained within fastmem-arm.h, which is their source). The simplest to use is 'memset8'; it is a direct replacement for 'memset' and can be used instead of it for a huge performance boost. The functions are optimized for different alignments, for example:

uint16_t *memcpy16(uint16_t *dst, uint16_t *src, int count)

It copies only 16-bit buffers, but it can still be used for fast copying of 16-bit pixel data (the Nokia 770 uses a 16-bit display). I can make a 'memcpy8' function later, but expect its source to roughly double in size and become much more complicated (because of more complicated handling of leading/trailing bytes and 2 more relative alignment combinations). It will take some time.

Hope this information helps. Still waiting for feedback :)
Re: [maemo-developers] Optimized memory copying functions for Nokia 770
> Like Dirk already replied, the implementation is in macros in the .h file.

I see. That makes the comparison with memcpy somewhat unfair, since you are not actually providing replacement functions, so this would only make a difference for -O3 style optimisation (where you trade size for speed); it would be interesting to see what the performance difference is if you add the C prologue and epilogue.

BTW, you can instruct gcc to use inlined assembler versions of memcpy and friends as well; I think -O3 includes this. But if I read bits/string.h correctly in my sbox, there are no such inlined functions on ARM, so there is certainly value in doing this.

Tomas
Re: [maemo-developers] Optimized memory copying functions for Nokia 770
Hi,

> That makes the comparison with memcpy somewhat unfair, since you are not actually providing replacement functions, so this would only make a difference for -O3 style optimisation (where you trade size for speed); it would be interesting to see what the performance difference is if you add the C prologue and epilogue.

One should also remember that inlining functions increases the code size. In trivially sized test programs this is not an issue, but in real programs it is, especially with the RAM and cache sizes that ARM devices have.

> BTW, you can instruct gcc to use inlined assembler versions of memcpy and friends as well; I think -O3 includes this. But if I read bits/string.h correctly in my sbox, there are no such inlined functions on ARM, so there is certainly value in doing this.

AFAIK gcc will use its own inline functions if the size is constant (the call doesn't come from the C library then).

- Eero
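The constant-size case Eero mentions is easy to demonstrate. In the sketch below (the struct and function names are made up for illustration), the size argument to memcpy is a compile-time constant, which is the situation where gcc typically expands the call inline via its builtin instead of calling into the C library:

```c
#include <string.h>

/* A fixed-size row of 16-bit pixels: sizeof(*src) is a
 * compile-time constant. */
struct pixel_row {
    unsigned short px[8];
};

/* With the constant 16-byte size, gcc can replace the memcpy call
 * with inline loads/stores; a variable size would normally go
 * through the library function instead. */
void copy_row(struct pixel_row *dst, const struct pixel_row *src)
{
    memcpy(dst, src, sizeof(*src));
}
```

Whether the expansion actually happens depends on the target and optimisation level; `-fno-builtin-memcpy` forces the library call either way.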
Re: [maemo-developers] Optimized memory copying functions for Nokia 770
Tomas Frydrych wrote:
> > Like Dirk already replied, the implementation is in macros in the .h file.
>
> I see. That makes the comparison with memcpy somewhat unfair, since you are not actually providing replacement functions, so this would only make a difference for -O3 style optimisation (where you trade size for speed); it would be interesting to see what the performance difference is if you add the C prologue and epilogue.

Memory bandwidth benchmarking is done on a 2MB memory block, so the prologue and epilogue code does not introduce any noticeable difference. I have not paid much attention to optimizing the prologue/epilogue code yet; it should make a difference for smaller buffer sizes, but that is on the TODO list.

> BTW, you can instruct gcc to use inlined assembler versions of memcpy and friends as well; I think -O3 includes this. But if I read bits/string.h correctly in my sbox, there are no such inlined functions on ARM, so there is certainly value in doing this.

Well, you have the source, so you can do your own benchmarks with -O2 or -O3 or even -O9 and post them here :)
Re: [maemo-developers] Optimized memory copying functions for Nokia 770
Eero Tamminen wrote:
> > That makes the comparison with memcpy somewhat unfair, since you are not actually providing replacement functions, so this would only make a difference for -O3 style optimisation (where you trade size for speed); it would be interesting to see what the performance difference is if you add the C prologue and epilogue.
>
> One should also remember that inlining functions increases the code size. In trivially sized test programs this is not an issue, but in real programs it is, especially with the RAM and cache sizes that ARM devices have.

Sometimes inlining makes sense, sometimes it does not. In my case (blitting code for the Allegro game programming library) it does; just quoting myself:

"Also, just improving glibc might not give the best results. Imagine code for blitting 16bpp bitmaps. It contains a tight loop copying pixels one line at a time. If we need the best possible performance, especially for small bitmaps with only a few horizontal pixels, the extra overhead of a memcpy function call, plus the extra alignment check (the alignment is known to be 16-bit in this case), can make a noticeable difference. So directly inlining the code from that 'memcpy16' macro is better in this case."

By the way, I tried searching for asm-optimized versions of memcpy for ARM platforms. I had not done that before; my mistake was assuming that the glibc memcpy/memset implementations were already optimized as much as possible. It turns out there is a fast memcpy implementation in uclibc, and there are also many other implementations around. Seems like I tried to reinvent the wheel. Too bad if it turns out that spending the whole weekend on this was a useless waste of time :( Well, at least I did not try to steal someone else's code and 'copyright' it. As I said before, my observations show that it is better to align writes on 16-byte boundaries, at least on the Nokia 770.
The code I have posted is proof-of-concept code, and it shows that it is faster than the default memset/memcpy on the device. I'm going to compare my code with the uclibc implementation; if uclibc is in fact faster or has the same performance, I'll have to apologize for causing this mess and go away ashamed.

In any case, the performance of memcpy/memset on the default Nokia 770 image is far from optimal. And considering that the device is certainly not overpowered, improvements in this area would probably help. I just checked the GTK sources; memcpy is used in a lot of places, though I don't know how much it affects performance. Is this something worth investigating by Nokia developers?
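The 16bpp blitting loop described in the quote above has this general shape; a minimal sketch in which the names (`blit16`), the pitch-in-pixels convention, and the use of plain memcpy as a stand-in for the inlined memcpy16 macro are all assumptions for illustration:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Copy a width x height block of 16-bit pixels, one line at a time.
 * dst_pitch and src_pitch are row strides in pixels, not bytes.
 * In the Allegro patch the inner copy is the inlined memcpy16
 * macro, avoiding per-line call and alignment-check overhead;
 * memcpy stands in for it here. */
static void blit16(uint16_t *dst, size_t dst_pitch,
                   const uint16_t *src, size_t src_pitch,
                   int width, int height)
{
    for (int y = 0; y < height; y++) {
        /* one line of 16-bit pixels per iteration */
        memcpy(dst, src, (size_t)width * sizeof(uint16_t));
        dst += dst_pitch;
        src += src_pitch;
    }
}
```

For small bitmaps the per-line function-call overhead dominates, which is the case where inlining the copy pays off most.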