Re: [maemo-developers] Optimized memory copying functions for Nokia 770

Tomas Frydrych Tue, 14 Mar 2006 01:23:17 -0800

There seems to be no source for the functions in the tarball.

Tomas


Siarhei Siamashka wrote:
> Hello All,
> 
> Here are the optimized memory copying functions for Nokia 770 (memset is
> more than twice as fast, memcpy improves by about 10-40% depending on
> the relative alignment of the data blocks).
> 
> http://ufo2000.sourceforge.net/files/fastmem-arm-20060312.tar.gz
> 
> These functions were created as an attempt to experiment with getting
> maximum memory bandwidth on Nokia 770 (powered by TI OMAP1710) and also
> to learn ARM assembler in the process. Getting maximum memory bandwidth
> utilization is needed for 2D games and probably other applications which
> need to process a lot of multimedia data. I'm particularly interested in
> getting the best performance for the Allegro game programming library
> (http://alleg.sourceforge.net) on Nokia 770 and that was the motivation
> for writing this code.
> 
> After a few experiments with reading and writing memory using different
> data sizes for each memory access, it appears that using bigger chunks
> is much more important for writing than for reading: writing 16 bits
> per memory access is usually twice as fast as writing 8 bits, and
> 32-bit access is again twice as fast as 16-bit access. There is no such
> significant performance degradation when reading in smaller chunks, so
> optimizing reads seems to be less important. After some further
> half-empirical experiments with writing to memory, it seems that the
> best memory bandwidth is achieved by using 16-byte burst writes
> aligned on a 16-byte boundary with the STM instruction. This seems to
> provide at least twice the memory bandwidth utilization of the
> standard 'memset' function on Nokia 770. With such fantastic results,
> I decided to try making some optimized functions that can serve as
> replacements for the standard memset/memcpy functions. The aligned
> 16-byte write with the STM instruction is the core of all these
> functions; the rest of the code deals with leading/trailing unaligned
> data chunks.
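
The structure described above (unaligned head, aligned 16-byte body,
unaligned tail) might be sketched in portable C roughly as follows. This
is only an illustration under assumptions, not the code from the tarball:
the name sketch_memset8 is hypothetical, and the 16-byte memcpy merely
stands in for the STM burst store that the assembly version would use.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical illustration; not a function from the tarball. */
void *sketch_memset8(void *dst, int value, size_t n)
{
    unsigned char *p = dst;
    unsigned char b = (unsigned char)value;
    unsigned char block[16];
    size_t i;

    /* Prepare one 16-byte block filled with the byte value. */
    for (i = 0; i < sizeof block; i++)
        block[i] = b;

    /* Head: single bytes until p sits on a 16-byte boundary. */
    while (n > 0 && ((uintptr_t)p & 15u) != 0) {
        *p++ = b;
        n--;
    }

    /* Body: aligned 16-byte writes (the STM step in assembly). */
    while (n >= 16) {
        memcpy(p, block, 16);
        p += 16;
        n -= 16;
    }

    /* Tail: whatever is left, fewer than 16 bytes. */
    while (n > 0) {
        *p++ = b;
        n--;
    }
    return dst;
}
```

Whether a C compiler turns the body loop into a single burst store is up
to the compiler, which is presumably why the original is in assembly.
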
> 
> It implements the following functions (see more detailed comments in the
> code):
> memset8, memset16, memset32 - replacements for memset, optimized
>                               for different alignment
> memcpy16, memcpy32          - replacements for memcpy, optimized
>                               for different alignment
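
One plausible reading of the wider variants is that they fill with
16-bit or 32-bit elements and can assume the matching alignment, which
is handy for 16-bit pixel buffers. A rough C sketch of a 16-bit fill
under that assumption is below. The name sketch_memset16 and its
signature are hypothetical; the real prototypes are whatever the
comments in the source say.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name and signature, assumed for illustration only.
 * Fills `count` 16-bit elements starting at dst with `value`. */
void sketch_memset16(uint16_t *dst, uint16_t value, size_t count)
{
    uint16_t block[8];   /* 8 x 16 bits = one 16-byte block */
    size_t i;

    for (i = 0; i < 8; i++)
        block[i] = value;

    /* Head: 16-bit writes until dst is 16-byte aligned. */
    while (count > 0 && ((uintptr_t)dst & 15u) != 0) {
        *dst++ = value;
        count--;
    }

    /* Body: aligned 16-byte writes, 8 elements at a time. */
    while (count >= 8) {
        memcpy(dst, block, sizeof block);
        dst += 8;
        count -= 8;
    }

    /* Tail: remaining elements, fewer than 8. */
    while (count > 0) {
        *dst++ = value;
        count--;
    }
}
```

Because dst is already 2-byte aligned, the head never needs single-byte
writes, which is the saving such a variant could offer over a byte fill.
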
> 
> A testing framework is included to verify that this code produces
> valid results and is also really fast. To run the tests, compile this
> file as C source with the FASTMEM_ARM_TEST_FRAMEWORK macro defined.
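
A correctness test of this kind could look something like the sketch
below: run the candidate against the standard memset for every starting
offset and a range of sizes, with guard bytes on both sides to catch
writes outside the requested range. This is an assumption about what
such a test might do, not the framework from the tarball; the names
memset_fn and sketch_check_memset are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical type for any memset-compatible function. */
typedef void *(*memset_fn)(void *, int, size_t);

/* Returns 1 if fn behaves like memset for offsets 0..15 and sizes
 * 0..64 and leaves all surrounding bytes untouched, 0 otherwise. */
int sketch_check_memset(memset_fn fn)
{
    enum { MAX_OFF = 16, MAX_LEN = 64, GUARD = 16,
           TOTAL = GUARD + MAX_OFF + MAX_LEN + GUARD };
    unsigned char expect[TOTAL], actual[TOTAL];
    size_t off, len;

    for (off = 0; off < MAX_OFF; off++) {
        for (len = 0; len <= MAX_LEN; len++) {
            /* Same background pattern in both buffers. */
            memset(expect, 0xAA, sizeof expect);
            memset(actual, 0xAA, sizeof actual);

            /* Reference fill versus candidate fill. */
            memset(expect + GUARD + off, 0x5C, len);
            fn(actual + GUARD + off, 0x5C, len);

            /* Any difference, including in the guards, is a failure. */
            if (memcmp(expect, actual, sizeof expect) != 0)
                return 0;
        }
    }
    return 1;
}
```

Sweeping 16 consecutive offsets covers every alignment relative to a
16-byte boundary, whatever the alignment of the buffer itself.
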
> 
> Requirements for running this code: a little-endian ARMv4-compatible CPU
> 
> Results from my Nokia 770 are the following:
> 
>    --- running correctness tests ---
>    all the correctness tests passed
>    --- running performance tests (memory bandwidth benchmark) ---:
>    memset() memory bandwidth: 121.22MB/s
>    memset8() memory bandwidth: 275.94MB/s
>    memcpy() memory bandwidth (perfectly aligned): 104.86MB/s
>    memcpy16() memory bandwidth (perfectly aligned): 113.98MB/s
>    memcpy() memory bandwidth (16-bit aligned): 70.37MB/s
>    memcpy16() memory bandwidth (16-bit aligned): 101.31MB/s
>    --- testing performance for random blocks (size 0-15 bytes) ---
>    memset time: 0.410
>    memset8 time: 0.260
>    --- testing performance for random blocks (size 0-511 bytes) ---
>    memset time: 2.360
>    memset8 time: 1.140
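
For anyone wanting comparable numbers on another device, a bandwidth
figure of this sort can be obtained roughly as in the sketch below: fill
a large buffer repeatedly and divide the bytes written by the elapsed
time. This is a guess at the method, not the benchmark from the tarball.
The name sketch_memset_bandwidth is hypothetical, 1 MB is assumed to be
1024*1024 bytes, and clock() reports CPU time rather than wall time.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical type for any memset-compatible function. */
typedef void *(*memset_fn)(void *, int, size_t);

/* Returns the measured fill rate in MB/s (1 MB = 1024*1024 bytes,
 * an assumption), or 0.0 if the run was too short to time.
 * size must be greater than zero. */
double sketch_memset_bandwidth(memset_fn fn, unsigned char *buf,
                               size_t size, int passes)
{
    volatile unsigned char sink = 0;
    clock_t start, end;
    double seconds;
    int i;

    start = clock();
    for (i = 0; i < passes; i++) {
        fn(buf, i & 0xFF, size);
        /* Read back one byte so the fill cannot be optimized away. */
        sink ^= buf[size - 1];
    }
    end = clock();
    (void)sink;

    seconds = (double)(end - start) / CLOCKS_PER_SEC;
    if (seconds <= 0.0)
        return 0.0;
    return ((double)size * passes) / (1024.0 * 1024.0) / seconds;
}
```

The buffer should be much larger than the data cache, otherwise the
result says more about the cache than about memory bandwidth.
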
> 
> TODO:
>    1. implement memcpy8 function (direct replacement for memcpy)
>    2. provide big endian support (currently the code is little endian)
>    3. investigate possibilities for getting the best performance
>       on short buffer sizes
>    4. better testing in real world and on different ARM based devices
> 
> I'm especially interested in getting feedback from running this code on
> different devices. It is quite possible that these functions are only
> optimal for OMAP1710, but do not provide any benefit on other devices.
> 
> Currently this code improves Allegro game programming library
> performance quite a lot (in my not yet finished patch), but it might
> also be used for SDL. It would be interesting to see whether these
> functions can improve GTK performance as well. In that case we could
> have a nice improvement in user interface responsiveness.
> 
> As soon as a complete replacement for memcpy (memcpy8) is done, it can
> probably also be used as a patch for glibc to improve the performance
> of all programs automagically.
> 
> Waiting for feedback, suggestions and test results on other ARM devices
> (not only Nokia 770).
> 
> 
> _______________________________________________
> maemo-developers mailing list
> maemo-developers@maemo.org
> https://maemo.org/mailman/listinfo/maemo-developers
