vince, here is the test on davinci 6446.

r...@davinci:/opt/dev/src/slib/memtest/src# uname -a
Linux davinci 2.6.10_mvl401 #194 Mon Jan 12 14:01:52 CST 2009 armv5tejl GNU/Linux
libc() memcpy bandwidth (align=0, size=100): 88.24MB/s
armasm() memcpy bandwidth (align=0, size=100): 90.91MB/s
armasm2() memcpy bandwidth (align=0, size=100): 96.00MB/s

libc() memcpy bandwidth (align=1, size=100): 69.77MB/s
armasm() memcpy bandwidth (align=1, size=100): 34.38MB/s
armasm2() memcpy bandwidth (align=1, size=100): 73.17MB/s

libc() memcpy bandwidth (align=2, size=100): 69.77MB/s
armasm() memcpy bandwidth (align=2, size=100): 34.38MB/s
armasm2() memcpy bandwidth (align=2, size=100): 71.86MB/s

libc() memcpy bandwidth (align=3, size=100): 70.18MB/s
armasm() memcpy bandwidth (align=3, size=100): 34.38MB/s
armasm2() memcpy bandwidth (align=3, size=100): 72.29MB/s


> John,
>
> I've changed my benchmark to invalidate the cache before every test. My
> results are the same. Attached is my test program.
>
>
> # ./memtest 4096
> libc() memcpy bandwidth (align=0, size=4096): 16.99MB/s
> armasm() memcpy bandwidth (align=0, size=4096): 40.96MB/s
> armasm2() memcpy bandwidth (align=0, size=4096): 40.96MB/s
>
> libc() memcpy bandwidth (align=1, size=4096): 16.99MB/s
> armasm() memcpy bandwidth (align=1, size=4096): 20.29MB/s
> armasm2() memcpy bandwidth (align=1, size=4096): 37.49MB/s
>
> libc() memcpy bandwidth (align=2, size=4096): 16.99MB/s
> armasm() memcpy bandwidth (align=2, size=4096): 20.29MB/s
> armasm2() memcpy bandwidth (align=2, size=4096): 37.62MB/s
>
> libc() memcpy bandwidth (align=3, size=4096): 16.99MB/s
> armasm() memcpy bandwidth (align=3, size=4096): 20.29MB/s
> armasm2() memcpy bandwidth (align=3, size=4096): 37.49MB/s
>
> Regards,
>
> Vince
>
>
> On Tue, 2009-03-24 at 23:21 +1000, John Williams wrote:
> > Hi,
> >
> > I've been watching this thread with half-interest, and thought I'd
> > throw in an idea I've had.
> >
> > Nowhere in this thread does the word "cache" appear - in your
> > benchmarks, are you invalidating the cache between benchmark runs? If
> > the cache is cold on the first run (which is always the "slower" glibc
> > version) and hot on subsequent runs, it will be distorting your
> > results.
> >
> > Maybe you've accounted for this, but I don't think it's been explicitly
> > mentioned so far. It could explain why others are not seeing the same
> > dramatic speedups that you are reporting.
> >
> > Cheers,
> >
> > John
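For anyone who wants to reproduce these numbers, below is a minimal sketch of the kind of harness being discussed. It is NOT the attached memtest program - the buffer sizes, iteration count, cache-line stride, and the eviction trick are all assumptions - but it shows the two points raised in this exchange: offsetting the buffers by 0-3 bytes to create misalignment, and evicting the data cache before each test so every routine starts cold.

/* Minimal benchmark sketch (illustrative only -- not the attached memtest
 * program; sizes, iteration counts, and the eviction trick are assumptions). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define SCRATCH_SIZE (1024 * 1024)        /* assumed larger than the data cache */

static volatile char scratch[SCRATCH_SIZE];

/* Evict src/dst from the data cache by walking a large scratch buffer. */
static void evict_cache(void)
{
    int i;
    for (i = 0; i < SCRATCH_SIZE; i += 32)    /* 32-byte cache line assumed */
        scratch[i]++;
}

/* Time one copy routine and return its bandwidth in MB/s. */
static double bench(void *(*copy)(void *, const void *, size_t),
                    char *dst, const char *src, size_t size, int iters)
{
    struct timeval t0, t1;
    double sec;
    int i;

    evict_cache();                            /* cold cache before every test */
    gettimeofday(&t0, NULL);
    for (i = 0; i < iters; i++)
        copy(dst, src, size);
    gettimeofday(&t1, NULL);

    sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    return (double)size * iters / sec / (1024.0 * 1024.0);
}

int main(int argc, char **argv)
{
    size_t size = argc > 1 ? strtoul(argv[1], NULL, 0) : 4096;
    char  *buf1 = malloc(size + 8);
    char  *buf2 = malloc(size + 8);
    int    align;

    memset(buf2, 0x5a, size + 8);

    for (align = 0; align < 4; align++)       /* 0..3 byte misalignment */
        printf("libc() memcpy bandwidth (align=%d, size=%lu): %.2fMB/s\n",
               align, (unsigned long)size,
               bench(memcpy, buf1 + align, buf2 + align, size, 1000));

    free(buf1);
    free(buf2);
    return 0;
}

Only the libc case is shown; the armasm/armasm2 routines would be measured the same way by passing their function pointers to bench().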
> > 2009/3/24 vince <vi...@bluush.com>:
> > > Niels,
> > >
> > > After a closer review of the code, I found that unaligned copies were a
> > > lot slower than aligned ones. I've created another version of the routine
> > > that takes care of that. Attached to this email, you will find a simple
> > > program that I used to test this code. It tests both aligned and unaligned
> > > (src & dst) combinations for the three implementations (libc memcpy,
> > > rev1 armasm memcpy, and rev2 armasm memcpy).
> > >
> > > Here is the output of the program running on an ARM9 AT91RM9200 using
> > > uClibc-0.9.30 and gcc-4.2.4 (armasm is rev1, armasm2 is rev2):
> > >
> > > # ./memtest 500000
> > > 32bit src/dst Aligned test:
> > > Testing libc (0x4005a008 <==> 0x40243008 : 500000): 2.996949 sec
> > > Testing armasm (0x4005a008 <==> 0x40243008 : 500000): 1.331787 sec
> > > Testing armasm2 (0x4005a008 <==> 0x40243008 : 500000): 1.358246 sec
> > > The faster routine is armasm
> > >
> > > 16bit src/dst Aligned test:
> > > Testing libc (0x4005a00a <==> 0x4024300a : 500000): 2.983215 sec
> > > Testing armasm (0x4005a00a <==> 0x4024300a : 500000): 1.332214 sec
> > > Testing armasm2 (0x4005a00a <==> 0x4024300a : 500000): 1.358978 sec
> > > The faster routine is armasm
> > >
> > > 8bit src/dst Aligned test:
> > > Testing libc (0x4005a009 <==> 0x40243009 : 500000): 2.982209 sec
> > > Testing armasm (0x4005a009 <==> 0x40243009 : 500000): 1.331054 sec
> > > Testing armasm2 (0x4005a009 <==> 0x40243009 : 500000): 1.359162 sec
> > > The faster routine is armasm
> > >
> > > 16bit src Aligned test:
> > > Testing libc (0x4005a00a <==> 0x40243008 : 500000): 2.983734 sec
> > > Testing armasm (0x4005a00a <==> 0x40243008 : 500000): 2.571228 sec
> > > Testing armasm2 (0x4005a00a <==> 0x40243008 : 500000): 1.419556 sec
> > > The faster routine is armasm2
> > >
> > > 8bit src Aligned test:
> > > Testing libc (0x4005a009 <==> 0x40243008 : 500000): 2.984101 sec
> > > Testing armasm (0x4005a009 <==> 0x40243008 : 500000): 2.570343 sec
> > > Testing armasm2 (0x4005a009 <==> 0x40243008 : 500000): 1.419525 sec
> > > The faster routine is armasm2
> > >
> > > 16bit dst Aligned test:
> > > Testing libc (0x4005a008 <==> 0x4024300a : 500000): 2.983948 sec
> > > Testing armasm (0x4005a008 <==> 0x4024300a : 500000): 2.571563 sec
> > > Testing armasm2 (0x4005a008 <==> 0x4024300a : 500000): 1.418671 sec
> > > The faster routine is armasm2
> > >
> > > 8bit dst Aligned test:
> > > Testing libc (0x4005a008 <==> 0x40243009 : 500000): 2.983521 sec
> > > Testing armasm (0x4005a008 <==> 0x40243009 : 500000): 2.571258 sec
> > > Testing armasm2 (0x4005a008 <==> 0x40243009 : 500000): 1.418762 sec
> > > The faster routine is armasm2
> > >
> > >
> > > As you can see, rev2 works a lot better with unaligned buffers. I will
> > > update the patch to DirectFB to include this new version of the routine.
> > >
> > > As for big-endian, this version will ONLY work with little-endian, so a
> > > config directive will need to be set for the build to work on those
> > > targets. I will include that in the patch.
> > >
> > > For now, it would be great if I could get some metrics from people to
> > > double-check my results.
> > >
> > > Regards,
> > >
> > > Vince
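The rev2 numbers above are easier to interpret with the usual mismatched-alignment trick in mind. Vince's routine is ARM assembly and is not quoted here, so the following C sketch is only an illustration of the general technique, not the armasm2 code itself: align the destination with a few byte stores, then, when the source is still misaligned, keep every load word-aligned and splice the bytes together with shifts. The shift direction assumes little-endian byte order, which is exactly why the patch needs the endianness guard Vince mentions.

#include <stddef.h>
#include <stdint.h>

/* Illustrative only -- NOT Vince's armasm2 routine.  Shows the common
 * little-endian trick for copying between buffers whose alignments differ:
 * all loads and stores stay word-aligned and bytes are spliced with shifts.
 * (The assembly version would typically burst 16 or 32 bytes at a time with
 * ldm/stm instead of single word accesses.) */
void *memcpy_le_sketch(void *dst, const void *src, size_t n)
{
    uint8_t       *d = dst;
    const uint8_t *s = src;

    /* 1. Byte-copy until the destination is word aligned. */
    while (n && ((uintptr_t)d & 3)) {
        *d++ = *s++;
        n--;
    }

    if (((uintptr_t)s & 3) == 0) {
        /* 2a. Source is aligned too: plain word copy. */
        while (n >= 4) {
            *(uint32_t *)d = *(const uint32_t *)s;
            d += 4; s += 4; n -= 4;
        }
    } else if (n >= 4) {
        /* 2b. Source is misaligned: load aligned words and shift.
         *     Little-endian only: byte i of a word sits in bits [8*i+7:8*i].
         *     The loads may fetch a couple of bytes past src+n, but never
         *     beyond the aligned word holding the last byte we need, so they
         *     cannot cross into a new page. */
        unsigned        shift = ((uintptr_t)s & 3) * 8;
        const uint32_t *ws    = (const uint32_t *)((uintptr_t)s & ~(uintptr_t)3);
        uint32_t        cur   = *ws++;

        while (n >= 4) {
            uint32_t next = *ws++;
            *(uint32_t *)d = (cur >> shift) | (next << (32 - shift));
            cur = next;
            d += 4; s += 4; n -= 4;   /* keep s in step for the tail copy */
        }
    }

    /* 3. Copy any remaining tail bytes. */
    while (n--)
        *d++ = *s++;

    return dst;
}

This matches the pattern in the figures: when src and dst share the same alignment, both rev1 and rev2 can take the fast word-copy path, but only rev2 avoids falling back to byte copies when the alignments differ.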
> > > On Mon, 2009-03-23 at 16:36 +0100, Niels Roest wrote:
> > >> Hi Vince,
> > >> I'm happy to include the patch; I just have a few unclear points that I
> > >> hope somebody can clear up.
> > >>
> > >> (1) memcpy is speed-tested with (I think) aligned accesses (based on
> > >> D_MALLOC addresses), but I think we'll see a lot of unaligned memcpy's
> > >> too, and that side of the implementation looks kinda weak. Anyone care
> > >> to give some figures for unaligned copies? Have a look at
> > >> direct_find_best_memcpy() in lib/direct/memcpy.c, and fidget a bit with
> > >> buf1 and buf2.
> > >>
> > >> (2) What happens on a big-endian ARM if I just include the patch? I'm
> > >> having trouble finding this dependency in the patch. We will need to fix
> > >> this, or put a show-stopper somewhere for big-endian, so the patch
> > >> doesn't break something.
> > >>
> > >> Greets
> > >> Niels
> > >>
> > >> vince wrote:
> > >> > Hello,
> > >> >
> > >> > I've been working on improving the performance of DirectFB 1.3.0 on
> > >> > the ARM platform. The attached patch replaces the default libc memcpy
> > >> > with a faster implementation. I've tested this patch on an AT91RM9200,
> > >> > but it should work on other ARM targets.
> > >> >
> > >> > Hope this will be useful to others.
> > >> >
> > >> > Regards,
> > >> >
> > >> > Vince

--
Deng XueFeng <den...@gmail.com>
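On Niels' question (2) and Vince's note that a config directive will be needed: the patch itself is not quoted in this thread, but one common way to keep a little-endian-only assembly routine from breaking big-endian builds is to key off GCC's ARM endianness macros (__ARMEL__ / __ARMEB__) and fall back to libc memcpy everywhere else. The USE_ARM_MEMCPY symbol, arm_memcpy prototype, and dfb_memcpy wrapper below are made up for illustration; they are not from the actual patch or from DirectFB's memcpy selection code.

#include <string.h>

/* Hypothetical build guard -- not taken from the actual patch.  GCC defines
 * __ARMEL__ on little-endian ARM and __ARMEB__ on big-endian ARM. */
#if defined(__arm__) && defined(__ARMEL__)
#  define USE_ARM_MEMCPY 1        /* made-up config symbol */
#else
#  define USE_ARM_MEMCPY 0        /* big-endian or non-ARM: use libc memcpy */
#endif

void *arm_memcpy(void *dst, const void *src, size_t n);   /* the asm routine */

static inline void *dfb_memcpy(void *dst, const void *src, size_t n)
{
#if USE_ARM_MEMCPY
    return arm_memcpy(dst, src, n);
#else
    return memcpy(dst, src, n);
#endif
}

In DirectFB itself, the equivalent check would presumably just gate whether the ARM routine is registered as a candidate for direct_find_best_memcpy() at all, so big-endian builds never see it.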