vince, here is the test on davinci 6446.

r...@davinci:/opt/dev/src/slib/memtest/src# uname -a
Linux davinci 2.6.10_mvl401 #194 Mon Jan 12 14:01:52 CST 2009 armv5tejl GNU/Linux
libc() memcpy bandwidth (align=0, size=100): 88.24MB/s
armasm() memcpy bandwidth (align=0, size=100): 90.91MB/s
armasm2() memcpy bandwidth (align=0, size=100): 96.00MB/s

libc() memcpy bandwidth (align=1, size=100): 69.77MB/s
armasm() memcpy bandwidth (align=1, size=100): 34.38MB/s
armasm2() memcpy bandwidth (align=1, size=100): 73.17MB/s

libc() memcpy bandwidth (align=2, size=100): 69.77MB/s
armasm() memcpy bandwidth (align=2, size=100): 34.38MB/s
armasm2() memcpy bandwidth (align=2, size=100): 71.86MB/s

libc() memcpy bandwidth (align=3, size=100): 70.18MB/s
armasm() memcpy bandwidth (align=3, size=100): 34.38MB/s
armasm2() memcpy bandwidth (align=3, size=100): 72.29MB/s


> John,
>
> I've changed my benchmark to invalidate the cache before every test. My
> results are the same. Attached is my test program.
>
>
> # ./memtest 4096
> libc() memcpy bandwidth (align=0, size=4096): 16.99MB/s
> armasm() memcpy bandwidth (align=0, size=4096): 40.96MB/s
> armasm2() memcpy bandwidth (align=0, size=4096): 40.96MB/s
>
> libc() memcpy bandwidth (align=1, size=4096): 16.99MB/s
> armasm() memcpy bandwidth (align=1, size=4096): 20.29MB/s
> armasm2() memcpy bandwidth (align=1, size=4096): 37.49MB/s
>
> libc() memcpy bandwidth (align=2, size=4096): 16.99MB/s
> armasm() memcpy bandwidth (align=2, size=4096): 20.29MB/s
> armasm2() memcpy bandwidth (align=2, size=4096): 37.62MB/s
>
> libc() memcpy bandwidth (align=3, size=4096): 16.99MB/s
> armasm() memcpy bandwidth (align=3, size=4096): 20.29MB/s
> armasm2() memcpy bandwidth (align=3, size=4096): 37.49MB/s
>
> Regards,
>
> Vince
>
>
> On Tue, 2009-03-24 at 23:21 +1000, John Williams wrote:
> > Hi,
> >
> > I've been watching this thread with half-interest, and thought I'd
> > throw in an idea I've had.
> >
> > Nowhere in this thread does the word "cache" appear - in your
> > benchmarks, are you invalidating the cache between benchmark runs? If
> > the cache is cold on the first run (which is always the "slower" glibc
> > version) and hot on subsequent runs, it will be distorting your
> > results.
> >
> > Maybe you've accounted for this, but I don't think it's been explicitly
> > mentioned so far. It could explain why others are not seeing the same
> > dramatic speedups that you are reporting.
> >
> > Cheers,
> >
> > John
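For anyone who wants to reproduce these numbers, below is a minimal sketch of the kind of harness being discussed. It is NOT the attached memtest program - the buffer sizes, iteration count, cache-line stride, and the eviction trick are all assumptions - but it shows the two points raised in this exchange: offsetting the buffers by 0-3 bytes to create misalignment, and evicting the data cache before each test so every routine starts cold.

/* Minimal benchmark sketch (illustrative only -- not the attached memtest
 * program; sizes, iteration counts, and the eviction trick are assumptions). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define SCRATCH_SIZE (1024 * 1024)        /* assumed larger than the data cache */

static volatile char scratch[SCRATCH_SIZE];

/* Evict src/dst from the data cache by walking a large scratch buffer. */
static void evict_cache(void)
{
    int i;
    for (i = 0; i < SCRATCH_SIZE; i += 32)    /* 32-byte cache line assumed */
        scratch[i]++;
}

/* Time one copy routine and return its bandwidth in MB/s. */
static double bench(void *(*copy)(void *, const void *, size_t),
                    char *dst, const char *src, size_t size, int iters)
{
    struct timeval t0, t1;
    double sec;
    int i;

    evict_cache();                            /* cold cache before every test */
    gettimeofday(&t0, NULL);
    for (i = 0; i < iters; i++)
        copy(dst, src, size);
    gettimeofday(&t1, NULL);

    sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    return (double)size * iters / sec / (1024.0 * 1024.0);
}

int main(int argc, char **argv)
{
    size_t size = argc > 1 ? strtoul(argv[1], NULL, 0) : 4096;
    char  *buf1 = malloc(size + 8);
    char  *buf2 = malloc(size + 8);
    int    align;

    memset(buf2, 0x5a, size + 8);

    for (align = 0; align < 4; align++)       /* 0..3 byte misalignment */
        printf("libc() memcpy bandwidth (align=%d, size=%lu): %.2fMB/s\n",
               align, (unsigned long)size,
               bench(memcpy, buf1 + align, buf2 + align, size, 1000));

    free(buf1);
    free(buf2);
    return 0;
}

Only the libc case is shown; the armasm/armasm2 routines would be measured the same way by passing their function pointers to bench().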
> > 2009/3/24 vince <vi...@bluush.com>:
> > > Niels,
> > >
> > > After a closer review of the code, I found that unaligned copies were a
> > > lot slower than aligned ones. I've created another version of the routine
> > > that takes care of that. Attached to this email, you will find a simple
> > > program that I used to test this code. It tests both aligned and unaligned
> > > (src & dst) combinations for the three implementations (libc memcpy,
> > > rev1 armasm memcpy, and rev2 armasm memcpy).
> > >
> > > Here is the output of the program running on an ARM9 AT91RM9200 using
> > > uClibc-0.9.30 and gcc-4.2.4 (armasm is rev1, armasm2 is rev2):
> > >
> > > # ./memtest 500000
> > > 32bit src/dst Aligned test:
> > > Testing libc (0x4005a008 <==> 0x40243008 : 500000): 2.996949 sec
> > > Testing armasm (0x4005a008 <==> 0x40243008 : 500000): 1.331787 sec
> > > Testing armasm2 (0x4005a008 <==> 0x40243008 : 500000): 1.358246 sec
> > > The faster routine is armasm
> > >
> > > 16bit src/dst Aligned test:
> > > Testing libc (0x4005a00a <==> 0x4024300a : 500000): 2.983215 sec
> > > Testing armasm (0x4005a00a <==> 0x4024300a : 500000): 1.332214 sec
> > > Testing armasm2 (0x4005a00a <==> 0x4024300a : 500000): 1.358978 sec
> > > The faster routine is armasm
> > >
> > > 8bit src/dst Aligned test:
> > > Testing libc (0x4005a009 <==> 0x40243009 : 500000): 2.982209 sec
> > > Testing armasm (0x4005a009 <==> 0x40243009 : 500000): 1.331054 sec
> > > Testing armasm2 (0x4005a009 <==> 0x40243009 : 500000): 1.359162 sec
> > > The faster routine is armasm
> > >
> > > 16bit src Aligned test:
> > > Testing libc (0x4005a00a <==> 0x40243008 : 500000): 2.983734 sec
> > > Testing armasm (0x4005a00a <==> 0x40243008 : 500000): 2.571228 sec
> > > Testing armasm2 (0x4005a00a <==> 0x40243008 : 500000): 1.419556 sec
> > > The faster routine is armasm2
> > >
> > > 8bit src Aligned test:
> > > Testing libc (0x4005a009 <==> 0x40243008 : 500000): 2.984101 sec
> > > Testing armasm (0x4005a009 <==> 0x40243008 : 500000): 2.570343 sec
> > > Testing armasm2 (0x4005a009 <==> 0x40243008 : 500000): 1.419525 sec
> > > The faster routine is armasm2
> > >
> > > 16bit dst Aligned test:
> > > Testing libc (0x4005a008 <==> 0x4024300a : 500000): 2.983948 sec
> > > Testing armasm (0x4005a008 <==> 0x4024300a : 500000): 2.571563 sec
> > > Testing armasm2 (0x4005a008 <==> 0x4024300a : 500000): 1.418671 sec
> > > The faster routine is armasm2
> > >
> > > 8bit dst Aligned test:
> > > Testing libc (0x4005a008 <==> 0x40243009 : 500000): 2.983521 sec
> > > Testing armasm (0x4005a008 <==> 0x40243009 : 500000): 2.571258 sec
> > > Testing armasm2 (0x4005a008 <==> 0x40243009 : 500000): 1.418762 sec
> > > The faster routine is armasm2
> > >
> > >
> > > As you can see, rev2 works a lot better with unaligned buffers. I will
> > > update the patch to DirectFB to include this new version of the routine.
> > >
> > > As for big-endian, this version will ONLY work with little-endian, so a
> > > config directive will need to be set for the build to work on those
> > > targets. I will include that in the patch.
> > >
> > > For now, it would be great if I could get some metrics from people to
> > > double-check my results.
> > >
> > > Regards,
> > >
> > > Vince
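The rev2 numbers above are easier to interpret with the usual mismatched-alignment trick in mind. Vince's routine is ARM assembly and is not quoted here, so the following C sketch is only an illustration of the general technique, not the armasm2 code itself: align the destination with a few byte stores, then, when the source is still misaligned, keep every load word-aligned and splice the bytes together with shifts. The shift direction assumes little-endian byte order, which is exactly why the patch needs the endianness guard Vince mentions.

#include <stddef.h>
#include <stdint.h>

/* Illustrative only -- NOT Vince's armasm2 routine.  Shows the common
 * little-endian trick for copying between buffers whose alignments differ:
 * all loads and stores stay word-aligned and bytes are spliced with shifts.
 * (The assembly version would typically burst 16 or 32 bytes at a time with
 * ldm/stm instead of single word accesses.) */
void *memcpy_le_sketch(void *dst, const void *src, size_t n)
{
    uint8_t       *d = dst;
    const uint8_t *s = src;

    /* 1. Byte-copy until the destination is word aligned. */
    while (n && ((uintptr_t)d & 3)) {
        *d++ = *s++;
        n--;
    }

    if (((uintptr_t)s & 3) == 0) {
        /* 2a. Source is aligned too: plain word copy. */
        while (n >= 4) {
            *(uint32_t *)d = *(const uint32_t *)s;
            d += 4; s += 4; n -= 4;
        }
    } else if (n >= 4) {
        /* 2b. Source is misaligned: load aligned words and shift.
         *     Little-endian only: byte i of a word sits in bits [8*i+7:8*i].
         *     The loads may fetch a couple of bytes past src+n, but never
         *     beyond the aligned word holding the last byte we need, so they
         *     cannot cross into a new page. */
        unsigned        shift = ((uintptr_t)s & 3) * 8;
        const uint32_t *ws    = (const uint32_t *)((uintptr_t)s & ~(uintptr_t)3);
        uint32_t        cur   = *ws++;

        while (n >= 4) {
            uint32_t next = *ws++;
            *(uint32_t *)d = (cur >> shift) | (next << (32 - shift));
            cur = next;
            d += 4; s += 4; n -= 4;   /* keep s in step for the tail copy */
        }
    }

    /* 3. Copy any remaining tail bytes. */
    while (n--)
        *d++ = *s++;

    return dst;
}

This matches the pattern in the figures: when src and dst share the same alignment, both rev1 and rev2 can take the fast word-copy path, but only rev2 avoids falling back to byte copies when the alignments differ.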
> > > On Mon, 2009-03-23 at 16:36 +0100, Niels Roest wrote:
> > >> Hi Vince,
> > >> I'm happy to include the patch; I just have a few unclear points that I
> > >> hope somebody can clear up.
> > >>
> > >> (1) memcpy is speed-tested with (I think) aligned accesses (based on
> > >> D_MALLOC addresses), but I think we'll see a lot of unaligned memcpy's
> > >> too, and that side of the implementation looks kinda weak. Anyone care
> > >> to give some figures for unaligned copies? Have a look at
> > >> direct_find_best_memcpy() in lib/direct/memcpy.c, and fidget a bit with
> > >> buf1 and buf2.
> > >>
> > >> (2) What happens on a big-endian ARM if I just include the patch? I'm
> > >> having trouble finding this dependency in the patch. We will need to fix
> > >> this, or put a show-stopper somewhere for big-endian, so the patch
> > >> doesn't break something.
> > >>
> > >> Greets
> > >> Niels
> > >>
> > >> vince wrote:
> > >> > Hello,
> > >> >
> > >> > I've been working on improving the performance of DirectFB 1.3.0 on
> > >> > the ARM platform. The attached patch replaces the default libc memcpy
> > >> > with a faster implementation. I've tested this patch on an AT91RM9200,
> > >> > but it should work on other ARM targets.
> > >> >
> > >> > Hope this will be useful to others.
> > >> >
> > >> > Regards,
> > >> >
> > >> > Vince

--
Deng XueFeng <den...@gmail.com>
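On Niels' question (2) and Vince's note that a config directive will be needed: the patch itself is not quoted in this thread, but one common way to keep a little-endian-only assembly routine from breaking big-endian builds is to key off GCC's ARM endianness macros (__ARMEL__ / __ARMEB__) and fall back to libc memcpy everywhere else. The USE_ARM_MEMCPY symbol, arm_memcpy prototype, and dfb_memcpy wrapper below are made up for illustration; they are not from the actual patch or from DirectFB's memcpy selection code.

#include <string.h>

/* Hypothetical build guard -- not taken from the actual patch.  GCC defines
 * __ARMEL__ on little-endian ARM and __ARMEB__ on big-endian ARM. */
#if defined(__arm__) && defined(__ARMEL__)
#  define USE_ARM_MEMCPY 1        /* made-up config symbol */
#else
#  define USE_ARM_MEMCPY 0        /* big-endian or non-ARM: use libc memcpy */
#endif

void *arm_memcpy(void *dst, const void *src, size_t n);   /* the asm routine */

static inline void *dfb_memcpy(void *dst, const void *src, size_t n)
{
#if USE_ARM_MEMCPY
    return arm_memcpy(dst, src, n);
#else
    return memcpy(dst, src, n);
#endif
}

In DirectFB itself, the equivalent check would presumably just gate whether the ARM routine is registered as a candidate for direct_find_best_memcpy() at all, so big-endian builds never see it.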