Jaroslav Kysela wrote:
> 
> On Thu, 20 Feb 2003, Abramo Bagnara wrote:
> 
> > Jaroslav Kysela wrote:
> > >
> > > On Wed, 19 Feb 2003, Abramo Bagnara wrote:
> > >
> > > > The results are amazing and I'd say Jaroslav has done some mistakes in
> > > > his handmade asm.
> > >
> > > I don't think so. It seems that my brain still remembers assembler ;-)
> > > You passed wrong values to my code so it did unaligned accesses.
> > >
> > > Fixes to make things same:
> >
> > I've done the needed changes in my version of sum.c to get correct
> > results from asm version, but I'm still unable to get from it good
> > performance numbers.
> >
> > I'm puzzled...
> >
> > $ ./sum 2048 8 32768
> > CPU clock: 1460474444.671998
> > mix_areas0: 90773 0.033459%
> > mix_areas1: 141173 0.052036% (1103)
> > mix_areas2: 870134 0.320731% (0)
> > mix_areas3: 343792 0.126722% (0)
> 
> 1) my asm code used lock prefix so there are huge differences in code for
>    UP and MP on i386

Indeed, this made the difference.

> 2) we need to clear dst and sum buffers to work with same values for all
>    routines

This was present in sum.c

> 3) we need to clear the CPU caches

This has negligible impact in sum.c.

> I've commited updated alsa-lib/test/code.c which solves all these troubles
> and I've added next optimizations to my asm routine and results are (not
> impressive, but I'm better than GCC, especially using MMX
> saturation instruction):

Now I'm able to get the same results you see.

However, I think we need to draw some conclusions from this data.

I'll leave the MMX optimizations aside because I want to compare apples
with apples.

The distributed saturation (even without the check/repeat part needed
for concurrency correctness) costs more than 4 times the ticks needed
for a saturate-once approach (which is fully correct wrt concurrency)
for the case 2048 8 32768.

CPU clock: 1460477150.884593
mix_areas0: 86747 0.031975%
mix_areas1: 259424 0.095623% (0)
mix_areas1_mmx: 253894 0.093585% (0)
mix_areas2: 132321 0.048773% (365)
mix_areas3: 332411 0.122526% (0)
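
To make the two approaches concrete, here is a minimal sketch in plain C.
It is not the code from sum.c or alsa-lib/test/code.c; the function names,
the 16-bit sample format and the 32-bit sum buffer are assumptions made for
illustration, and the distributed variant deliberately omits the
check/repeat step, so it is not safe for concurrent writers.

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp a 32-bit sum to the signed 16-bit sample range. */
static int16_t saturate16(int32_t v)
{
    if (v > INT16_MAX)
        return INT16_MAX;
    if (v < INT16_MIN)
        return INT16_MIN;
    return (int16_t)v;
}

/* Saturate once (hypothetical name): a single mixer accumulates every
 * source into a 32-bit value and clamps exactly once per sample. */
static void mix_saturate_once(int16_t *dst, const int16_t *const *src,
                              size_t nsrc, size_t nsamples)
{
    for (size_t i = 0; i < nsamples; i++) {
        int32_t acc = 0;
        for (size_t s = 0; s < nsrc; s++)
            acc += src[s][i];
        dst[i] = saturate16(acc);
    }
}

/* Distributed saturation (hypothetical name): every writer adds its own
 * samples to a shared sum buffer and re-saturates dst on each call.
 * The check/repeat step needed for concurrent writers is omitted. */
static void mix_distributed(int16_t *dst, int32_t *sum,
                            const int16_t *src, size_t nsamples)
{
    for (size_t i = 0; i < nsamples; i++) {
        sum[i] += src[i];
        dst[i] = saturate16(sum[i]);
    }
}
```

Run single-threaded, both give the same output; the difference is that
the distributed variant clamps and stores once per writer, while the
saturate-once variant does it once per sample.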

The server-based approach has the added cost of an extra context switch
every period (about 1500 cycles on my machine, for example), but this is
fully amortized by such a huge difference.

What's your opinion?

-- 
Abramo Bagnara                       mailto:[EMAIL PROTECTED]

Opera Unica                          Phone: +39.546.656023
Via Emilia Interna, 140
48014 Castel Bolognese (RA) - Italy


_______________________________________________
Alsa-devel mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/alsa-devel