On Mon, 2005-06-06 at 13:26 -0400, Michael Jennings wrote:

> > #define BINSWAP(a, b) \
> >    (((long) (a)) ^= ((long) (b)) ^= ((long) (a)) ^= ((long) (b)))
> > 
> > int main( void )
> > {
> >   long a = 3;
> >   long b = 8;
> > 
> >   asm( "noop;noop;noop" );
> >   BINSWAP(a,b);
> >   asm( "noop;noop;noop" );
> > 
> > }

I've been using the "gcc -S foo.c" trick to examine what gcc really does
ever since I made a large and incorrect assumption about the way gcc
handles function parameters in my attempt to port Eterm's MMX stuff to
SSE2.  I was going too fast here: no-operation is actually spelled "nop",
not "noop".  The generated code is the same either way, but the "noop"
version won't assemble or run.  The "nop" is just an easy way to place
markers in the code so you can find the interesting parts quickly.  In
reality each "nop" costs about one clock cycle.
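
A minimal sketch of the marker trick with the spelling fixed, assuming gcc
or clang.  MARK, XORSWAP and swap_between_markers are names made up for
illustration; they are not from libast or Eterm.  Taking the values as
parameters and handing both results back means the values are actually
used, so the function still has something to do at -O[123].

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper names, for illustration only. */

/* Three "nop" instructions; volatile keeps the optimizer from deleting them. */
#define MARK() __asm__ volatile ("nop\n\tnop\n\tnop")

/* XOR swap written as three separate statements, so it is well-defined C. */
#define XORSWAP(a, b) do { (a) ^= (b); (b) ^= (a); (a) ^= (b); } while (0)

/* Swaps a and b between two markers; returns the new a, stores the new b. */
long swap_between_markers(long a, long b, long *out_b)
{
    assert(out_b != NULL);

    MARK();
    XORSWAP(a, b);
    MARK();

    *out_b = b;
    return a;
}
```

Build it with "gcc -S" and search the .s file for the runs of nop.  The
optimizer may still move register-only work across the markers, so with
-O2 there may be little or nothing left between them.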
 
> > yields:
> > 
> >         noop;noop;noop
> >         movq    -16(%rbp), %rdx
> >         leaq    -8(%rbp), %rax 
> >         xorq    %rdx, (%rax)   
> >         movq    -8(%rbp), %rdx 
> >         leaq    -16(%rbp), %rax
> >         xorq    %rdx, (%rax)   
> >         movq    -16(%rbp), %rdx
> >         leaq    -8(%rbp), %rax 
> >         xorq    %rdx, (%rax)   
> >         noop;noop;noop
> > 
> > If you enable -O[123] then you will need to use the values a & b before
> > and after the BINSWAP call or they will be optimized away.  And simply
> > using immediate values like I did will cause the compiler to simply set
> > the different registers that are used to access them in reverse order.
> > In other words the swap gets optimized out.  The above code is without
> > -O and is clearly more complicated (by more than double) than it needs
> > to be.
> 
> Interesting.  You think I should just get rid of it then?
--->  snip  <---
> It's actually slower normally but faster optimized:
> 
> -O0
> Profiling SWAP() macro...3000000 iterations in 0.052468 seconds, 1.7489e-08 
> seconds per iteration
> Profiling BINSWAP() macro...3000000 iterations in 0.067905 seconds, 
> 2.2635e-08 seconds per iteration
> 
> -O2
> Profiling SWAP() macro...3000000 iterations in 0.014328 seconds, 4.776e-09 
> seconds per iteration
> Profiling BINSWAP() macro...3000000 iterations in 0.014183 seconds, 
> 4.7277e-09 seconds per iteration
> 
> (Done with libast's "make perf")

The performance difference at -O2 is about 1% according to your profiling,
so the question is:  Do you, the author and maintainer, think that a 1%
performance difference is worth the maintenance problems that it might
cause?  Another thought is that you are running Linux, so the process may
have been preempted during the run, which would skew the timings.  Caching
effects may also have played a part, and if the 3 million values aren't
actually used they might not really be calculated; I still can't predict,
with any certainty, how gcc's optimizer works.  That's why I look at the
output and why my test program didn't actually perform a swap (I didn't
use the values) with -O[123].  In practice how many times is it called?
To really test it you might want to consider loading a large pixmap,
creating a black pixmap of the same size and swapping the pixels between
the two.  At least then you can be sure that the values are used.  Maybe
even average 100 runs to smooth out the effect of L1/L2/L3 cache hits and
misses.
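
A rough sketch of that kind of test, assuming 32-bit pixels held in two
plain arrays.  The SWAP macro here is a stand-in temporary-variable swap,
and swap_pixels and average_swap_time are hypothetical names, not libast's
"make perf" code.  It uses clock(), which reports CPU time rather than
wall time, so time spent preempted should mostly not be counted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Plain temporary-variable swap, standing in for the SWAP() macro. */
#define SWAP(a, b) do { uint32_t tmp_ = (a); (a) = (b); (b) = tmp_; } while (0)

/* Exchange every pixel of the two equally sized buffers. */
void swap_pixels(uint32_t *img, uint32_t *black, size_t n)
{
    for (size_t i = 0; i < n; i++)
        SWAP(img[i], black[i]);
}

/*
 * Run the swap `runs` times and return the mean CPU time per run in
 * seconds.  A value read back from the buffers after every run goes
 * into *checksum so the results are visibly used.
 */
double average_swap_time(uint32_t *img, uint32_t *black, size_t n,
                         int runs, uint32_t *checksum)
{
    clock_t total = 0;
    uint32_t sum = 0;

    assert(n > 0 && runs > 0 && checksum != NULL);

    for (int r = 0; r < runs; r++) {
        clock_t start = clock();
        swap_pixels(img, black, n);
        total += clock() - start;
        sum += img[0] + black[n - 1];
    }

    *checksum = sum;
    return (double) total / CLOCKS_PER_SEC / runs;
}
```

Print the checksum along with the timing.  For a comparison, the same
loop would be built once per swap macro and run on the same buffers.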

Did you try any of the asm stuff?  That is probably more of a
maintenance problem than the other stuff though.  If you are interested
I did find a way to reduce the instructions by another 25%.
Unfortunately the "xchg" op can't take two memory locations.

#define BINSWAP(a, b) \
  asm volatile( "movq (%%rsi), %%rax  \n\t" \
                "xchg  %%rax, (%%rdi) \n\t" \
                "movq  %%rax, (%%rsi) \n\t" \
                :: "S" (&(a)), "D" (&(b)) : "%rax", "memory" \
              )
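
For comparison, a sketch of the same idea with operand constraints, so
that gcc picks the register and knows exactly which memory is touched.
xchg_swap is a hypothetical helper, not something in libast, and the asm
path assumes x86-64 with gcc or clang; anything else gets a plain C swap.
Keep in mind that "xchg" with a memory operand is implicitly locked, so it
may well be slower than ordinary moves.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of libast or Eterm. */
static inline void xchg_swap(long *a, long *b)
{
    assert(a != NULL && b != NULL);

#if defined(__x86_64__) && defined(__GNUC__)
    long tmp = *a;
    /* tmp lives in a register of gcc's choosing; *b is the memory operand. */
    __asm__ ("xchgq %0, %1" : "+r" (tmp), "+m" (*b));
    *a = tmp;
#else
    /* Portable fallback for other targets and compilers. */
    long tmp = *a;
    *a = *b;
    *b = tmp;
#endif
}
```

Because both operands are declared as read-write, no "memory" clobber or
hard-coded register is needed here.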

I don't want to make any presumptions about your code, but if you are
asking me I would just let gcc handle the optimizations.  Take what I
say with a grain of salt as I have been away from programming for years
and have only recently gotten back into it as the errors in my patches
to Eterm will attest.  By the way there is still an outstanding patch of
mine that fixes one of my synapse misfires.  pixmap.c, line 1588 should
use 0x7c00 instead of 0xfc00.  My apologies for a cut + paste screw-up.
If you look at the patch (in an earlier email) you can see I added
things like (a>>0) for readability but am counting on gcc to optimize it
away for performance.  Raster often does this too (I've been reading his
code recently and working with kwo on e-16.8).
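
To illustrate both points, here is a small sketch.  It assumes the mask in
question belongs to a 15-bit RGB555 pixel, which is my reading of the
0x7c00 value and not something taken from pixmap.c; the macro and function
names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed RGB555 layout: 0RRRRRGG GGGBBBBB.  Macro names are hypothetical. */
#define RED_555(p)   (((p) & 0x7c00) >> 10)
#define GREEN_555(p) (((p) & 0x03e0) >>  5)
#define BLUE_555(p)  (((p) & 0x001f) >>  0)   /* >> 0 only for symmetry */

/* Shows why the mask matters when the unused top bit happens to be set. */
void check_red_mask(void)
{
    uint16_t pixel = 0xffff;

    assert(RED_555(pixel) == 31);            /* 0x7c00: 5 bits, as intended */
    assert(((pixel & 0xfc00) >> 10) == 63);  /* 0xfc00: bit 15 leaks in */
}
```

The ">> 0" keeps the three lines visually parallel, and a shift by zero is
something the compiler should drop, so it shouldn't cost anything.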

Best Regards,
-- 
Tres



_______________________________________________
enlightenment-devel mailing list
enlightenment-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/enlightenment-devel
