On Fri, 2007-02-16 at 18:43 +0200, Peter wrote:
> On Fri, 16 Feb 2007, Gilboa Davara wrote:
>
> > On Thu, 2007-02-15 at 19:23 +0200, Peter wrote:
> >> On Thu, 15 Feb 2007, Gilboa Davara wrote:
> >>
> >>> Small example.
> >>> About two years ago I got bored, and decided to implement binary trees in
> >>> (x86) assembly.
> >>> The end result was between 2-10 times faster than GCC (-O2/-O3)
> >>> generated code. (Depending on the size of the tree.)
> >>> The main reason being the lack of a 3 way comparison in C.
> >>> (above/below/equal)
> >>
> >> And assembly lacks it too.
> >
> > ????????!!!?
> >
> > cmp %ebx, %eax
> > jb label_below
> > ja label_above
> > <equal code>
>
> Each jump is equivalent to a cache line flush.
(Before I begin, my code targets x86_64 [AMD Opteron, Xeon 5xxx] and
i386 [P4 Xeon] - nothing else.)
- I'm talking about short (+127/-128 byte) jumps.
- As far as I remember:
  AMD Opteron's L1I cache line size is 64 bytes.
  P4/Xeon's is 128 bytes.
  Core2's is 64 bytes.
- Now, both the AMD and Core2 use aggressive pre-fetching that will
usually pull in multiple adjacent instruction cache lines.
In short, it is very likely that as long as you keep your in-line
assembly code -small-, your code will fit nicely inside the L1
I-cache.
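(For what it's worth, the three-way result can also be written in
portable C as (a > b) - (a < b); recent GCC turns that into one cmp plus
flag reads. A minimal sketch - the function and struct names below are
mine, not from the code I benchmarked:)

```c
struct node { int key; struct node *left, *right; };

/* Portable three-way compare: returns -1/0/+1 with no branch in the
   source; GCC typically emits a single cmp plus two setcc ops. */
static inline int cmp3(int a, int b) {
    return (a > b) - (a < b);
}

/* Illustrative tree-search step: one comparison drives all three arms. */
static struct node *tree_find(struct node *n, int key) {
    while (n) {
        int c = cmp3(key, n->key);
        if (c == 0)
            return n;           /* equal */
        n = (c < 0) ? n->left : n->right;
    }
    return 0;                   /* not found */
}
```

Whether that matches hand-tuned asm on a real tree is, of course,
exactly what my benchmark was measuring.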
>
> >> But in C you can get creative with compound
> >> statements:
> >> int x,y;
> >> register int t;
> >>
> >> (t = x - y) && (((t < 0) && below()) || above()) || equal();
> >
> > .. Which will only work if below/above/equal are made of short
> > statements, which is a very problematic prerequisite.
>
> inline int below(your,optional,arguments);
>
> will work fine. So will:
>
> #define below(a,b,c) (z=a+b+c)
Been there, done that.
As I said, under both Windows and Linux the asm code yielded (much)
better performance.
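(Another problem with that compound form: it silently depends on the
functions' return values. A small sketch - all names are mine - showing
that if below() happens to return 0, above() runs as well, while the
ternary form has no such trap:)

```c
static int calls;                                  /* records which branch ran */
static int below(void) { calls |= 1; return 0; }   /* note: returns 0 */
static int above(void) { calls |= 2; return 1; }
static int equal(void) { calls |= 4; return 1; }

/* The boolean-chain dispatch, parenthesized as C parses it. */
static void chain(int x, int y) {
    int t;
    (void)(((t = x - y) && (((t < 0) && below()) || above())) || equal());
}

/* The same dispatch as a conditional expression - immune to the
   return values of below/above/equal. */
static void ternary(int x, int y) {
    int t = x - y;
    (void)(t ? (t < 0 ? below() : above()) : equal());
}
```

With x < y, chain() calls below() *and then* above() (because below()
returned 0 and the || fell through); ternary() calls below() only.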
>
> > In my case I needed to store some additional information in each leaf -
> > making each step a compound statement by itself. (which in-turn,
> > rendered your compound less effective)
>
> Don't be so sure about that. A compound statement can be optimized very
> well.
.. Which will make it as readable as the asm code - or far worse...
>
> >> which wastes 1 register variable. Still, there is no guarantee that this
> >> generates faster code than an optimizing compiler (and gcc is not known
> >> among the best optimizing compilers). Rewriting above using binary
> >> operators and masks may be even faster.
> >
> > The same code was also tested under Visual Studio 2K3 and showed the
> > same results. The assembly code was considerably faster than the
> > VS-generated binary.
>
> Assembly is not portable and it is a *** to debug.
No argument there.
(Though if you make your compound code complex enough, it'll make the
asm code look far more debuggable by comparison.)
> Yes, you can make it run faster. It's fun for the 1st few days, after that
> you need to change
> something or port it to a NSLU2 and things stop being nice very fast.
> Especially if someone else needs to compile your code.
As I said above, I usually use -small- (<20 lines) blocks of in-line
assembly code.
Other than that, I'm fanatical about documentation. (Mostly because I
have a very small brain and it takes me 5 minutes to forget why I
trashed rax)
>
> >> Atomic code execution should not require assembly because segment
> >> locking can be done using C (even if that C is inline assembly for
> >> some applications).
> >
> > A. I -was- talking about in-line assembly.
> > B. How can I implement "lock btX/inc/dec/sub/add" in pure C?
> > (Let alone using the resulting flags. [setXX])
> >
> > BTW, another valid excuse for using assembly (at least in the
> > register-barren world known as i386) is the ability to trash the base
> > pointer. (Every register counts.)
>
> Again, why are you assuming x86 assembly is the target ? It could be ARM
> or MIPS or PPC.
If I'm writing multi-platform code, I'll keep in-line assembly to the
minimum. (or none)
Contrary to popular belief, I'm not that mad ;)
> Optimizing x86 makes sense for extreme driver writing,
> kernel code and such.
But it pays my mortgage ;)
> Otherwise it makes little sense on a platform that
> doubles its MIPS speed every 2 years. lock exists only on x86 and it
> exists because x86 is a brainf***d architecture that allows 'long
> instructions' (once upon a time known as microcode) to be interrupted in
> the middle. I assure you that this is a very unique feature among CPUs.
True.
But my target is i386/x86_64. (With the occasional SPARC/POWER from time
to time.)
> Think about it, it's the only popular CPU that can be proud of being
> theoretically able to throw an EINTR *inside* a machine code
> instruction. Modifying BP + small mistake = crash. Oops.
Naaah.
Stack frame? who needs it ;)
Seriously, the IA32 is brain-dead - no arguments there.
But this brain-dead architecture managed to capture most of the computer
market - and unlike Windows, it does have technical merits. (E.g. IA64
vs. IA32).
My favorite architecture was the Digital Alpha, but it's water under the
bridge now...
>
> Peter
- Gilboa