On 20/06/12 13:22, Manu wrote:
On 20 June 2012 13:59, Don Clugston <[email protected]> wrote:

    You and I seem to be from different planets. I have almost never
    written an asm function which was suitable for inlining.

    Take a look at std.internal.math.biguintX86.d

    I do not know how to write that code without inline asm.
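The core difficulty is carry propagation. A minimal portable-C sketch of it (a hypothetical helper, not the actual biguintX86.d code): multi-word arithmetic needs the carry out of each limb fed into the next. In x86 asm the loop body is essentially one adc per limb, with the carry kept in the flags register; in portable C the carry has to be widened and recomputed by hand, and the optimiser rarely turns that back into an adc chain.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not the library's API): add two n-limb numbers,
 * least-significant limb first, returning the final carry out.  In asm
 * this loop body is one adc instruction; here the carry must be carried
 * through a 64-bit temporary by hand on every iteration. */
uint32_t add_limbs(uint32_t *dest, const uint32_t *a, const uint32_t *b,
                   size_t n)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t sum = (uint64_t)a[i] + b[i] + carry;
        dest[i] = (uint32_t)sum;   /* low 32 bits of the limb sum */
        carry   = sum >> 32;       /* carry into the next limb */
    }
    return (uint32_t)carry;
}
```

Keeping the carry live in the flags register across the whole loop, as the hand-written asm does, is exactly the kind of transformation compilers tend to miss.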


Interesting.
I wish I could paste some counter-examples, but they're all proprietary >_<

I think the key detail here is where you stated that they _always_ include
a loop. Is this because it's hard to manipulate the compiler into the
correct interaction with the flags register?

No. It's just because speed doesn't matter outside loops. A consequence of having the loop inside the asm code is that parameter passing is much less significant for speed, and the calling convention is the big cost.
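A toy illustration of that (hypothetical functions, nothing to do with the library): the fixed cost of a call, which is argument setup dictated by the calling convention plus call/ret and clobbered registers, is paid per call. So it matters enormously for a routine invoked per element and hardly at all for one that loops internally.

```c
#include <stddef.h>
#include <stdint.h>

/* Loop outside the helper: the calling convention's fixed cost is paid
 * once per element (imagine accumulate_one were an opaque asm routine,
 * so the compiler could not inline the call away). */
static uint64_t accumulate_one(uint64_t acc, uint32_t x)
{
    return acc + x;
}

uint64_t sum_loop_outside(const uint32_t *p, size_t n)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc = accumulate_one(acc, p[i]);
    return acc;
}

/* Loop inside: one call total, so parameter passing is a rounding error.
 * This is the shape of the routines under discussion. */
uint64_t sum_loop_inside(const uint32_t *p, size_t n)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += p[i];
    return acc;
}
```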

I'd be interested to compare the compiled D code, and your hand written
asm code, to see where exactly the optimiser goes wrong. It doesn't look
like you're exploiting too many tricks (at a brief glance), it's just
nice tight hand written code, which the optimiser should theoretically
be able to get right...

Theoretically, yes. In practice, DMD doesn't get anywhere near, and gcc isn't much better. I don't think there's any reason why they couldn't, but I don't have much hope that they will.

As you say, the code looks fairly straightforward, but actually there are very many similar ways of writing the code, most of which are much slower. There are many bottlenecks you need to avoid. I was only able to get it to that speed by using the processor profiling registers.

So, my original two uses for asm are actually:
(1) when the language prevents you from accessing low-level functionality; and
(2) when the optimizer isn't good enough.

I find optimisers are very good at code simplification, assuming that
you massage the code/expressions to neatly match any architectural quirks.
I also appreciate that x86 is possibly the hardest architecture
for an optimiser to get right...

Optimizers improved enormously during the '80s and '90s, but the rate of improvement seems to have slowed.

With x86, out-of-order execution has made it very easy to get reasonably good code, and much harder to achieve perfection. Still, Core i7 is much easier than Core2, since Intel removed one of the most complicated bottlenecks (on core2 and earlier there is a max 3 reads per cycle, of registers you haven't written to in the previous 3 cycles).
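For the curious, a hypothetical fragment (not from any real codebase) showing the Core2 bottleneck being described. Assume all six registers were last written many cycles ago, so their values live in the permanent register file rather than the bypass network:

```
add EDX, EAX
add ESI, EBX
add EDI, ECX
; Renaming this group needs six reads from the permanent register file
; in one cycle, double Core2's limit of three, so the front end stalls.
; Scheduling the code so each source register was written within the
; previous few cycles avoids the stall; on Core i7 the restriction is
; gone entirely.
```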
