On 20/06/12 13:22, Manu wrote:
On 20 June 2012 13:59, Don Clugston <[email protected]> wrote:

    You and I seem to be from different planets. I have almost never
    written an asm function which was suitable for inlining.

    Take a look at std.internal.math.biguintX86.d

    I do not know how to write that code without inline asm.
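The core difficulty is carry propagation. A minimal portable-C sketch of it (a hypothetical helper, not the actual biguintX86.d code): multi-word arithmetic needs the carry out of each limb fed into the next. In x86 asm the loop body is essentially one adc per limb, with the carry kept in the flags register; in portable C the carry has to be widened and recomputed by hand, and the optimiser rarely turns that back into an adc chain.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not the library's API): add two n-limb numbers,
 * least-significant limb first, returning the final carry out.  In asm
 * this loop body is one adc instruction; here the carry must be carried
 * through a 64-bit temporary by hand on every iteration. */
uint32_t add_limbs(uint32_t *dest, const uint32_t *a, const uint32_t *b,
                   size_t n)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t sum = (uint64_t)a[i] + b[i] + carry;
        dest[i] = (uint32_t)sum;   /* low 32 bits of the limb sum */
        carry   = sum >> 32;       /* carry into the next limb */
    }
    return (uint32_t)carry;
}
```

Keeping the carry live in the flags register across the whole loop, as the hand-written asm does, is exactly the kind of transformation compilers tend to miss.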


Interesting.
I wish I could paste some counter-examples, but they're all proprietary >_<

I think the key detail here is where you stated that they _always_ include
a loop. Is this because it's hard to manipulate the compiler into the
correct interaction with the flags register?

No. It's just because speed doesn't matter outside loops. A consequence of having the loop inside the asm code is that parameter passing is much less significant for speed, and the calling convention is the big cost.
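A toy illustration of that (hypothetical functions, nothing to do with the library): the fixed cost of a call, which is argument setup dictated by the calling convention plus call/ret and clobbered registers, is paid per call. So it matters enormously for a routine invoked per element and hardly at all for one that loops internally.

```c
#include <stddef.h>
#include <stdint.h>

/* Loop outside the helper: the calling convention's fixed cost is paid
 * once per element (imagine accumulate_one were an opaque asm routine,
 * so the compiler could not inline the call away). */
static uint64_t accumulate_one(uint64_t acc, uint32_t x)
{
    return acc + x;
}

uint64_t sum_loop_outside(const uint32_t *p, size_t n)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc = accumulate_one(acc, p[i]);
    return acc;
}

/* Loop inside: one call total, so parameter passing is a rounding error.
 * This is the shape of the routines under discussion. */
uint64_t sum_loop_inside(const uint32_t *p, size_t n)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += p[i];
    return acc;
}
```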

I'd be interested to compare the compiled D code, and your hand written
asm code, to see where exactly the optimiser goes wrong. It doesn't look
like you're exploiting too many tricks (at a brief glance), it's just
nice tight hand written code, which the optimiser should theoretically
be able to get right...

Theoretically, yes. In practice, DMD doesn't get anywhere near, and gcc isn't much better. I don't think there's any reason why they couldn't, but I don't have much hope that they will.

As you say, the code looks fairly straightforward, but actually there are very many similar ways of writing the code, most of which are much slower. There are many bottlenecks you need to avoid. I was only able to get it to that speed by using the processor profiling registers.

So, my original two uses for asm are actually:
(1) when the language prevents you from accessing low-level functionality; and
(2) when the optimizer isn't good enough.

I find optimisers are very good at code simplification, assuming that
you massage the code/expressions to neatly match any architectural quirks.
I also appreciate that x86 is possibly the hardest architecture
for an optimiser to get right...

Optimizers improved enormously during the '80s and '90s, but the rate of improvement seems to have slowed.

With x86, out-of-order execution has made it very easy to get reasonably good code, and much harder to achieve perfection. Still, Core i7 is much easier than Core2, since Intel removed one of the most complicated bottlenecks (on core2 and earlier there is a max 3 reads per cycle, of registers you haven't written to in the previous 3 cycles).
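For the curious, a hypothetical fragment (not from any real codebase) showing the Core2 bottleneck being described. Assume all six registers were last written many cycles ago, so their values live in the permanent register file rather than the bypass network:

```
add EDX, EAX
add ESI, EBX
add EDI, ECX
; Renaming this group needs six reads from the permanent register file
; in one cycle, double Core2's limit of three, so the front end stalls.
; Scheduling the code so each source register was written within the
; previous few cycles avoids the stall; on Core i7 the restriction is
; gone entirely.
```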
