ni...@lysator.liu.se (Niels Möller) writes:

  Same as in the current (from 2013) version. Delaying the write is a bit tricky, since we already use all registers. But it would be better to update the quotient limbs in memory only in the unlikely carry-propagation case. I figure adc to memory is no worse than explicit load, adc, store (or adc from memory, store)?

Which is worse depends on CPU and magic.

I did not realise that the register pressure was so bad. Perhaps trying to decrease that would be most helpful. Sometimes, when values tend to naturally migrate, some unrolling can reduce register pressure.

When I struggle with register pressure, I usually annotate the code with which regs are live and where each dies. That can then steer pressure-reducing transformations.

If requiring mulx helps, I would for now forget about mul. All relevant CPUs have mulx.

I could try moving the dec. I often try to insert independent instructions between dependent ones, but perhaps that's bad in this case (and generally not very helpful on processors with powerful out-of-order capabilities).

Insn fusion happens only with branches (and perhaps cmov) and iirc only if the insns are adjacent.

Does that mean that 1.5 (2100 / 1400) is the right factor? Then it's more like 11.0 c/l vs 11.5.

You could explore this by testing some plain loop, e.g.

	.text
	.globl main
main:
	mov	$ASSUMED_FREQUENCY, %rax
1:	dec	%rdx
	dec	%rdx
	dec	%rdx
	dec	%rdx
	dec	%rdx
	dec	%rdx
	dec	%rdx
	dec	%rdx
	dec	%rdx
	dec	%rdx
	dec	%rax
	jnz	1b
	ret

If ASSUMED_FREQUENCY is right, it should take 10 seconds. Else, I leave it as an exercise to compute the actual frequency.

-- 
Torbjörn
Please encrypt, key id 0xC8601622
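
The exercise of computing the actual frequency from the measured run time could be sketched in C roughly as follows. This is only an illustration, not code from the thread: the names estimate_hz and time_dec_chain are made up here, and it assumes x86-64, GCC or Clang inline asm, POSIX clock_gettime, and that a chain of dependent dec instructions retires at one per cycle.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: the loop executes iterations * chain_len dependent
   decs; assuming one cycle each, frequency = cycles / elapsed seconds. */
static double estimate_hz(uint64_t iterations, unsigned chain_len, double seconds)
{
    assert(seconds > 0.0);
    return (double)iterations * chain_len / seconds;
}

/* Hypothetical helper: run the 10-dec chain `iterations` times and
   return the elapsed wall-clock time in seconds. */
static double time_dec_chain(uint64_t iterations)
{
    struct timespec t0, t1;
    uint64_t n = iterations;   /* loop counter, plays the role of %rax */
    uint64_t x = 0;            /* dependent chain, plays the role of %rdx */

    if (iterations == 0)
        return 0.0;            /* the asm loop below needs at least one pass */

    clock_gettime(CLOCK_MONOTONIC, &t0);
#if defined(__x86_64__)
    __asm__ volatile(
        "1:\n\t"
        "dec %1\n\t" "dec %1\n\t" "dec %1\n\t" "dec %1\n\t" "dec %1\n\t"
        "dec %1\n\t" "dec %1\n\t" "dec %1\n\t" "dec %1\n\t" "dec %1\n\t"
        "dec %0\n\t"
        "jnz 1b"
        : "+r"(n), "+r"(x)
        :
        : "cc");
#endif
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
}
```

From a main of your own, estimate_hz(n, 10, time_dec_chain(n)) with n around 10^9 gives an estimate in Hz. If the core changes clock speed during the run, the result is an average over the run.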