Torbjörn Granlund <t...@gmplib.org> writes:

> I think you should delay writing through QP to avoid adc to a memory
> place, and have just one plain write through QP per iteration.
>
> The dec UN and the branch might run faster if put adjacent to each
> other, as many CPUs fuse these into a single instruction.
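For readers following along, the suggestion quoted above could be sketched in C roughly as follows. This is only an illustration under assumptions, not the actual assembly loop: the function name store_delayed and the limb/carry input arrays are hypothetical stand-ins for whatever each loop iteration produces, with limb[i] belonging at position i and carry[i] (0 or 1) to be added at position i+1. The newest quotient limb is held in a register, so each iteration does one plain store, and limbs already in memory are touched only in the unlikely case that the carry propagates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: write n quotient limbs to qp[0..n-1], going from
   the most significant position (n-1) down to 0. Step i yields limb[i]
   for position i and carry[i] to be added at position i+1. Returns what
   carries out above position n-1. */
static uint64_t
store_delayed (uint64_t *qp, const uint64_t *limb,
               const unsigned char *carry, size_t n)
{
  assert (n > 0);

  uint64_t top = carry[n - 1];     /* carry above the top limb */
  uint64_t pending = limb[n - 1];  /* newest limb, not yet stored */

  for (size_t i = n - 1; i-- > 0; )
    {
      /* pending belongs at position i+1; fold in this step's carry. */
      uint64_t t = pending + carry[i];
      if (t < pending)
        {
          /* Unlikely: pending wrapped around, so propagate into the
             limbs already in memory, at positions i+2 .. n-1. */
          size_t j = i + 2;
          while (j < n && ++qp[j] == 0)
            j++;
          if (j == n)
            top++;
        }
      qp[i + 1] = t;               /* the one plain store per iteration */
      pending = limb[i];
    }
  qp[0] = pending;
  return top;
}
```

In the real loop the pending limb would need a register of its own, which is where the register pressure comes in.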

Same as in the current (from 2013) version. Delaying the write is a bit
tricky, since we already use all registers. But it would be better to
update the quotient limbs in memory only in the unlikely
carry-propagation case. I figure adc to memory is no worse than explicit
load, adc, store (or adc from memory, store)?

I could try moving the dec. I often try to insert independent
instructions between dependent ones, but perhaps that's bad in this
case (and generally not very helpful on processors with powerful
out-of-order capabilities).

> Your cycle numbers should probably be multiplied by a factor
>
>   ("turbo" frequency) / (nominal frequency)
>
> as 7.x c/l seems faster than we ever measured.

lscpu says

  Model name:       AMD Ryzen 5 PRO 4650U with Radeon Graphics
  Stepping:         1
  Frequency boost:  enabled
  CPU MHz:          1397.125
  CPU max MHz:      2100.0000
  CPU min MHz:      1400.0000

Does that mean that 1.5 (2100 / 1400) is the right factor? Then it's
more like 11.0 c/l vs 11.5.

At least benchmark numbers are a lot more consistent between runs on
this machine than they were on my previous laptop.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel