"Marco Bodrato" <bodr...@mail.dm.unipi.it> writes: This means we are currently work on the _1o1o variants.
Yep.  But the other entry points will be one or two extra insns.

> May I propose a small latency micro-optimisation for the two
> just-proposed x86_64 variants?  The idea is not to use the register
> %r10 at all, and to keep the value of v0 directly in %rax, so that it
> is already in place when the function returns.

Your bd2 seems to cause no slowdowns and is shorter, so feel free to
commit.

Your core2 code is considerably faster for nhm and wsm, somewhat slower
for hwl, bwl and sky, and makes no difference for the other CPUs which
use this code.

I tried another variant of the code, with 2x unrolling in order to
alternate the use of rax and v0; this removes a mov insn from one code
path:

        FUNC_ENTRY(2)
        jmp     L(e)

        ALIGN(16)                       C              K10   BD1   CNR   NHM   SBR
L(top): cmovc   %rax, u0                C u = |v - u|  0,3   0,3   0,6   0,5   0,5
        cmovc   %r9, v0                 C v = min(u,v) 0,3   0,3   2,8   1,7   1,7
        bsf     %rax, %rcx              C              3     3     6     5     5
        shr     R8(%rcx), u0            C              1,7   1,6   2,8   2,8   2,8
L(e):   mov     v0, %rax                C              1     1     4     3     3
        sub     u0, v0                  C v - u        2     2     5     4     4
        mov     u0, %r9                 C              2     2     3     3     4
        sub     %rax, u0                C u - v        2     2     4     3     4
        jz      L(end)                  C

        cmovc   v0, u0                  C u = |v - u|  0,3   0,3   0,6   0,5   0,5
        cmovc   %r9, %rax               C v = min(u,v) 0,3   0,3   2,8   1,7   1,7
        bsf     v0, %rcx                C              3     3     6     5     5
        shr     R8(%rcx), u0            C              1,7   1,6   2,8   2,8   2,8
        mov     %rax, v0                C              1     1     4     3     3
        sub     u0, %rax                C v - u        2     2     5     4     4
        mov     u0, %r9                 C              2     2     3     3     4
        sub     v0, u0                  C u - v        2     2     4     3     4
        jnz     L(top)                  C

L(e2):  mov     v0, %rax
L(end): FUNC_EXIT()
        ret

Unfortunately, this code is not always an improvement either.  It is
faster for cnr, pnr, bwl and sky.  It is slower than your code for nhm
and wsm.

-- 
Torbjörn
Please encrypt, key id 0xC8601622
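For reference, both halves of the unrolled loop perform the same
single-limb binary GCD step noted in the comments: v = min(u,v) and
u = |u - v| shifted right by its trailing-zero count, repeated until
u == v.  A minimal C sketch of that recurrence, with illustrative names
(not GMP's actual code) and a GCC/Clang __builtin_ctzll standing in for
bsf:

  #include <stdint.h>

  /* Illustrative sketch only: binary GCD of two odd 64-bit limbs, as in
     the loop above.  Each iteration sets v = min(u,v) and
     u = |u - v| >> ctz(|u - v|); the common value at exit is the gcd.  */
  static uint64_t
  gcd_11_sketch (uint64_t u, uint64_t v)
  {
    while (u != v)
      {
        uint64_t min = u < v ? u : v;         /* cmovc: v = min(u,v) */
        uint64_t d = u < v ? v - u : u - v;   /* cmovc: u = |v - u|  */
        u = d >> __builtin_ctzll (d);         /* bsf + shr           */
        v = min;
      }
    return u;
  }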