"Marco Bodrato" <bodr...@mail.dm.unipi.it> writes:

  This means we are currently working on the _1o1o variants.

Yep.  But the other entry points will be one or two extra insns.

  May I propose a small latency micro-optimisation for two of the x86_64
  variants just proposed? The idea is not to use the register %r10 at all,
  and to keep the value of v0 directly in %rax, so that it is already in
  place when the function returns.

Your bd2 seems to cause no slowdowns and is shorter, so feel free to
commit.

Your core2 code is considerably faster for nhm and wsm, somewhat slower
for hwl, bwl, sky, and makes no difference for other CPUs which use this
code.

I tried another variant of the code, with 2x unrolling in order to
alternate the use of rax and v0; this removes a mov insn from one code
path:

        FUNC_ENTRY(2)
        jmp     L(e)            C

        ALIGN(16)               C              K10 BD1 CNR NHM SBR
L(top): cmovc   %rax, u0        C u = |v - u|  0,3 0,3 0,6 0,5 0,5
        cmovc   %r9, v0         C v = min(u,v) 0,3 0,3 2,8 1,7 1,7
        bsf     %rax, %rcx      C              3   3   6   5   5
        shr     R8(%rcx), u0    C              1,7 1,6 2,8 2,8 2,8
L(e):   mov     v0, %rax        C              1   1   4   3   3
        sub     u0, v0          C v - u        2   2   5   4   4
        mov     u0, %r9         C              2   2   3   3   4
        sub     %rax, u0        C u - v        2   2   4   3   4
        jz      L(end)          C

        cmovc   v0, u0          C u = |v - u|  0,3 0,3 0,6 0,5 0,5
        cmovc   %r9, %rax       C v = min(u,v) 0,3 0,3 2,8 1,7 1,7
        bsf     v0, %rcx        C              3   3   6   5   5
        shr     R8(%rcx), u0    C              1,7 1,6 2,8 2,8 2,8
        mov     %rax, v0        C              1   1   4   3   3
        sub     u0, %rax        C v - u        2   2   5   4   4
        mov     u0, %r9         C              2   2   3   3   4
        sub     v0, u0          C u - v        2   2   4   3   4
        jnz     L(top)          C

L(e2):  mov     v0, %rax
L(end): FUNC_EXIT()
        ret
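
For readers following along without the asm in front of them, here is a
rough C model of what one iteration of the loop above computes.  It is only
a sketch: the function name gcd_11_model is made up for illustration, both
operands are assumed to be odd (as the comments u = |v - u|, v = min(u,v)
suggest), and it says nothing about register allocation, the rax/v0
alternation, or timing.

```c
#include <stdint.h>

/* Hypothetical C model of the loop; NOT the GMP implementation.
   Assumption: u and v are both odd, so u - v is even and nonzero
   whenever u != v. */
uint64_t gcd_11_model(uint64_t u, uint64_t v)
{
    while (u != v) {
        /* The asm computes both u - v and v - u and picks one with cmovc. */
        uint64_t t = (u < v) ? v - u : u - v;   /* |u - v| */

        if (u < v)
            v = u;                              /* v = min(u, v) */

        /* Strip trailing zero bits (the bsf + shr pair in the asm). */
        while ((t & 1) == 0)
            t >>= 1;

        u = t;                                  /* u is odd again */
    }
    return v;                                   /* u == v == gcd */
}
```

The 2x unrolling in the asm just swaps which register plays the role of v
in alternate iterations; the arithmetic per iteration is the same as in this
model.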


Unfortunately, this code is not always an improvement either.  It is
faster for cnr, pnr, bwl and sky.  It is slower than your code for nhm
and wsm.


-- 
Torbjörn
Please encrypt, key id 0xC8601622