Niels Möller writes:

  In Chapter 3, multiplication instructions are listed in a table starting on
  page "3-14". But now I see I read the entry for a smaller data size. For
  32-bit inputs, it's apparently 2 cycles, not 1.
It seems to be 2 cycles indeed:

        .globl  main
        .type   main, #function
main:
        mov     r0, #1006632960         @ loop count (0x3c000000)
1:      subs    r0, r0, #1              @ sets the flags tested by bne
        vmull.u32       q2, d0, d0      @ two 32 x 32 -> 64 products
        vmull.u32       q4, d0, d0
        vmull.u32       q6, d0, d0
        vmull.u32       q8, d0, d0
        bne     1b
        mov     pc, lr
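
For reference, the arithmetic behind reading a per-instruction cost off such a loop can be sketched in C. This is only a sketch: the elapsed time and clock frequency are hypothetical inputs (no measurements are given here), the helper names are made up for illustration, and it assumes the subs/bne overhead overlaps with the NEON work.

```c
#include <assert.h>
#include <stdint.h>

/* Loop shape taken from the assembly listing. */
#define ITERATIONS      1006632960ULL  /* initial value of r0 (0x3c000000) */
#define VMULLS_PER_ITER 4ULL           /* four vmull.u32 per loop pass */

/* Total number of vmull.u32 instructions the loop executes. */
static uint64_t total_vmulls(void)
{
    return ITERATIONS * VMULLS_PER_ITER;
}

/* Estimated cycles per vmull.u32 from a wall-clock measurement.
   elapsed_s and clock_hz are hypothetical inputs supplied by the caller;
   loop overhead is assumed to be hidden behind the multiplies. */
static double cycles_per_vmull(double elapsed_s, double clock_hz)
{
    assert(elapsed_s > 0.0 && clock_hz > 0.0);
    return elapsed_s * clock_hz / (double) total_vmulls();
}
```

With these definitions, a run taking about 8.05 s on a hypothetical 1 GHz core would correspond to roughly 2 cycles per vmull.u32.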

But IIUC, since each vmull.u32 does two 32 x 32 -> 64 multiplies, we are
thus performing one 32 x 32 -> 64 mul per cycle.
Can one stick an addition in here without consuming extra cycles?
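
To spell out the per-cycle figure, here is a minimal scalar model in C of what one vmull.u32 computes, plus the throughput arithmetic. The function names are made up for illustration, and the model says nothing about actual pipeline behaviour; it only assumes the 2-cycle issue cost observed above.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of "vmull.u32 qd, dn, dm": each of the two 32-bit lanes
   of dn and dm is multiplied into a full 64-bit lane of qd. */
static void vmull_u32_model(uint64_t qd[2],
                            const uint32_t dn[2], const uint32_t dm[2])
{
    for (int lane = 0; lane < 2; lane++)
        qd[lane] = (uint64_t) dn[lane] * dm[lane];
}

/* 32 x 32 -> 64 products per cycle, given the number of lanes per
   instruction and an assumed cost in cycles per instruction. */
static double products_per_cycle(unsigned lanes, double cycles_per_insn)
{
    assert(cycles_per_insn > 0.0);
    return lanes / cycles_per_insn;
}
```

Two lanes at 2 cycles per instruction gives products_per_cycle(2, 2.0) == 1.0, i.e. one product per cycle.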

gmp-devel mailing list
