On 2013-02-27 13:27, Torbjorn Granlund wrote:
Specific questions:
* I completely ignore alignment.  Is that bad?

I'm not sure about that. It's something that perhaps we should experiment with. As written, the code will work, as the chip will handle totally unaligned data. What I don't know is whether *specifying* increased alignment in the insn helps. E.g.

        vld1.32         { q1, q2 }, [r0@128]!

As specified in section A.3.2.1, if you specify the alignment it will also be checked, so you'll get SIGBUS if it's not right.
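
Because a wrong alignment hint faults, any experiment with the 128-bit qualifier would need to confirm that the pointer really is 16-byte aligned first. The following is only a minimal sketch in portable C, not code from the routine under discussion; the helper names `is_aligned` and `limbs_to_align16` are hypothetical, and it assumes the operand is an array of 32-bit limbs that is at least 4-byte aligned.

```c
#include <stddef.h>
#include <stdint.h>

/* Nonzero if p is a multiple of align bytes; align must be a power of two. */
static int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Hypothetical helper: how many leading 32-bit limbs must be handled by
   unhinted code before p reaches a 16-byte (128-bit) boundary.
   Assumes p is at least 4-byte aligned. */
static size_t limbs_to_align16(const uint32_t *p)
{
    return ((16 - ((uintptr_t)p & 15)) & 15) / 4;
}
```

A caller could peel off `limbs_to_align16(up)` limbs with the unhinted loads and only then switch to loads that carry the alignment qualifier.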

* Can 32 bits be read to a dN register with zeroing of the other 32
   bits?  (See comment "surely we can read...".)

No.  But you don't have to go through a core register as you did,
you can read directly into a single lane:

        vmov.i64        d0, #0
        vld1.i32        {d0[0]}, [up]!
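
The effect of that two-instruction pair can be modelled in plain C. This is a scalar illustration only, assuming a D register is viewed as two 32-bit lanes with lane 0 as the low half; the `dreg` type and the function name are made up for the sketch.

```c
#include <stdint.h>
#include <string.h>

/* Model of a 64-bit D register as two 32-bit lanes; lane 0 is the low half. */
typedef struct { uint32_t lane[2]; } dreg;

/* Model of clearing d0 and then loading one 32-bit element into d0[0]
   with post-increment of the pointer. */
static dreg load_low_lane_zeroed(const uint32_t **up)
{
    dreg d;
    memset(&d, 0, sizeof d);  /* vmov.i64 d0, #0 clears both lanes */
    d.lane[0] = **up;         /* single-lane load writes lane 0 only */
    *up += 1;                 /* "!" writeback: advance by one element */
    return d;
}
```

The upper lane ends up zero only because of the explicit clear; the lane load by itself leaves the other lane unchanged.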

* Could one shave off an instruction in the final accumulation?  We don't
   really need 64-bit accumulators.

How about:
                                        C we have 8 16-bit counts
L(e0):  vpaddl.u16      q8, q8          C we have 4 32-bit counts
        vmov            r0, r1, d16
        vmov            r2, r3, d17
        add             r0, r0, r1
        add             r2, r2, r3
        add             r0, r0, r2

It trades one vpaddl for two add insns, but the total latency is probably a cycle or two better since we're now operating in core.
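
For reference, the arithmetic of that sequence can be checked with a scalar C model. This is a sketch under the assumption that q8 holds eight unsigned 16-bit counts and that d16 and d17 are its low and high halves; the function names are illustrative, and the model says nothing about latency.

```c
#include <stdint.h>

/* Model of vpaddl.u16 q8, q8: add adjacent 16-bit lanes pairwise,
   widening each sum to 32 bits. */
static void vpaddl_u16(const uint16_t in[8], uint32_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = (uint32_t)in[2 * i] + in[2 * i + 1];
}

/* Model of the whole tail: one pairwise add, then three core adds. */
static uint32_t final_sum(const uint16_t counts[8])
{
    uint32_t q[4];
    vpaddl_u16(counts, q);          /* 4 32-bit counts */
    uint32_t r0 = q[0], r1 = q[1];  /* vmov r0, r1, d16 */
    uint32_t r2 = q[2], r3 = q[3];  /* vmov r2, r3, d17 */
    r0 += r1;                       /* add r0, r0, r1 */
    r2 += r3;                       /* add r2, r2, r3 */
    return r0 + r2;                 /* add r0, r0, r2 */
}
```

Even with all eight counts at 0xFFFF the total fits comfortably in 32 bits, which is why 64-bit accumulators are not needed here.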

* Can one read four 128-bit values using just one insn (for inner loop)?

No. We can only read 4 64-bit values. I didn't actually realize the assembler would accept Q registers in the <list> grammar non-terminal.


r~
