On 2013-02-27 13:27, Torbjorn Granlund wrote:
Specific questions:
* I completely ignore alignment. Is that bad?
I'm not sure about that. It's something that perhaps we should
experiment with. As written, the code will work, as the chip will
handle totally unaligned data. What I don't know is whether
*specifying* increased alignment in the insn helps. E.g.
        vld1.32 { q1, q2 }, [r0@128]!
As specified in section A.3.2.1, if you specify the alignment it will
also be checked, so you'll get SIGBUS if it's not right.
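
To make the alignment point concrete, here is a minimal sketch in plain C of how a caller could decide whether the [r0@128] form is safe. The helper names is_aligned and limbs_to_align16 are hypothetical (not part of GMP), and the second helper assumes 32-bit limbs that are at least 4-byte aligned. Whether peeling limbs like this pays off is exactly the experiment mentioned above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if p is a multiple of `align` bytes.
   `align` must be a power of two; for the @128 hint it would be 16. */
static int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Hypothetical helper: number of 32-bit limbs to process with unaligned
   loads before p reaches a 16-byte boundary.
   Assumption: p is at least 4-byte aligned. */
static size_t limbs_to_align16(const uint32_t *p)
{
    return ((16 - ((uintptr_t)p & 15)) & 15) / 4;
}
```

A prologue would handle limbs_to_align16(up) limbs first, after which the aligned form cannot fault on alignment.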
* Can 32 bits be read to a dN register with zeroing of the other 32
bits? (See comment "surely we can read...".)
No. But you don't have to go through a core register as you did,
you can read directly into a single lane:
        vmov.i64 d0, #0
        vld1.i32 {d0[0]}, [up]!
* Could one shave off an instruction in the final accumulation? We don't
really need 64-bit accumulators.
How about:
                                        C we have 8 16-bit counts
L(e0):  vpaddl.u16      q8, q8          C we have 4 32-bit counts
        vmov            r0, r1, d16
        vmov            r2, r3, d17
        add             r0, r0, r1
        add             r2, r2, r3
        add             r0, r0, r2
It trades one vpaddl for two add insns, but the total latency is probably
a cycle or two better since we're now operating in core registers.
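
For reference, a scalar C model of what that tail sequence computes. This is only a sketch for checking the arithmetic, not NEON code: the function name reduce_counts is hypothetical, and the array c[] stands in for the eight 16-bit lanes of q8.

```c
#include <stdint.h>

/* Scalar model of the tail: c[0..7] are the eight 16-bit lanes of q8. */
static uint32_t reduce_counts(const uint16_t c[8])
{
    uint32_t w[4];

    /* vpaddl.u16 q8, q8: add adjacent 16-bit lanes, widening to 32 bits */
    for (int i = 0; i < 4; i++)
        w[i] = (uint32_t)c[2 * i] + c[2 * i + 1];

    /* vmov r0, r1, d16 and vmov r2, r3, d17 move the four 32-bit lanes
       to core registers; the three adds below finish the sum. */
    uint32_t r0 = w[0] + w[1];
    uint32_t r2 = w[2] + w[3];
    return r0 + r2;
}
```

Even with every 16-bit lane at its maximum the total is 8 * 65535, which fits easily in 32 bits, so the 64-bit accumulators are not needed at this point.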
* Can one read four 128-bit values using just one insn (for inner loop)?
No. We can only read 4 64-bit values. I didn't actually realize the
assembler would accept Q registers in the <list> grammar non-terminal.
r~