ARM Neon popcount

2013-02-27 Thread Torbjorn Granlund
I decided to play a bit with Neon, but instead of doing something hard
like addmul_k, I wrote an mpn_popcount.  :-)

The code runs well on A15 at about 0.56 c/l, but much worse on A9 at
about 2.8 c/l.  (The inner loop's hard whacking on q8 is a problem on A9;
using q8 and q9 alternately shaves off about 0.4 c/l.  Still
unimpressive.)

I am a novice at Neon hacking, so I am sure this can be improved in
various ways.

Specific questions:
* I completely ignore alignment.  Is that bad?
* Can 32 bits be read to a dN register with zeroing of the other 32
  bits?  (See the comment "surely we can read".)
* Could one shave off an instruction in the final accumulation?  We don't
  really need 64-bit accumulators.
* Can one read four 128-bit values using just one insn (for inner loop)?



arm-popcount.asm
Description: Binary data

-- 
Torbjörn
___
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel


Re: ARM Neon popcount

2013-02-27 Thread Richard Henderson

On 2013-02-27 13:27, Torbjorn Granlund wrote:

> Specific questions:
> * I completely ignore alignment.  Is that bad?


I'm not sure about that.  It's something that perhaps we should 
experiment with.  As written, the code will work, as the chip will 
handle totally unaligned data.  What I don't know is whether 
*specifying* increased alignment in the insn helps.  E.g.


vld1.32 { q1, q2 }, [r0@128]!

As specified in section A.3.2.1, if you specify the alignment it will 
also be checked, so you'll get SIGBUS if it's not right.
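
Because a specified alignment is also checked, any experiment with an alignment qualifier needs the pointer to actually be aligned before the qualified load runs. A minimal C sketch of that check follows. The helper names `is_aligned` and `limbs_until_aligned` are hypothetical and not part of GMP, and the sketch assumes 32-bit limbs that are themselves 4-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: nonzero when p is aligned to `align` bytes.
   `align` must be a power of two (16 for a 128-bit qualifier). */
static int is_aligned(const void *p, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Hypothetical helper: number of 32-bit limbs to handle one at a time
   before p reaches `align`-byte alignment.  Assumes p is 4-byte aligned
   and align is a power of two that is at least 4. */
static size_t limbs_until_aligned(const uint32_t *p, size_t align)
{
    uintptr_t misalignment = (uintptr_t)p & (align - 1);
    return misalignment ? (align - misalignment) / sizeof(uint32_t) : 0;
}
```

A routine could process `limbs_until_aligned(up, 16)` limbs with unqualified loads and only then enter a loop that uses the alignment-qualified form.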



> * Can 32 bits be read to a dN register with zeroing of the other 32
>   bits?  (See the comment "surely we can read".)


No.  But you don't have to go through a core register as you did,
you can read directly into a single lane:

	vmov.i64	d0, #0
	vld1.i32	{d0[0]}, [up]!


> * Could one shave off an instruction in the final accumulation?  We don't
>   really need 64-bit accumulators.


How about:

					C we have 8 16-bit counts
L(e0):	vpaddl.u16	q8, q8		C we have 4 32-bit counts
	vmov	r0, r1, d16
	vmov	r2, r3, d17
	add	r0, r0, r1
	add	r2, r2, r3
	add	r0, r0, r2

It trades 1 vpaddl for two add insns, but the total latency is probably 
a cycle or two better since we're now operating in core.
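
The arithmetic of that sequence can be modelled in portable C. This is only a sketch of the data flow, not the Neon code itself. The function names are hypothetical, and the lane order assumed here (lanes 0 and 1 in d16, lanes 2 and 3 in d17) does not affect the sum.

```c
#include <assert.h>
#include <stdint.h>

/* Model of "vpaddl.u16 q8, q8": add adjacent 16-bit lanes pairwise,
   widening each sum to 32 bits. */
static void pairwise_add_long_u16(const uint16_t in[8], uint32_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = (uint32_t)in[2 * i] + (uint32_t)in[2 * i + 1];
}

/* Model of the two vmov transfers followed by three core adds. */
static uint32_t fold_counts(const uint16_t counts[8])
{
    uint32_t lanes[4];
    pairwise_add_long_u16(counts, lanes);

    uint32_t r0 = lanes[0], r1 = lanes[1];  /* vmov r0, r1, d16 */
    uint32_t r2 = lanes[2], r3 = lanes[3];  /* vmov r2, r3, d17 */
    r0 += r1;                               /* add r0, r0, r1 */
    r2 += r3;                               /* add r2, r2, r3 */
    r0 += r2;                               /* add r0, r0, r2 */

    /* Eight 16-bit counts sum to at most 8 * 65535, so 32 bits suffice
       and no 64-bit accumulator is needed at this stage. */
    assert(r0 <= 8u * 65535u);
    return r0;
}
```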



> * Can one read four 128-bit values using just one insn (for inner loop)?


No.  We can only read 4 64-bit values.  I didn't actually realize the 
assembler would accept Q registers in the list grammar non-terminal.



r~


Re: ARM Neon popcount

2013-02-27 Thread Richard Henderson

On 2013-02-27 14:33, Torbjorn Granlund wrote:

>> vld1.32 { q1, q2 }, [r0@128]!
>>
>> As specified in section A.3.2.1, if you specify the alignment it will
>> also be checked, so you'll get SIGBUS if it's not right.
>
> I wanted to experiment, but I cannot find any syntax which is accepted
> by gas.  @128 does not work (in gas 2.22).


I had to use the disassembler to figure it out.  Gas uses a colon.

vld1.64 {d0-d3}, [r0:128]

While that is not obvious, I should have figured it had to be something else,
since @ begins a comment in ARM assembly.


And, I lied about not being able to read 4 128-bit registers in one insn.
You can't do it with VLD[1-4], but you can with VLDM.

Something else to look at is whether VLDR and VLDM perform better on A9.

The one thing that you do have to worry about there is that VLD[1-4] load 
consecutive elements as defined by the data type, whereas VLD[RM] load full 
64-bit registers.  This distinction matters in big-endian mode.


Of course, the big-endian caveat doesn't apply to popcount.


>> It trades 1 vpaddl for two add insns, but the total latency is
>> probably a cycle or two better since we're now operating in core.
>
> Need to test that, I think.  I fear the corereg-vreg bandwidth might
> be poor.


If it's awful, one could perform the final fold with vadd.i64 d16, d17 and 
perform only one move to r0, swallowing that latency in the function return.



r~


GMP symbol naming (and the history thereof)?

2013-02-27 Thread Richard Henderson
Several times over the past week as I debug my neon routines, it has 
become painfully apparent (as I accidentally single-step into the 
dynamic linker) that the shared libgmp could use some help in 
modernizing its internal linkage.


Most important is arranging for calls within GMP to go through private 
names, so that e.g. mpn_mul doesn't have to go through the PLT in order 
to reach mpn_addmul_1.
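
One common way to get such private names with GCC or Clang on ELF targets is symbol visibility. The sketch below shows the idea only. It is not GMP's actual code and not necessarily the approach meant here. All `demo_` names are hypothetical, and it uses 32-bit limbs so that a 64-bit product stays portable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;  /* hypothetical stand-in for mp_limb_t */

/* Private name: hidden visibility means calls from inside the shared
   library bind directly instead of going through the PLT. */
__attribute__((visibility("hidden")))
limb_t demo_addmul_1_internal(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* up[i]*v + rp[i] + carry cannot overflow 64 bits. */
        uint64_t t = (uint64_t)up[i] * v + rp[i] + carry;
        rp[i] = (limb_t)t;
        carry = (limb_t)(t >> 32);
    }
    return carry;
}

/* Public name: exported wrapper kept for external callers. */
__attribute__((visibility("default")))
limb_t demo_addmul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    return demo_addmul_1_internal(rp, up, n, v);
}

/* Schoolbook multiply; rp must have room for un + vn limbs.
   The internal call uses the private name, so it avoids the PLT. */
__attribute__((visibility("default")))
void demo_mul(limb_t *rp, const limb_t *up, size_t un,
              const limb_t *vp, size_t vn)
{
    assert(un > 0 && vn > 0);
    for (size_t i = 0; i < un + vn; i++)
        rp[i] = 0;
    for (size_t j = 0; j < vn; j++)
        rp[un + j] = demo_addmul_1_internal(rp + j, up, un, vp[j]);
}
```

Which names get the public wrapper is exactly the question of what the public interface is.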


There are several ways to approach this fix, and I can talk about those 
in followups, but to begin we need to have a discussion as to exactly 
what the public interface is.  This dovetails nicely with the work that 
Niels has done recently with tidying up mpn.


What I'm curious about is the __g prefix associated with 95% of the 
symbols.  Why is GMP defining symbols in the namespace reserved to the 
compiler and the implementation, aka the standard C library?  I mean, I 
guess we're not actually getting into trouble for it, since it's 
actually working, but I don't think it's very clean.


If at some point we make some change that requires the bumping of the 
shared library version number, I think it would be a good opportunity to 
drop the prefix, bringing the API and ABI names back in sync.  But until 
we do make some other incompatible ABI change it's probably not worth it.


But the first thing to do is to confirm exactly what API symbols should 
be exported.  I'm going to begin with the assumption that if it isn't 
declared in gmp-h.in, then it's not public.



r~


Re: ARM Neon popcount

2013-02-27 Thread Niels Möller
Richard Henderson r...@twiddle.net writes:

> On 2013-02-27 13:27, Torbjorn Granlund wrote:
>
>> * Can one read four 128-bit values using just one insn (for inner loop)?
>
> No.  We can only read 4 64-bit values.  I didn't actually realize the
> assembler would accept Q registers in the list grammar non-terminal.

What about vldm? Like

	vldm	up!, {q0,q1,q2,q3}

As far as I understand the manual, it supports a larger number of
registers. The registers must be consecutive, but that's no problem
here.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.