ARM Neon popcount
I decided to play a bit with Neon, but instead of doing something hard like addmul_k, I wrote an mpn_popcount. :-)

The code runs well on A15 at about 0.56 c/l, but much worse on A9 at about 2.8 c/l. (The inner loop's hard whacking on q8 is a problem on A9; using q8 and q9 alternatingly shaves off about 0.4 c/l. Still unimpressive.)

I am a novice at Neon hacking, so I am sure this can be improved in various ways.

Specific questions:

* I completely ignore alignment. Is that bad?

* Can 32 bits be read to a dN register with zeroing of the other 32 bits? (See the comment "surely we can read".)

* Could one shave off an instruction in the final accumulation? We don't really need 64-bit accumulators.

* Can one read four 128-bit values using just one insn (for the inner loop)?

[Attachment: arm-popcount.asm]

--
Torbjörn
___
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel
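For readers without the attachment, here is a minimal scalar sketch of what a popcount over a limb array computes. It is not the attached Neon code. The names demo_limb_t and demo_popcount are hypothetical, and 32-bit limbs are assumed because the thread targets 32-bit ARM.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 32-bit limbs, as on 32-bit ARM. Hypothetical stand-in
   for GMP's mp_limb_t. */
typedef uint32_t demo_limb_t;

/* Count the set bits of one limb by clearing the lowest set bit
   until nothing is left. */
static unsigned demo_limb_popcount(demo_limb_t x)
{
    unsigned count = 0;
    while (x != 0) {
        x &= x - 1;   /* clears the lowest set bit */
        count++;
    }
    return count;
}

/* Sum the per-limb bit counts over n limbs starting at up.
   Hypothetical reference routine, not GMP's mpn_popcount itself. */
unsigned long demo_popcount(const demo_limb_t *up, size_t n)
{
    unsigned long total = 0;
    for (size_t i = 0; i < n; i++)
        total += demo_limb_popcount(up[i]);
    return total;
}
```

A routine like this is mainly useful as a correctness reference when testing a vectorized version against random inputs.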
Re: ARM Neon popcount
On 2013-02-27 13:27, Torbjorn Granlund wrote:
> Specific questions:
>
> * I completely ignore alignment. Is that bad?

I'm not sure about that. It's something that perhaps we should experiment with.

As written, the code will work, as the chip will handle totally unaligned data. What I don't know is whether *specifying* increased alignment in the insn helps. E.g.

    vld1.32 { q1, q2 }, [r0@128]!

As specified in section A.3.2.1, if you specify the alignment it will also be checked, so you'll get SIGBUS if it's not right.

> * Can 32 bits be read to a dN register with zeroing of the other 32
>   bits? (See comment surely we can read)

No. But you don't have to go through a core register as you did; you can read directly into a single lane:

    vmov.i64  d0, #0
    vld1.i32  {d0[0]}, [up]!

> * Could one shave off an instruction in the final accumulation? We
>   don't really need 64-bit accumulators.

How about:

    C we have 8 16-bit counts
    L(e0):  vpaddl.u16  q8, q8
    C we have 4 32-bit counts
            vmov    r0, r1, d16
            vmov    r2, r3, d17
            add     r0, r0, r1
            add     r2, r2, r3
            add     r0, r0, r2

It trades 1 vpaddl for two add insns, but the total latency is probably a cycle or two better since we're now operating in core.

> * Can one read four 128-bit values using just one insn (for inner
>   loop)?

No. We can only read 4 64-bit values. I didn't actually realize the assembler would accept Q registers in the list grammar non-terminal.

r~
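The final fold suggested in the message above can be modeled in portable C to check the arithmetic. This is a scalar sketch only: demo_vpaddl_u16 and demo_final_fold are hypothetical names, and the lane-to-register mapping (d16 holding the low two 32-bit lanes of q8, d17 the high two) is the standard Neon register aliasing.

```c
#include <stdint.h>

/* Scalar model of "vpaddl.u16 q8, q8": add adjacent pairs of 16-bit
   lanes, widening each sum to 32 bits. */
static void demo_vpaddl_u16(uint32_t out[4], const uint16_t in[8])
{
    for (int i = 0; i < 4; i++)
        out[i] = (uint32_t)in[2 * i] + (uint32_t)in[2 * i + 1];
}

/* Model of the proposed tail: one vpaddl, two vmov to core registers,
   then three core adds. */
uint32_t demo_final_fold(const uint16_t counts[8])
{
    uint32_t lanes[4];
    demo_vpaddl_u16(lanes, counts);

    uint32_t r0 = lanes[0], r1 = lanes[1];  /* vmov r0, r1, d16 */
    uint32_t r2 = lanes[2], r3 = lanes[3];  /* vmov r2, r3, d17 */

    r0 += r1;                               /* add r0, r0, r1 */
    r2 += r3;                               /* add r2, r2, r3 */
    r0 += r2;                               /* add r0, r0, r2 */
    return r0;
}
```

Since each 32-bit lane holds the sum of only two 16-bit counts, and the final total is at most eight times 65535, none of the core adds can overflow a 32-bit register.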
Re: ARM Neon popcount
On 2013-02-27 14:33, Torbjorn Granlund wrote:
>>     vld1.32 { q1, q2 }, [r0@128]!
>>
>> As specified in section A.3.2.1, if you specify the alignment it will
>> also be checked, so you'll get SIGBUS if it's not right.
>
> I wanted to experiment, but I cannot find any syntax which is accepted
> by gas. @128 does not work (in gas 2.22).

I had to use the disassembler to figure it out. Gas uses a colon:

    vld1.64 {d0-d3}, [r0:128]

Which, while not obvious, I should have figured had to be something else, since @ begins a comment in ARM assembly.

And, I lied about not being able to read 4 128-bit registers in one insn. You can't do it with VLD[1-4], but you can with VLDM.

Something else to look at is whether VLDR and VLDM perform better on A9.

The one thing that you do have to worry about there is that VLD[1-4] load consecutive elements as defined by the data type, whereas VLD[RM] load full 64-bit registers. This distinction matters in big-endian mode. Of course, the big-endian caveat doesn't apply to popcount.

>> It trades 1 vpaddl for two add insns, but the total latency is
>> probably a cycle or two better since we're now operating in core.
>
> Need to test that, I think. I fear the corereg-vreg bandwidth might be
> poor.

If it's awful, one could perform the final fold with

    vadd.i64 d16, d17

and perform only one move to r0, swallowing that latency in the function return.

r~
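Because an alignment-qualified load such as [r0:128] faults on a misaligned address, code that wants to experiment with the hint has to establish alignment first. A minimal C sketch of the check follows, with hypothetical helper names; it assumes 32-bit limbs that are themselves at least 4-byte aligned.

```c
#include <stddef.h>
#include <stdint.h>

/* Nonzero if p is a multiple of align bytes.
   align must be a power of two. */
int demo_is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Number of leading 32-bit limbs to handle separately before p
   reaches a 16-byte (128-bit) boundary. Assumes p is already
   4-byte aligned, so the byte distance is a multiple of 4. */
size_t demo_limbs_to_align16(const uint32_t *p)
{
    size_t bytes = (16 - ((uintptr_t)p & 15)) & 15;
    return bytes / sizeof(uint32_t);
}
```

One possible design is to peel off that many limbs in a scalar or single-lane prologue and use the :128 form only in the main loop. Whether the hint is worth the extra prologue is exactly the open question in the thread.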
GMP symbol naming (and the history thereof)?
Several times over the past week as I debug my neon routines, it has become painfully apparent (as I accidentally single-step into the dynamic linker) that the shared libgmp could use some help in modernizing its internal linkage.

Most important is arranging for calls within GMP to go through private names, so that e.g. mpn_mul doesn't have to go through the PLT in order to reach mpn_addmul_1.

There are several ways to approach this fix, and I can talk about those in followups, but to begin we need to have a discussion as to exactly what the public interface is. This dovetails nicely with the work that Niels has done recently with tidying up mpn.

What I'm curious about is the __g prefix associated with 95% of the symbols. Why is GMP defining symbols in the namespace reserved to the compiler and the implementation, aka the standard C library? I mean, I guess we're not actually getting into trouble for it, since it's actually working, but I don't think it's very clean.

If at some point we make some change that requires bumping the shared library version number, I think it would be a good opportunity to drop the prefix, bringing the API and ABI names back in sync. But until we do make some other incompatible ABI change, it's probably not worth it.

But the first thing to do is to confirm exactly what API symbols should be exported. I'm going to begin with the assumption that if it isn't declared in gmp-h.in, then it's not public.

r~
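The message leaves the choice of mechanism to followups. Purely as an illustration of one common approach on ELF systems, and not necessarily the one intended here, a library can give its internal helpers hidden visibility so that calls to them bind at link time instead of going through the PLT. The sketch assumes GCC or Clang; every name in it is hypothetical.

```c
#include <stddef.h>

/* Assumption: GCC or Clang. Hidden symbols are left out of the shared
   object's dynamic symbol table, so intra-library calls to them need
   no PLT entry. */
#define DEMO_HIDDEN __attribute__((visibility("hidden")))

/* Private name: reachable from anywhere inside the library, but not
   exported. */
DEMO_HIDDEN unsigned long demo_internal_sum(const unsigned long *p, size_t n);

unsigned long demo_internal_sum(const unsigned long *p, size_t n)
{
    unsigned long s = 0;
    for (size_t i = 0; i < n; i++)
        s += p[i];
    return s;
}

/* Public entry point: exported, and simply forwards to the private
   name. */
unsigned long demo_sum(const unsigned long *p, size_t n)
{
    return demo_internal_sum(p, n);
}

/* Another library routine: it calls the private name directly, so the
   call cannot be interposed and does not use the PLT. */
unsigned long demo_sum_twice(const unsigned long *p, size_t n)
{
    return 2 * demo_internal_sum(p, n);
}
```

The trade-off is that a program can no longer override the internal helper by defining a symbol with the same name, which is part of why the set of public symbols needs to be settled first.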
Re: ARM Neon popcount
Richard Henderson r...@twiddle.net writes:

> On 2013-02-27 13:27, Torbjorn Granlund wrote:
>> * Can one read four 128-bit values using just one insn (for inner
>>   loop)?
>
> No. We can only read 4 64-bit values. I didn't actually realize the
> assembler would accept Q registers in the list grammar non-terminal.

What about vldm? Like

    vldm up!, {q0,q1,q2,q3}

As far as I understand the manual, it supports a larger number of registers. The registers must be consecutive, but that's no problem here.

Regards,
/Niels

--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.