On 29/08/2016 20:46, Richard Henderson wrote:
> Changes from v2 to v3:
>
>   * Unit testing.  This includes having x86 attempt all versions of
>     the accelerator that will run on the hardware.  Thus an avx2 host
>     will run the basic test 5 times (1.5sec on my laptop).
>
>   * Drop the ppc and aarch64 specializations.  I have improved the
>     basic integer version to the point that those vectorized versions
>     are not a win.
>
>     In the case of my aarch64 mustang, the integer version is 4 times
>     faster than the neon version that I delete.  With effort I was
>     able to rewrite the neon version to come to within a factor of 1.1,
>     but it remained slower than the integer.  To be fair, gcc6 makes
>     very good use of ldp, so the integer path is *also* loading 16 bytes
>     per insn.
>
>     I can forward my standalone aarch64 benchmark if anyone is interested.
>
>     Note however that at least the avx2 acceleration is still very much
>     a win, being about 3 times faster on my laptop.  Of course, it's
>     handling 4 times as much data per loop as the integer version, so
>     one can still see the overhead caused by using vector insns.
>
>     For grins I wrote an avx512 version, if someone has a skylake upon
>     which to test and benchmark.  That requires additional configure
>     checks, so I didn't bother to include it here.

Thanks, queued for 2.8.

Paolo