>> More coming in.
>
> Consider 32-bit results. First column is assembly results for base 2^32
> integer-only code in comparison to compiler-generate code. Second column
> is my result for NEON, and last column are results for Andrew Moon's
> NEON implementation, both are base 2^26.
>
> #                 IALU/gcc-4.4    NEON    poly1305-opt
> #
> # Cortex-A5       6.30/+130%      2.96    4.90
> # Cortex-A8       6.25/+115%      2.40    2.36
> # Cortex-A9       5.10/+95%       2.56    2.25
> # Cortex-A15      3.79/+85%       1.30    1.53
> # Snapdragon S4   5.70/+100%      1.48    7.58(?)
>
> As mentioned earlier goal is "all-round" performance, i.e. near-optimal
> performance across *range* of platforms. Judging from Cortex-A9 result I
> have some room for improvement, hopefully it will benefit all
> processors.
After experimenting, I'm leaning toward settling for the above results. They improved a little on a couple of CPUs, but the approach is the same.

What are the approaches? When pulling input data and performing the due conversion to base 2^26, it's possible to

a) do it completely in NEON (above results);
b) do it with integer-only instructions and move data to NEON with inter-register vmov;
c) do it with integer-only instructions and transfer data to NEON through memory.

I found that b) gives ~8% improvement on Cortex-A15 and Snapdragon S4, but hurts the low-end Cortex-A5 and Cortex-A7 by 15% and 12% respectively. Then c) performs like b) on Cortex-A15 and S4 and improves Cortex-A9 by 10%, but the losses on the low-end cores go over 20%.

Keep in mind that there is a certain asymmetry in how losses vs. gains are presented. For example, when we measure a 25% regression, it means that the original is 33% faster (1/0.75 ≈ 1.33). Anyway, the all-NEON approach appears to provide the best "all-round" performance.
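For readers unfamiliar with the "conversion to base 2^26" mentioned above, here is a minimal sketch in Python of what that step computes for one 16-byte Poly1305 block. It is only an illustration of the arithmetic, not the code in the attached module: the names block_to_limbs, limbs_to_int and MASK26 are hypothetical, and it assumes the usual Poly1305 convention of a little-endian block with a pad bit at 2^128 for full blocks. Approaches a), b) and c) differ only in which instructions do this splitting and how the limbs reach the NEON registers.

```python
MASK26 = (1 << 26) - 1  # low 26 bits set


def block_to_limbs(block, padbit=1):
    """Split one 16-byte block into five base 2^26 limbs (hypothetical helper).

    The block is read as a little-endian integer; padbit (assumed to be 1
    for full blocks) is placed at bit 128, so it lands in bit 24 of the
    top limb.
    """
    if len(block) != 16:
        raise ValueError("expected a 16-byte block")
    n = int.from_bytes(block, "little") | (padbit << 128)
    # Limb i holds bits 26*i .. 26*i+25 of the 130-bit value.
    return [(n >> (26 * i)) & MASK26 for i in range(5)]


def limbs_to_int(limbs):
    """Recombine base 2^26 limbs into a single integer (for checking only)."""
    return sum(limb << (26 * i) for i, limb in enumerate(limbs))


if __name__ == "__main__":
    limbs = block_to_limbs(bytes(range(16)))
    print([hex(limb) for limb in limbs])
```

Each limb stays below 2^26, which is what leaves headroom in 32-bit NEON lanes for accumulating products before carries have to be propagated.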
Attachment: poly1305-armv4.pl (Perl program)
_______________________________________________
openssl-dev mailing list
To unsubscribe: https://mta.openssl.org/mailman/listinfo/openssl-dev
