> More coming in. Here are preliminary results for 32- and 64-bit ARM. "Preliminary" means that they are incomplete and subject to change. But in a sense they underpin some of the points in the previous post, both in the message itself and in the source code commentary.
Consider the 32-bit results first. The first column is the result for base 2^32 integer-only assembly, together with its gain relative to compiler-generated code (gcc-4.4). The second column is my result for NEON, and the last column is the result for Andrew Moon's NEON implementation; both are base 2^26.

#                 IALU/gcc-4.4   NEON   poly1305-opt
#
# Cortex-A5       6.30/+130%     2.96   4.90
# Cortex-A8       6.25/+115%     2.40   2.36
# Cortex-A9       5.10/+95%      2.56   2.25
# Cortex-A15      3.79/+85%      1.30   1.53
# Snapdragon S4   5.70/+100%     1.48   7.58(?)

As mentioned earlier, the goal is "all-round" performance, i.e. near-optimal performance across a *range* of platforms. Judging from the Cortex-A9 result I have some room for improvement; hopefully it will benefit all processors.

As for the (?): it's not clear why poly1305-opt performed so poorly on Snapdragon S4. It may have failed to opt for NEON for some reason. I have no way to verify this, because it's somebody else's mobile phone.

Here are some results for the base 2^64 integer-only implementation on 64-bit ARM, alongside base 2^26 32-bit NEON results. The latter means that I haven't ventured into NEON on 64-bit ARM yet, but since performance would be virtually the same (NEON instruction set capabilities are essentially the same and it would be the same base), we can use the 32-bit numbers to compare and assess options.

#              IALU   gcc-4.9   gcc-4.7   NEON   poly1305-opt
#
# Cortex-A53   2.72   4.16      9.09      1.57   2.52
# Cortex-A57   2.70   2.89      6.46      1.30   1.46
# Denver       1.45   2.09      5.63      1.50   1.34

IALU vs. compiler-generated code basically tells the reason why we program in assembly, doesn't it? I mean, if you compare assembly and gcc-4.9 on Cortex-A57, you'd probably say that assembly doesn't make sense. But if you look at the remaining results, you'll see that you are kind of left to the compiler's mercy, and it's not that "mighty" in every situation.

These results also confirm the concern in the commentary section of poly1305.c about base 2^64 not being optimal for every 64-bit case. Indeed, the gcc-4.7 base 2^64 results are actually worse than the base 2^32 ones.
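For readers unfamiliar with the "base" terminology above, here is a minimal C sketch of what base 2^26 and base 2^32 mean for a 128-bit value. It is an illustration only, not code from either implementation being benchmarked; the helper names (load32_le, to_base26, to_base32) are hypothetical. The usual motivation for 2^26 on NEON, as far as I understand it, is that a widening 32x32->64-bit multiply of two 26-bit limbs yields a 52-bit product, which leaves headroom in a 64-bit lane to sum several products before carrying, whereas base 2^32 products fill the whole lane.

```c
#include <assert.h>
#include <stdint.h>

/* Load a 32-bit little-endian word from four bytes. */
static uint32_t load32_le(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Split a 128-bit little-endian value into five limbs, base 2^26.
 * Limbs 0..3 hold 26 bits each, limb 4 holds the remaining 24 bits. */
static void to_base26(uint32_t h[5], const unsigned char in[16])
{
    h[0] =  load32_le(in +  0)       & 0x3ffffff;  /* bits   0..25  */
    h[1] = (load32_le(in +  3) >> 2) & 0x3ffffff;  /* bits  26..51  */
    h[2] = (load32_le(in +  6) >> 4) & 0x3ffffff;  /* bits  52..77  */
    h[3] = (load32_le(in +  9) >> 6) & 0x3ffffff;  /* bits  78..103 */
    h[4] =  load32_le(in + 12) >> 8;               /* bits 104..127 */
}

/* Repack five fully carried base 2^26 limbs into four base 2^32 words. */
static void to_base32(uint32_t w[4], const uint32_t h[5])
{
    int i;

    /* Precondition (assumption of this sketch): no pending carries. */
    for (i = 0; i < 4; i++)
        assert(h[i] <= 0x3ffffff);
    assert(h[4] <= 0xffffff);

    w[0] =  h[0]        | (h[1] << 26);
    w[1] = (h[1] >>  6) | (h[2] << 20);
    w[2] = (h[2] >> 12) | (h[3] << 14);
    w[3] = (h[3] >> 18) | (h[4] <<  8);
}
```

Converting a 16-byte block with to_base26 and then back with to_base32 should give the same four words as loading the block directly with load32_le, which is a cheap way to sanity-check the shift amounts.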
Well, to be honest I was actually referring more to instruction set capabilities, but it can be extended even to the compiler.

_______________________________________________
openssl-dev mailing list
To unsubscribe: https://mta.openssl.org/mailman/listinfo/openssl-dev
