Is big number montogomery multiplication as optimized as it can be for ARM64 as compared to X86-64 from the latest openssl github ? We are not seeing vmull ( or pmull/pmull2) instructions in armv8-mont.pl.
On an ARM cortex-A72 (1GHz) and E5-2620 (2.1 Ghz) we are seeing an order of 10 difference in RSA signing perf for 2048 bit keys. Ran openssl speed rsa2048 Here are the openssl speed numbers. *x86-64* [root@nuosrv2 openssl]# ./apps/openssl speed rsa2048 Doing 2048 bit private rsa's for 10s: 13134 2048 bit private RSA's in 9.97s Doing 2048 bit public rsa's for 10s: 379019 2048 bit public RSA's in 9.98s OpenSSL 1.1.1-dev xx XXX xxxx built on: reproducible build, date unspecified options:bn(64,64) rc4(16x,int) des(int) aes(partial) idea(int) blowfish(ptr) compiler: gcc -DDSO_DLFCN -DHAVE_DLFCN_H -DNDEBUG -DOPENSSL_THREADS -DOPENSSL_NO_STATIC_ENGINE -DOPENSSL_PIC -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DRC4_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DGHASH_ASM -DECP_NISTZ256_ASM -DPADLOCK_ASM -DPOLY1305_ASM -DOPENSSLDIR="\"/usr/local/ssl\"" -DENGINESDIR="\"/usr/local/lib64/engines-1.1\"" -Wa,--noexecstack sign verify sign/s verify/s *rsa 2048 bits 0.000759s 0.000026s 1317.4 37977.9* *arm64:* [root@juno openssl]# ./apps/openssl speed rsa2048 Doing 2048 bit private rsa's for 10s: 1319 2048 bit private RSA's in 9.92s Doing 2048 bit public rsa's for 10s: 49209 2048 bit public RSA's in 9.93s OpenSSL 1.1.1-dev xx XXX xxxx built on: reproducible build, date unspecified options:bn(64,64) rc4(char) des(int) aes(partial) idea(int) blowfish(ptr) compiler: gcc -DDSO_DLFCN -DHAVE_DLFCN_H -DNDEBUG -DOPENSSL_THREADS -DOPENSSL_NO_STATIC_ENGINE -DOPENSSL_PIC -DOPENSSL_BN_ASM_MONT -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DVPAES_ASM -DECP_NISTZ256_ASM -DPOLY1305_ASM -DOPENSSLDIR="\"/usr/local/ssl\"" -DENGINESDIR="\"/usr/local/lib/engines-1.1\"" -Wa,--noexecstack sign verify sign/s verify/s *rsa 2048 bits 0.007521s 0.000202s 133.0 4955.6* * ARM64 heavy hitters* 69.70% openssl libcrypto.so.1.1 [.] __bn_sqr8x_mont 18.64% openssl libcrypto.so.1.1 [.] __bn_mul4x_mont 4.92% openssl libcrypto.so.1.1 [.] MOD_EXP_CTIME_COPY_FROM_PREBUF 1.50% openssl libcrypto.so.1.1 [.] bn_mul_add_words * x86-64 heavy hitters* 30.93% openssl libcrypto.so.1.1 [.] __bn_sqrx8x_reduction 17.65% openssl libcrypto.so.1.1 [.] bn_sqrx8x_internal 12.65% openssl libcrypto.so.1.1 [.] mulx4x_internal 8.91% openssl libcrypto.so.1.1 [.] bn_mul_add_words 7.14% openssl libcrypto.so.1.1 [.] bn_mulx4x_mont Code looks different between x86 and ARM64. Is it due to the ISA or ARM64 not yet catching up with super efficient X86-64. Basically are we stuck with 1:5 (if we extrapolate A72 to 2Ghz) or is there an optimal code that we need to pick up for ARM64. I compiled openssl from github (latest). Any pointers will be extremely helpful. Thanks, -vijay
-- openssl-users mailing list To unsubscribe: https://mta.openssl.org/mailman/listinfo/openssl-users