Hi Andy,
The measurement I sent yesterday for OpenSSL (with inlined T4
instruction support) was not quite accurate.
Some of the T4 specific code you committed was not enabled when we
tested, and I realized that __sparc__ was not defined on our system.
Thus, I changed "#if defined(__sparc__)" to "#if defined(__sparc)".
Now we are seeing better numbers with OpenSSL:
                   sign    verify    sign/s  verify/s
rsa 1024 bits 0.000351s 0.000024s    2852.9   42311.0
rsa 2048 bits 0.001258s 0.000047s     795.1   21128.6
rsa 4096 bits 0.006240s 0.000395s     160.3    2533.3
which is still slower than our t4 engine for 1k and 2k bit RSA sign:
                   sign    verify    sign/s  verify/s
rsa 1024 bits 0.000237s 0.000028s    4221.9   36119.8
rsa 2048 bits 0.000876s 0.000075s    1141.7   13285.6
rsa 4096 bits 0.006341s 0.002139s     157.7     467.5
So I enabled the "warm-up" as you suggested, but the performance numbers
still look the same.
Here is the new bn_mul_mont_t4_8():
bn_mul_mont_t4_8()
bn_mul_mont_t4_8: 8a 10 20 00 clr %g5
bn_mul_mont_t4_8+0x4: 88 10 3f 80 mov -0x80, %g4
bn_mul_mont_t4_8+0x8: 8b 29 70 20 sllx %g5, 0x20, %g5
bn_mul_mont_t4_8+0xc: 9d e3 80 04 save %sp, %g4, %sp
bn_mul_mont_t4_8+0x10: 9d e3 bf 80 save %sp, -0x80, %sp
bn_mul_mont_t4_8+0x14: 9d e3 bf 80 save %sp, -0x80, %sp
bn_mul_mont_t4_8+0x18: 9d e3 bf 80 save %sp, -0x80, %sp
bn_mul_mont_t4_8+0x1c: 9d e3 bf 80 save %sp, -0x80, %sp
bn_mul_mont_t4_8+0x20: 9d e3 bf 80 save %sp, -0x80, %sp
bn_mul_mont_t4_8+0x24: 9d e3 bf 80 save %sp, -0x80, %sp
bn_mul_mont_t4_8+0x28: 81 e8 00 00 restore
bn_mul_mont_t4_8+0x2c: 81 e8 00 00 restore
bn_mul_mont_t4_8+0x30: 81 e8 00 00 restore
bn_mul_mont_t4_8+0x34: 81 e8 00 00 restore
bn_mul_mont_t4_8+0x38: 81 e8 00 00 restore
bn_mul_mont_t4_8+0x3c: 81 e8 00 00 restore
bn_mul_mont_t4_8+0x40: 88 0b a0 01 and %sp, 0x1, %g4
bn_mul_mont_t4_8+0x44: bc 11 40 1e or %g5, %fp, %fp
bn_mul_mont_t4_8+0x48: 8a 11 00 05 or %g4, %g5, %g5
In sparct4-mont.pl I see some 64-bit SPARCv9-specific code, but my
64-bit library doesn't contain those instructions.
It looks like the __arch64__ branch was taken. Did you expect the
SPARCV9_64BIT_STACK section to be compiled in?
.globl bn_mul_mont_t4_$NUM
.align 32
bn_mul_mont_t4_$NUM:
#ifdef __arch64__
mov 0,$sentinel
mov -128,%g4
#elif defined(SPARCV9_64BIT_STACK)
SPARC_LOAD_ADDRESS_LEAF(OPENSSL_sparcv9cap_P,%g1,%g5)
ld [%g1+0],%g1 ! OPENSSL_sparcv9_P[0]
mov -2047,%g4
and %g1,SPARCV9_64BIT_STACK,%g1
movrz %g1,0,%g4
mov -1,$sentinel
add %g4,-128,%g4
#else
mov -1,$sentinel
mov -128,%g4
#endif
sllx $sentinel,32,$sentinel
save %sp,%g4,%sp
#if 1
save %sp,-128,%sp ! warm it up
save %sp,-128,%sp
<-- snip--->
Thank you,
-- misaki
I used a 64-bit openssl binary to measure the performance.
With the above in mind, here is something to test. In
crypto/bn/asm/sparct4-mont.pl there is a register window "warm-up"
sequence that is executed in 32-bit application context only
(benchmarking on Linux had shown that it's not necessary in 64-bit
application context). Could you test engaging it in 64-bit application
context as well? I.e. open crypto/bn/asm/sparct4-mont.pl in a text
editor, locate the "warm it up" comment and replace "#ifndef __arch64__"
in the preceding line with "#if 1".