Hi Andy,

The measurement I sent yesterday for OpenSSL (with inlined T4 instruction support) was not quite accurate. Some of the T4-specific code you committed was not enabled when we tested; I realized that __sparc__ was not defined on our system.
Thus, I changed "#if defined(__sparc__)" to "#if defined(__sparc)".
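
A guard that accepts either spelling would avoid this kind of mismatch between compilers. The following is only a sketch: it assumes that GCC-style compilers predefine __sparc__ while the compiler on our system predefines __sparc, and BUILD_IS_SPARC and build_is_sparc() are hypothetical names used for illustration, not anything in OpenSSL.

```c
/* Sketch: accept both spellings of the SPARC predefined macro.
 * Assumption: some compilers define __sparc__, others only __sparc;
 * check the compiler documentation for the one you actually use. */
#if defined(__sparc) || defined(__sparc__)
# define BUILD_IS_SPARC 1   /* hypothetical macro, for illustration only */
#else
# define BUILD_IS_SPARC 0
#endif

/* Hypothetical helper: returns 1 when this file was compiled for SPARC,
 * 0 otherwise, so the result can be checked at run time. */
int build_is_sparc(void)
{
    return BUILD_IS_SPARC;
}
```

Printing the return value of build_is_sparc() from a small test program is a quick way to confirm which way the guard resolved on a given system.
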
Now we are seeing better numbers with OpenSSL:

                  sign    verify    sign/s verify/s
rsa 1024 bits 0.000351s 0.000024s   2852.9  42311.0
rsa 2048 bits 0.001258s 0.000047s    795.1  21128.6
rsa 4096 bits 0.006240s 0.000395s    160.3   2533.3

This is still slower than our t4 engine for 1024-bit and 2048-bit RSA sign:
                  sign    verify    sign/s verify/s
rsa 1024 bits 0.000237s 0.000028s   4221.9  36119.8
rsa 2048 bits 0.000876s 0.000075s   1141.7  13285.6
rsa 4096 bits 0.006341s 0.002139s    157.7    467.5
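
To put a number on the gap, the sign/s columns of the two tables can be divided. This is just a small sketch of that arithmetic; relative_throughput() is a hypothetical helper name.

```c
#include <assert.h>

/* Throughput of the inlined T4 code relative to the t4 engine.
 * A result below 1.0 means the inlined code is slower. */
double relative_throughput(double inlined_ops_per_s, double engine_ops_per_s)
{
    assert(engine_ops_per_s > 0.0);   /* guard against division by zero */
    return inlined_ops_per_s / engine_ops_per_s;
}
```

With the sign/s figures above, relative_throughput(2852.9, 4221.9) is roughly 0.68 for 1024 bits, relative_throughput(795.1, 1141.7) is roughly 0.70 for 2048 bits, and relative_throughput(160.3, 157.7) is roughly 1.02 for 4096 bits.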


So, I enabled the "warm-up" as you suggested, but the performance numbers still look the same.

Here is the new bn_mul_mont_t4_8():

bn_mul_mont_t4_8()
    bn_mul_mont_t4_8:       8a 10 20 00  clr       %g5
    bn_mul_mont_t4_8+0x4:   88 10 3f 80  mov       -0x80, %g4
    bn_mul_mont_t4_8+0x8:   8b 29 70 20  sllx      %g5, 0x20, %g5
    bn_mul_mont_t4_8+0xc:   9d e3 80 04  save      %sp, %g4, %sp
    bn_mul_mont_t4_8+0x10:  9d e3 bf 80  save      %sp, -0x80, %sp
    bn_mul_mont_t4_8+0x14:  9d e3 bf 80  save      %sp, -0x80, %sp
    bn_mul_mont_t4_8+0x18:  9d e3 bf 80  save      %sp, -0x80, %sp
    bn_mul_mont_t4_8+0x1c:  9d e3 bf 80  save      %sp, -0x80, %sp
    bn_mul_mont_t4_8+0x20:  9d e3 bf 80  save      %sp, -0x80, %sp
    bn_mul_mont_t4_8+0x24:  9d e3 bf 80  save      %sp, -0x80, %sp
    bn_mul_mont_t4_8+0x28:  81 e8 00 00  restore
    bn_mul_mont_t4_8+0x2c:  81 e8 00 00  restore
    bn_mul_mont_t4_8+0x30:  81 e8 00 00  restore
    bn_mul_mont_t4_8+0x34:  81 e8 00 00  restore
    bn_mul_mont_t4_8+0x38:  81 e8 00 00  restore
    bn_mul_mont_t4_8+0x3c:  81 e8 00 00  restore
    bn_mul_mont_t4_8+0x40:  88 0b a0 01  and       %sp, 0x1, %g4
    bn_mul_mont_t4_8+0x44:  bc 11 40 1e  or        %g5, %fp, %fp
    bn_mul_mont_t4_8+0x48:  8a 11 00 05  or        %g4, %g5, %g5


I see that sparct4-mont.pl has some 64-bit SPARCv9-specific code, but my 64-bit library doesn't have those instructions. It looks like the __arch64__ branch was taken. Did you expect the SPARCV9_64BIT_STACK section to be compiled in?

.globl  bn_mul_mont_t4_$NUM
.align  32
bn_mul_mont_t4_$NUM:
#ifdef  __arch64__
        mov     0,$sentinel
        mov     -128,%g4
#elif defined(SPARCV9_64BIT_STACK)
        SPARC_LOAD_ADDRESS_LEAF(OPENSSL_sparcv9cap_P,%g1,%g5)
        ld      [%g1+0],%g1     ! OPENSSL_sparcv9_P[0]
        mov     -2047,%g4
        and     %g1,SPARCV9_64BIT_STACK,%g1
        movrz   %g1,0,%g4
        mov     -1,$sentinel
        add     %g4,-128,%g4
#else
        mov     -1,$sentinel
        mov     -128,%g4
#endif
        sllx    $sentinel,32,$sentinel
        save    %sp,%g4,%sp
#if 1
        save    %sp,-128,%sp    ! warm it up
        save    %sp,-128,%sp
<-- snip--->
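
One way to double-check which branch a given compiler invocation selects is to mirror the conditional in a small C file and build it with the same flags. This is only a sketch: FRAME_BRANCH and frame_branch() are hypothetical names, and it assumes SPARCV9_64BIT_STACK would be visible to the C preprocessor the same way it is to the assembler source, which may not hold for the real build.

```c
/* Mirrors the #ifdef / #elif / #else structure at the top of
 * bn_mul_mont_t4_$NUM and records which branch was selected.
 * FRAME_BRANCH is a hypothetical macro used only for this check. */
#if defined(__arch64__)
# define FRAME_BRANCH "__arch64__"
#elif defined(SPARCV9_64BIT_STACK)
# define FRAME_BRANCH "SPARCV9_64BIT_STACK"
#else
# define FRAME_BRANCH "default"
#endif

/* Returns the name of the branch chosen at compile time. */
const char *frame_branch(void)
{
    return FRAME_BRANCH;
}
```

Printing frame_branch() from a test program built with the library's compiler flags would show whether the __arch64__ branch is what the preprocessor picks.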

Thank you,

-- misaki

I used 64-bit openssl binary to measure the performance.
With above in mind here is something to test. In
crypto/bn/asm/sparct4-mont.pl there is a register windows "warm-up"
sequence that is executed in 32-bit application context only
(benchmarking on Linux had shown that it's not necessary in 64-bit
application context). Could you test to engage it even in 64-bit
application context? I.e. open crypto/bn/asm/sparct4-mont.pl in text
editor, locate "warm it up" comment and replace "#ifndef __arch64__" in
preceding line with "#if 1".



