Howdy, kind people! Long one, huh?
> >
> >> Undefined                       first referenced
> >>  symbol                             in file
> >> bn_mul_comba4                       ../libcrypto.a(bn_mul.o)
> >> bn_mul_comba8                       ../libcrypto.a(bn_mul.o)
> >> bn_sqr_comba4                       ../libcrypto.a(bn_sqr.o)
> >> bn_sqr_comba8                       ../libcrypto.a(bn_sqr.o)
> >> bn_sub_words                        ../libcrypto.a(bn_mul.o)
> >> bn_div_words                        ../libcrypto.a(bn_word.o)
> >> ld: fatal: Symbol referencing errors. No output written to openssl
> >>
> >> Either asm/sparc.s needs to be fixed, ...
So I took the challenge and came up with original UltraSPARC
implementation you can find attached to this letter:-) Read the
questions-n-answers paragraph in the beginning of the file to figure out
what, why and how come. Have fun!

And while we're on the subject I have a couple of related comments. First have
a look at this patch:

*** Configure.orig      Fri Mar 12 21:31:13 1999
--- Configure   Sat Mar 27 13:46:51 1999
***************
*** 100,106 ****
        -lsocket -lnsl:BN_LLONG RC4_CHAR DES_PTR DES_UNROLL
BF_PTR:asm/sparc.o::",
  # SC4.0 is ok, better than gcc, except for the bignum stuff.
  # -fast slows things like DES down quite a lot
! "solaris-sparc-sc4","cc:-xO5 -Xa -DB_ENDIAN:-lsocket -lnsl:\
        BN_LLONG RC4_CHAR DES_PTR DES_RISC1 DES_UNROLL
BF_PTR:asm/sparc.o::",
  "solaris-usparc-sc4","cc:-xtarget=ultra -xarch=v8plus -Xa -xO5
-DB_ENDIAN:\
        -lsocket -lnsl:\
--- 100,106 ----
        -lsocket -lnsl:BN_LLONG RC4_CHAR DES_PTR DES_UNROLL
BF_PTR:asm/sparc.o::",
  # SC4.0 is ok, better than gcc, except for the bignum stuff.
  # -fast slows things like DES down quite a lot
! "solaris-sparc-sc4","cc:-xarch=v8 -xstrconst -xO5 -xdepend -Xa
-DB_ENDIAN -DBN_DIV2W:-lsocket -lnsl:\
        BN_LLONG RC4_CHAR DES_PTR DES_RISC1 DES_UNROLL
BF_PTR:asm/sparc.o::",
  "solaris-usparc-sc4","cc:-xtarget=ultra -xarch=v8plus -Xa -xO5
-DB_ENDIAN:\
        -lsocket -lnsl:\

1. As it says in the attached file "both SC4.x and gcc generate rather
decent code off bn_asm.c", but *only* *if* they are explicitly "instructed
to generate SPARC v8 code." Now by default gcc *is* instructed to do so
with -mv8 flag, but not SC4.x. How come? So -xarch=v8 comes into the
picture.

2. By default SC4.x puts string literal into data segment, thus
increasing runtime private footprint of the program. -xstrconst in turn
keeps 'em in read-only segment that is *shared* between multiple copies
of the program.

3. -xdepend is good, but probably redundant in openssl case...

4. Why hardware division is left out? -DBN_DIV2W takes care of this.
Well, sort of... gcc sometimes does generate udiv, sometimes doesn't,
SC4.x in turn always generates call to __udiv64 and have run-time linker
decide what it's gonna be. So it (hardware division) probably ought to
be implemented in assembler anyway... And I presumably have put
preprocessor definition in wrong line, it probably belongs in the next
one... In either case I want to tell that I don't feel comfortable with
bn_div_words. It looks to me that those functions invoking bn_div_words
would benefit more if *larger* portions of loop bodies surrounding the
call are implemented in assembler. Any opposite opinions?

And finally. It doesn't make any difference to my UltraSPARC-specific
implementation (as I exploit branches on register condition with
prediction) but in compiler generated and v8 cases unrolling loops in
slightly different way would do some extra good, e.g. in
bn_mul_add_words case:

        for (num2=num>>2,i=0;i<num2;i++) {
                mul_add(rp[0],ap[0],w,c1);
                mul_add(rp[1],ap[1],w,c1);
                mul_add(rp[2],ap[2],w,c1);
                mul_add(rp[3],ap[3],w,c1);
                ap+=4; rp+=4;
        }
        for (num&=3;;) {
                mul_add(rp[0],ap[0],w,c1);
                if (--num == 0) break;
                mul_add(rp[1],ap[1],w,c1);
                if (--num == 0) break;
                mul_add(rp[2],ap[2],w,c1);
                if (--num == 0) break;
                mul_add(rp[3],ap[3],w,c1);
                if (--num == 0) break;
                ap+=4;
                rp+=4;
        }


Cheers. Andy.
.ident  "bn_asm.sparc.v8plus.S, Version 1.0"
.ident  "SPARC v9 ISA artwork by Andy Polyakov <[EMAIL PROTECTED]>"

/*
 * ====================================================================
 * Copyright (c) 1999 Andy Polyakov <[EMAIL PROTECTED]>.
 *
 * Rights for redistribution and usage in source and binary forms are
 * granted as long as above copyright notices are retained. Warranty
 * of any kind is (of course:-) disclaimed.
 * ====================================================================
 */

/*
 * This is my modest contribution to the OpenSSL project (see
 * http://www.openssl.org/ for more information about it) and is
 * a drop-in UltraSPARC ISA replacement for crypto/bn/bn_asm.c
 * module.
 *
 * Questions-n-answers.
 *
 * Q. How to compile?
 * A. With SC4.x:
 *
 *      cc -xarch=v8plus -c bn_asm.sparc.v8plus.S -o bn_asm.o
 *
 *    and with gcc:
 *
 *      gcc -Wa,-xarch=v8plus -c bn_asm.sparc.v8plus.S -o bn_asm.o
 *
 *    Quick-n-dirty way to fuse the module into the library.
 *    Provided that the library is already configured and built
 *    (in 0.9.2 case with no_asm option):
 *
 *      # cd crypto/bn
 *      # cp /some/place/bn_asm.sparc.v8plus.S .
 *      # cc -xarch=v8plus -c bn_asm.sparc.v8plus.S -o bn_asm.o
 *      # make
 *      # cd ../..
 *      # make; make test
 *
 *    Quick-n-dirty way to get rid of it:
 *
 *      # cd crypto/bn
 *      # touch bn_asm.c
 *      # make
 *      # cd ../..
 *      # make; make test
 *
 * Q. Why just UltraSPARC? What about SuperSPARC?
 * A. When instructed to generate SPARC v8 and later code both SC4.x
 *    and gcc generate rather decent code off bn_asm.c and I don't
 *    find it worth the effort to implement it in assembler. As for
 *    UltraSPARC none of the available compilers (not even SC5.0)
 *    attempt to take advantage of 64-bit registers under 32-bit kernels
 *    even though it's perfectly possible (see next question). Well,
 *    I actually am not being 100% sincere here, because the largest
 *    functions found here, namely bn_*_comba[48], are pure SPARC v8
 *    (see comment later in code for explanation) and you *should* 
 *    benefit from the assembler implementation even on good-n-old 
 *    SuperSPARC. If there's an actual need for a complete v8
 *    implementation, I'd recommend to have C compiler generate
 *    assembler listing and copy-n-paste generated code for functions
 *    other than bn_*_comba[48].
 *
 * Q. 64-bit registers under 32-bit kernels? Does it work?
 * A. You can't address *all* registers as 64-bit wide, but only
 *    %o0-%o5 and %g1-%g4 *and* only in leaf functions, i.e. those
 *    that never call any other functions. All functions in this module
 *    are leaf and 10 registers is a handful. As a matter of fact the
 *    "comba" routines don't even require that much and I could even
 *    afford to not allocate own stack frame for 'em:-)
 *
 * Q. What about 64-bit kernels?
 * A. What about 'em? Just kidding:-) I unfortunately never had a
 *    chance to test it, but the below code is 64-bit safe and you
 *    shouldn't have any problem with it. What I probably am saying
 *    here is that I appreciate feedback on the matter... And yes,
 *    you have to feed compiler with -xarch=v9 command line option
 *    instead of -xarch=v8plus.
 *
 * Q. What about sharable libraries?
 * A. What about 'em? Kidding again:-) Code does *not* contain any
 *    code position dependencies and it's safe to include it into
 *    sharable library as is.
 *
 * Q. How much faster does it get?
 * A. Do you have good benchmark? In either case I experience 25-30%
 *    improvement on UltraSPARC-1 with crypto/bn/expspeed.c test
 *    program. I used SC4.2 with -xarch=v8 -xstrconst -xO5 -xdepend
 *    to compile bn_asm.c used as a reference.
 *
 */

/*
 * Basically the only difference between 32-bit and 64-bit versions
 * is size of minimal stack frame that subroutine should allocate.
 */
#ifdef __sparcv9
#define FRAME_SIZE      -192
#else
#define FRAME_SIZE      -96
#endif

.section        ".text",#alloc,#execinstr
.file           "bn_asm.sparc.v8plus.S"

.align  32
.global bn_mul_add_words
/*
 * BN_ULONG bn_mul_add_words(rp,ap,num,w)
 * BN_ULONG *rp,*ap;
 * int num;
 * BN_ULONG w;
 *
 * Register usage map:
 *      %o0     = rp
 *      %o1     = ap
 *      %o2     = num
 *      %o3     = w
 *      %o4     = *rp
 *      %o5     = *ap
 *      %g1     = carry
 */
bn_mul_add_words:
        cmp     %o2,0
        bg,a    %icc,.L_bn_mul_add_words_proceed
        lduw    [%o1],%o5
        retl
        clr     %o0

.L_bn_mul_add_words_proceed:
        clruw   %o3
        lduw    [%o0],%o4
        mulx    %o3,%o5,%o5
        dec     %o2
        add     %o4,%o5,%o4
        nop
        srlx    %o4,32,%g1
        brz,pn  %o2,.L_bn_mul_add_words_ret
        stuw    %o4,[%o0]
        nop

        lduw    [%o1+4],%o5
.L_bn_mul_add_words_loop:
        mulx    %o3,%o5,%o5
        lduw    [%o0+4],%o4
        add     %o5,%g1,%o5
        dec     %o2
        add     %o4,%o5,%o4
        srlx    %o4,32,%g1
        brz,pn  %o2,.L_bn_mul_add_words_ret
        stuw    %o4,[%o0+4]

        lduw    [%o1+8],%o5
        mulx    %o3,%o5,%o5
        lduw    [%o0+8],%o4
        add     %o5,%g1,%o5
        dec     %o2
        add     %o4,%o5,%o4
        srlx    %o4,32,%g1
        brz,pn  %o2,.L_bn_mul_add_words_ret
        stuw    %o4,[%o0+8]

        lduw    [%o1+12],%o5
        mulx    %o3,%o5,%o5
        lduw    [%o0+12],%o4
        add     %o5,%g1,%o5
        dec     %o2
        add     %o4,%o5,%o4
        srlx    %o4,32,%g1
        brz,pn  %o2,.L_bn_mul_add_words_ret
        stuw    %o4,[%o0+12]

        lduw    [%o1+16],%o5
        inc     16,%o0
        mulx    %o3,%o5,%o5
        lduw    [%o0],%o4
        add     %o5,%g1,%o5
        dec     %o2
        add     %o4,%o5,%o4
        inc     16,%o1
        srlx    %o4,32,%g1
        stuw    %o4,[%o0]
        brnz,a  %o2,.L_bn_mul_add_words_loop
        lduw    [%o1+4],%o5

.L_bn_mul_add_words_ret:
        retl
        mov     %g1,%o0

.type   bn_mul_add_words,2
.size   bn_mul_add_words,(.-bn_mul_add_words)

.align  32
.global bn_mul_words
/*
 * BN_ULONG bn_mul_words(rp,ap,num,w)
 * BN_ULONG *rp,*ap;
 * int num;
 * BN_ULONG w;
 *
 * See bn_mul_add_words for register map.
 */
bn_mul_words:
        cmp     %o2,0
        bg,a    %icc,.L_bn_mul_words_proceed
        lduw    [%o1],%o5
        retl
        clr     %o0
        nop
        nop

.L_bn_mul_words_proceed:
        clruw   %o3
        nop

        mulx    %o3,%o5,%o4
        dec     %o2
        srlx    %o4,32,%g1
        brz,pn  %o2,.L_bn_mul_words_ret
        stuw    %o4,[%o0]
        nop

        lduw    [%o1+4],%o5
.L_bn_mul_words_loop:
        mulx    %o3,%o5,%o4
        dec     %o2
        add     %o4,%g1,%o4
        srlx    %o4,32,%g1
        brz,pn  %o2,.L_bn_mul_words_ret
        stuw    %o4,[%o0+4]

        lduw    [%o1+8],%o5
        mulx    %o3,%o5,%o4
        dec     %o2
        add     %o4,%g1,%o4
        srlx    %o4,32,%g1
        brz,pn  %o2,.L_bn_mul_words_ret
        stuw    %o4,[%o0+8]

        lduw    [%o1+12],%o5
        inc     16,%o1
        mulx    %o3,%o5,%o4
        dec     %o2
        add     %o4,%g1,%o4
        srlx    %o4,32,%g1
        brz,pn  %o2,.L_bn_mul_words_ret
        stuw    %o4,[%o0+12]

        lduw    [%o1],%o5
        inc     16,%o0
        mulx    %o3,%o5,%o4
        dec     %o2
        add     %o4,%g1,%o4
        srlx    %o4,32,%g1
        stuw    %o4,[%o0]
        brnz,a  %o2,.L_bn_mul_words_loop
        lduw    [%o1+4],%o5

.L_bn_mul_words_ret:
        retl
        mov     %g1,%o0

.type   bn_mul_words,2
.size   bn_mul_words,(.-bn_mul_words)

.align  32
.global bn_sqr_words
/*
 * void bn_sqr_words(r,a,n)
 * BN_ULONG *r,*a;
 * int n;
 *
 * Register usage map:
 *      %o0     = rp
 *      %o1     = ap
 *      %o2     = num
 *      %o4     = *rp
 *      %o5     = *ap
 */
bn_sqr_words:
        cmp     %o2,0
        bg,a    %icc,.L_bn_sqr_words_proceed
        lduw    [%o1],%o5
        retl
        clr     %o0
        nop
        nop
        nop
        nop
        
.L_bn_sqr_words_proceed:
        mulx    %o5,%o5,%o4
        dec     %o2
        stuw    %o4,[%o0]
        srlx    %o4,32,%o3
        brz,pn  %o2,.L_bn_sqr_words_ret
        stuw    %o3,[%o0+4]

        lduw    [%o1+4],%o5
.L_bn_sqr_words_loop:
        mulx    %o5,%o5,%o4
        dec     %o2
        stuw    %o4,[%o0+8]
        srlx    %o4,32,%o3
        brz,pn  %o2,.L_bn_sqr_words_ret
        stuw    %o3,[%o0+12]

        lduw    [%o1+8],%o5
        mulx    %o5,%o5,%o4
        dec     %o2
        stuw    %o4,[%o0+16]
        srlx    %o4,32,%o3
        brz,pn  %o2,.L_bn_sqr_words_ret
        stuw    %o3,[%o0+20]

        lduw    [%o1+12],%o5
        mulx    %o5,%o5,%o4
        dec     %o2
        stuw    %o4,[%o0+24]
        srlx    %o4,32,%o3
        inc     16,%o1
        brz,pn  %o2,.L_bn_sqr_words_ret
        stuw    %o3,[%o0+28]

        lduw    [%o1],%o5
        inc     32,%o0
        mulx    %o5,%o5,%o4
        dec     %o2
        stuw    %o4,[%o0]
        srlx    %o4,32,%o3
        stuw    %o3,[%o0+4]
        brnz,a  %o2,.L_bn_sqr_words_loop
        lduw    [%o1+4],%o5

.L_bn_sqr_words_ret:
        retl
        clr     %o0

.type   bn_sqr_words,2
.size   bn_sqr_words,(.-bn_sqr_words)

.align  32
.global bn_div_words
/*
 * BN_ULONG bn_div_words(h,l,d)
 * BN_ULONG h,l,d;
 */
bn_div_words:
        sllx    %o0,32,%o0
        or      %o0,%o1,%o0
        udivx   %o0,%o2,%o0
        retl
        clruw   %o0

.type   bn_div_words,2
.size   bn_div_words,(.-bn_div_words)

.align  32
.global bn_add_words
/*
 * BN_ULONG bn_add_words(rp,ap,bp,n)
 * BN_ULONG *rp,*ap,*bp;
 * int n;
 *
 * Register usage map:
 *      %o0     = rp
 *      %o1     = ap
 *      %o2     = bp
 *      %o3     = num
 *      %o4     = *ap
 *      %o5     = *bp
 */
bn_add_words:
        cmp     %o3,0
        bg,a    %icc,.L_bn_add_words_proceed
        lduw    [%o1],%o4
        retl
        clr     %o0

.L_bn_add_words_proceed:
        lduw    [%o2],%o5
        dec     %o3
        addcc   %o5,%o4,%o5
        brz,pn  %o3,.L_bn_add_words_ret
        stuw    %o5,[%o0]

        lduw    [%o1+4],%o4
.L_bn_add_words_loop:
        dec     %o3
        lduw    [%o2+4],%o5
        addccc  %o5,%o4,%o5
        brz,pn  %o3,.L_bn_add_words_ret
        stuw    %o5,[%o0+4]

        lduw    [%o1+8],%o4
        dec     %o3
        lduw    [%o2+8],%o5
        addccc  %o5,%o4,%o5
        brz,pn  %o3,.L_bn_add_words_ret
        stuw    %o5,[%o0+8]

        lduw    [%o1+12],%o4
        dec     %o3
        lduw    [%o2+12],%o5
        inc     16,%o1
        addccc  %o5,%o4,%o5
        brz,pn  %o3,.L_bn_add_words_ret
        stuw    %o5,[%o0+12]

        inc     16,%o2
        lduw    [%o1],%o4
        inc     16,%o0
        lduw    [%o2],%o5
        dec     %o3
        addccc  %o5,%o4,%o5
        stuw    %o5,[%o0]
        brnz,a  %o3,.L_bn_add_words_loop
        lduw    [%o1+4],%o4

.L_bn_add_words_ret:
        clr     %o0
        retl
        movcs   %icc,1,%o0

.type   bn_add_words,2
.size   bn_add_words,(.-bn_add_words)

.align  32
.global bn_sub_words
/*
 * BN_ULONG bn_sub_words(rp,ap,bp,n)
 * BN_ULONG *rp,*ap,*bp;
 * int n;
 *
 * Register usage map:
 *      %o0     = rp
 *      %o1     = ap
 *      %o2     = bp
 *      %o3     = num
 *      %o4     = *ap
 *      %o5     = *bp
 */
bn_sub_words:
        cmp     %o3,0
        bg,a    %icc,.L_bn_sub_words_proceed
        lduw    [%o1],%o4
        retl
        clr     %o0

.L_bn_sub_words_proceed:
        lduw    [%o2],%o5
        dec     %o3
        subcc   %o4,%o5,%o5
        brz,pn  %o3,.L_bn_sub_words_ret
        stuw    %o5,[%o0]

        lduw    [%o1+4],%o4
.L_bn_sub_words_loop:
        dec     %o3
        lduw    [%o2+4],%o5
        subccc  %o4,%o5,%o5
        brz,pn  %o3,.L_bn_sub_words_ret
        stuw    %o5,[%o0+4]

        lduw    [%o1+8],%o4
        dec     %o3
        lduw    [%o2+8],%o5
        subccc  %o4,%o5,%o5
        brz,pn  %o3,.L_bn_sub_words_ret
        stuw    %o5,[%o0+8]

        lduw    [%o1+12],%o4
        dec     %o3
        lduw    [%o2+12],%o5
        inc     16,%o1
        subccc  %o4,%o5,%o5
        brz,pn  %o3,.L_bn_sub_words_ret
        stuw    %o5,[%o0+12]

        inc     16,%o2
        lduw    [%o1],%o4
        inc     16,%o0
        lduw    [%o2],%o5
        dec     %o3
        subccc  %o4,%o5,%o5
        stuw    %o5,[%o0]
        brnz,a  %o3,.L_bn_sub_words_loop
        lduw    [%o1+4],%o4

.L_bn_sub_words_ret:
        clr     %o0
        retl
        movcs   %icc,1,%o0

.type   bn_sub_words,2
.size   bn_sub_words,(.-bn_sub_words)

/*
 * Following code is pure SPARC V8! Trouble is that it's not feasible
 * to implement the mumbo-jumbo in less "V9" instructions:-( At least not
 * under 32-bit kernel. The reason is that you'd have to shuffle registers
 * all the time as only few (well, 10:-) are fully (i.e. all 64 bits)
 * preserved by kernel during context switch. But even under 64-bit kernel
 * you won't gain much because in the lack of "add with extended carry"
 * instruction you'd have to issue 'clr %rx; movcs %xcc,1,%rx;
 * add %rd,%rx,%rd' sequence in place of 'addxcc %rx,%ry,%rx;
 * addx %rz,%g0,%rz' pair in 32-bit case.
 *
 *                                                      Andy.
 */

/*
 * Here is register usage map for *all* routines below.
 */
#define a_0     %l0
#define a_0_    [%i1]
#define a_1     %l1
#define a_1_    [%i1+4]
#define a_2     %l2
#define a_2_    [%i1+8]
#define a_3     %l3
#define a_3_    [%i1+12]
#define a_4     %l4
#define a_4_    [%i1+16]
#define a_5     %l5
#define a_5_    [%i1+20]
#define a_6     %l6
#define a_6_    [%i1+24]
#define a_7     %l7
#define a_7_    [%i1+28]
#define b_0     %g1
#define b_0_    [%i2]
#define b_1     %g2
#define b_1_    [%i2+4]
#define b_2     %g3
#define b_2_    [%i2+8]
#define b_3     %g4
#define b_3_    [%i2+12]
#define b_4     %i3
#define b_4_    [%i2+16]
#define b_5     %i4
#define b_5_    [%i2+20]
#define b_6     %i5
#define b_6_    [%i2+24]
#define b_7     %o5
#define b_7_    [%i2+28]
#define c_1     %o2
#define c_2     %o3
#define c_3     %o4
#define t_1     %o0
#define t_2     %o1

.align  32
.global bn_mul_comba8
/*
 * void bn_mul_comba8(r,a,b)
 * BN_ULONG *r,*a,*b;
 */
bn_mul_comba8:
        save    %sp,FRAME_SIZE,%sp
        ld      a_0_,a_0
        ld      b_0_,b_0
        umul    a_0,b_0,c_1     !mul_add_c(a[0],b[0],c1,c2,c3);
        ld      b_1_,b_1
        rd      %y,c_2
        st      c_1,[%i0]       !r[0]=c1;

        umul    a_0,b_1,t_1     !mul_add_c(a[0],b[1],c2,c3,c1);
        ld      a_1_,a_1
        addcc   c_2,t_1,c_2
        rd      %y,t_2
        addxcc  %g0,t_2,c_3
        addx    %g0,%g0,c_1
        umul    a_1,b_0,t_1     !mul_add_c(a[1],b[0],c2,c3,c1);
        ld      a_2_,a_2
        addcc   c_2,t_1,c_2
        rd      %y,t_2
        addxcc  c_3,t_2,c_3
        addx    c_1,%g0,c_1
        st      c_2,[%i0+4]     !r[1]=c2;

        umul    a_2,b_0,t_1     !mul_add_c(a[2],b[0],c3,c1,c2);
        addcc   c_3,t_1,c_3
        rd      %y,t_2
        addxcc  c_1,t_2,c_1
        addx    %g0,%g0,c_2
        umul    a_1,b_1,t_1     !mul_add_c(a[1],b[1],c3,c1,c2);
        ld      b_2_,b_2
        addcc   c_3,t_1,c_3
        rd      %y,t_2
        addxcc  c_1,t_2,c_1
        addx    c_2,%g0,c_2
        umul    a_0,b_2,t_1     !mul_add_c(a[0],b[2],c3,c1,c2);
        ld      b_3_,b_3
        addcc   c_3,t_1,c_3
        rd      %y,t_2
        addxcc  c_1,t_2,c_1
        addx    c_2,%g0,c_2
        st      c_3,[%i0+8]     !r[2]=c3;

        umul    a_0,b_3,t_1     !mul_add_c(a[0],b[3],c1,c2,c3);
        addcc   c_1,t_1,c_1
        rd      %y,t_2
        addxcc  c_2,t_2,c_2
        addx    %g0,%g0,c_3
        umul    a_1,b_2,t_1     !mul_add_c(a[1],b[2],c1,c2,c3);
        addcc   c_1,t_1,c_1
        rd      %y,t_2
        addxcc  c_2,t_2,c_2
        addx    c_3,%g0,c_3
        umul    a_2,b_1,t_1     !mul_add_c(a[2],b[1],c1,c2,c3);
        ld      a_3_,a_3
        addcc   c_1,t_1,c_1
        rd      %y,t_2
        addxcc  c_2,t_2,c_2
        addx    c_3,%g0,c_3
        umul    a_3,b_0,t_1     !mul_add_c(a[3],b[0],c1,c2,c3);
        ld      a_4_,a_4
        addcc   c_1,t_1,c_1
        rd      %y,t_2
        addxcc  c_2,t_2,c_2
        addx    c_3,%g0,c_3
        st      c_1,[%i0+12]    !r[3]=c1;

        umul    a_4,b_0,t_1     !mul_add_c(a[4],b[0],c2,c3,c1);
        addcc   c_2,t_1,c_2
        rd      %y,t_2
        addxcc  c_3,t_2,c_3
        addx    %g0,%g0,c_1
        umul    a_3,b_1,t_1     !mul_add_c(a[3],b[1],c2,c3,c1);
        addcc   c_2,t_1,c_2
        rd      %y,t_2
        addxcc  c_3,t_2,c_3
        addx    c_1,%g0,c_1
        umul    a_2,b_2,t_1     !mul_add_c(a[2],b[2],c2,c3,c1);
        addcc   c_2,t_1,c_2
        rd      %y,t_2
        addxcc  c_3,t_2,c_3
        addx    c_1,%g0,c_1
        umul    a_1,b_3,t_1     !mul_add_c(a[1],b[3],c2,c3,c1);
        ld      b_4_,b_4
        addcc   c_2,t_1,c_2
        rd      %y,t_2
        addxcc  c_3,t_2,c_3
        addx    c_1,%g0,c_1
        umul    a_0,b_4,t_1     !mul_add_c(a[0],b[4],c2,c3,c1);
        ld      b_5_,b_5
        addcc   c_2,t_1,c_2
        rd      %y,t_2
        addxcc  c_3,t_2,c_3
        addx    c_1,%g0,c_1
        st      c_2,[%i0+16]    !r[4]=c2;

        umul    a_0,b_5,t_1     !mul_add_c(a[0],b[5],c3,c1,c2);
        addcc   c_3,t_1,c_3
        rd      %y,t_2
        addxcc  c_1,t_2,c_1
        addx    %g0,%g0,c_2
        umul    a_1,b_4,t_1     !mul_add_c(a[1],b[4],c3,c1,c2);
        addcc   c_3,t_1,c_3
        rd      %y,t_2
        addxcc  c_1,t_2,c_1
        addx    c_2,%g0,c_2
        umul    a_2,b_3,t_1     !mul_add_c(a[2],b[3],c3,c1,c2);
        addcc   c_3,t_1,c_3
        rd      %y,t_2
        addxcc  c_1,t_2,c_1
        addx    c_2,%g0,c_2
        umul    a_3,b_2,t_1     !mul_add_c(a[3],b[2],c3,c1,c2);
        addcc   c_3,t_1,c_3
        rd      %y,t_2
        addxcc  c_1,t_2,c_1
        addx    c_2,%g0,c_2
        umul    a_4,b_1,t_1     !mul_add_c(a[4],b[1],c3,c1,c2);
        ld      a_5_,a_5
        addcc   c_3,t_1,c_3
        rd      %y,t_2
        addxcc  c_1,t_2,c_1
        addx    c_2,%g0,c_2
        umul    a_5,b_0,t_1     !mul_add_c(a[5],b[0],c3,c1,c2);
        ld      a_6_,a_6
        addcc   c_3,t_1,c_3
        rd      %y,t_2
        addxcc  c_1,t_2,c_1
        addx    c_2,%g0,c_2
        st      c_3,[%i0+20]    !r[5]=c3;

        umul    a_6,b_0,t_1     !mul_add_c(a[6],b[0],c1,c2,c3);
        addcc   c_1,t_1,c_1
        rd      %y,t_2
        addxcc  c_2,t_2,c_2
        addx    %g0,%g0,c_3
        umul    a_5,b_1,t_1     !mul_add_c(a[5],b[1],c1,c2,c3);
        addcc   c_1,t_1,c_1
        rd      %y,t_2
        addxcc  c_2,t_2,c_2
        addx    c_3,%g0,c_3
        umul    a_4,b_2,t_1     !mul_add_c(a[4],b[2],c1,c2,c3);
        addcc   c_1,t_1,c_1
        rd      %y,t_2
        addxcc  c_2,t_2,c_2
        addx    c_3,%g0,c_3
        umul    a_3,b_3,t_1     !mul_add_c(a[3],b[3],c1,c2,c3);
        addcc   c_1,t_1,c_1
        rd      %y,t_2
        addxcc  c_2,t_2,c_2
        addx    c_3,%g0,c_3
        umul    a_2,b_4,t_1     !mul_add_c(a[2],b[4],c1,c2,c3);
        addcc   c_1,t_1,c_1
        rd      %y,t_2
        addxcc  c_2,t_2,c_2
        addx    c_3,%g0,c_3
        umul    a_1,b_5,t_1     !mul_add_c(a[1],b[5],c1,c2,c3);
        ld      b_6_,b_6
        addcc   c_1,t_1,c_1
        rd      %y,t_2
        addxcc  c_2,t_2,c_2
        addx    c_3,%g0,c_3
        umul    a_0,b_6,t_1     !mul_add_c(a[0],b[6],c1,c2,c3);
        ld      b_7_,b_7
        addcc   c_1,t_1,c_1
        rd      %y,t_2
        addxcc  c_2,t_2,c_2
        addx    c_3,%g0,c_3
        st      c_1,[%i0+24]    !r[6]=c1;

        umul    a_0,b_7,t_1     !mul_add_c(a[0],b[7],c2,c3,c1);
        addcc   c_2,t_1,c_2
        rd      %y,t_2
        addxcc  c_3,t_2,c_3
        addx    %g0,%g0,c_1
        umul    a_1,b_6,t_1     !mul_add_c(a[1],b[6],c2,c3,c1);
        addcc   c_2,t_1,c_2
        rd      %y,t_2
        addxcc  c_3,t_2,c_3
        addx    c_1,%g0,c_1
        umul    a_2,b_5,t_1     !mul_add_c(a[2],b[5],c2,c3,c1);
        addcc   c_2,t_1,c_2
        rd      %y,t_2
        addxcc  c_3,t_2,c_3
        addx    c_1,%g0,c_1
        umul    a_3,b_4,t_1     !mul_add_c(a[3],b[4],c2,c3,c1);
        addcc   c_2,t_1,c_2
        rd      %y,t_2
        addxcc  c_3,t_2,c_3
        addx    c_1,%g0,c_1
        umul    a_4,b_3,t_1     !mul_add_c(a[4],b[3],c2,c3,c1);
        addcc   c_2,t_1,c_2
        rd      %y,t_2
        addxcc  c_3,t_2,c_3
        addx    c_1,%g0,c_1
        umul    a_5,b_2,t_1     !mul_add_c(a[5],b[2],c2,c3,c1);
        addcc   c_2,t_1,c_2
        rd      %y,t_2
        addxcc  c_3,t_2,c_3
        addx    c_1,%g0,c_1
        umul    a_6,b_1,t_1     !mul_add_c(a[6],b[1],c2,c3,c1);
        ld      a_7_,a_7
        addcc   c_2,t_1,c_2
        rd      %y,t_2
        addxcc  c_3,t_2,c_3
        addx    c_1,%g0,c_1
        umul    a_7,b_0,t_1     !mul_add_c(a[7],b[0],c2,c3,c1);
        addcc   c_2,t_1,c_2
        rd      %y,t_2
        addxcc  c_3,t_2,c_3
        addx    c_1,%g0,c_1
        st      c_2,[%i0+28]    !r[7]=c2;

        umul    a_7,b_1,t_1     !mul_add_c(a[7],b[1],c3,c1,c2);
        addcc   c_3,t_1,c_3
        rd      %y,t_2
        addxcc  c_1,t_2,c_1
        addx    %g0,%g0,c_2
        umul    a_6,b_2,t_1     !mul_add_c(a[6],b[2],c3,c1,c2);
        addcc   c_3,t_1,c_3
        rd      %y,t_2
        addxcc  c_1,t_2,c_1
        addx    c_2,%g0,c_2
        umul    a_5,b_3,t_1     !mul_add_c(a[5],b[3],c3,c1,c2);
        addcc   c_3,t_1,c_3
        rd      %y,t_2
        addxcc  c_1,t_2,c_1
        addx    c_2,%g0,c_2
        umul    a_4,b_4,t_1     !mul_add_c(a[4],b[4],c3,c1,c2);
        addcc   c_3,t_1,c_3
        rd      %y,t_2
        addxcc  c_1,t_2,c_1
        addx    c_2,%g0,c_2
        umul    a_3,b_5,t_1     !mul_add_c(a[3],b[5],c3,c1,c2);
        addcc   c_3,t_1,c_3
        rd      %y,t_2
        addxcc  c_1,t_2,c_1
        addx    c_2,%g0,c_2
        umul    a_2,b_6,t_1     !mul_add_c(a[2],b[6],c3,c1,c2);
        addcc   c_3,t_1,c_3
        rd      %y,t_2
        addxcc  c_1,t_2,c_1
        addx    c_2,%g0,c_2
        umul    a_1,b_7,t_1     !mul_add_c(a[1],b[7],c3,c1,c2);
        addcc   c_3,t_1,c_3
        rd      %y,t_2
        addxcc  c_1,t_2,c_1
        addx    c_2,%g0,c_2
        st      c_3,[%i0+32]    !r[8]=c3;

        umul    a_2,b_7,t_1     !mul_add_c(a[2],b[7],c1,c2,c3);
        addcc   c_1,t_1,c_1
        rd      %y,t_2
        addxcc  c_2,t_2,c_2
        addx    %g0,%g0,c_3
        umul    a_3,b_6,t_1     !mul_add_c(a[3],b[6],c1,c2,c3);
        addcc   c_1,t_1,c_1
        rd      %y,t_2
        addxcc  c_2,t_2,c_2
        addx    c_3,%g0,c_3
        umul    a_4,b_5,t_1     !mul_add_c(a[4],b[5],c1,c2,c3);
        addcc   c_1,t_1,c_1
        rd      %y,t_2
        addxcc  c_2,t_2,c_2
        addx    c_3,%g0,c_3
        umul    a_5,b_4,t_1     !mul_add_c(a[5],b[4],c1,c2,c3);
        addcc   c_1,t_1,c_1
        rd      %y,t_2
        addxcc  c_2,t_2,c_2
        addx    c_3,%g0,c_3
        umul    a_6,b_3,t_1     !mul_add_c(a[6],b[3],c1,c2,c3);
        addcc   c_1,t_1,c_1
        rd      %y,t_2
        addxcc  c_2,t_2,c_2
        addx    c_3,%g0,c_3
        umul    a_7,b_2,t_1     !mul_add_c(a[7],b[2],c1,c2,c3);
        addcc   c_1,t_1,c_1
        rd      %y,t_2
        addxcc  c_2,t_2,c_2
        addx    c_3,%g0,c_3
        st      c_1,[%i0+36]    !r[9]=c1;

        umul    a_7,b_3,t_1     !mul_add_c(a[7],b[3],c2,c3,c1);
        addcc   c_2,t_1,c_2
        rd      %y,t_2
        addxcc  c_3,t_2,c_3
        addx    %g0,%g0,c_1
        umul    a_6,b_4,t_1     !mul_add_c(a[6],b[4],c2,c3,c1);
        addcc   c_2,t_1,c_2
        rd      %y,t_2
        addxcc  c_3,t_2,c_3
        addx    c_1,%g0,c_1
        umul    a_5,b_5,t_1     !mul_add_c(a[5],b[5],c2,c3,c1);
        addcc   c_2,t_1,c_2
        rd      %y,t_2
        addxcc  c_3,t_2,c_3
        addx    c_1,%g0,c_1
        umul    a_4,b_6,t_1     !mul_add_c(a[4],b[6],c2,c3,c1);
        addcc   c_2,t_1,c_2
        rd      %y,t_2
        addxcc  c_3,t_2,c_3
        addx    c_1,%g0,c_1
        umul    a_3,b_7,t_1     !mul_add_c(a[3],b[7],c2,c3,c1);
        addcc   c_2,t_1,c_2
        rd      %y,t_2
        addxcc  c_3,t_2,c_3
        addx    c_1,%g0,c_1
        st      c_2,[%i0+40]    !r[10]=c2;

        umul    a_4,b_7,t_1     !mul_add_c(a[4],b[7],c3,c1,c2);
        addcc   c_3,t_1,c_3
        rd      %y,t_2
        addxcc  c_1,t_2,c_1
        addx    %g0,%g0,c_2
        umul    a_5,b_6,t_1     !mul_add_c(a[5],b[6],c3,c1,c2);
        addcc   c_3,t_1,c_3
        rd      %y,t_2
        addxcc  c_1,t_2,c_1
        addx    c_2,%g0,c_2
        umul    a_6,b_5,t_1     !mul_add_c(a[6],b[5],c3,c1,c2);
        addcc   c_3,t_1,c_3
        rd      %y,t_2
        addxcc  c_1,t_2,c_1
        addx    c_2,%g0,c_2
        umul    a_7,b_4,t_1     !mul_add_c(a[7],b[4],c3,c1,c2);
        addcc   c_3,t_1,c_3
        rd      %y,t_2
        addxcc  c_1,t_2,c_1
        addx    c_2,%g0,c_2
        st      c_3,[%i0+44]    !r[11]=c3;

        umul    a_7,b_5,t_1     !mul_add_c(a[7],b[5],c1,c2,c3);
        addcc   c_1,t_1,c_1
        rd      %y,t_2
        addxcc  c_2,t_2,c_2
        addx    %g0,%g0,c_3
        umul    a_6,b_6,t_1     !mul_add_c(a[6],b[6],c1,c2,c3);
        addcc   c_1,t_1,c_1
        rd      %y,t_2
        addxcc  c_2,t_2,c_2
        addx    c_3,%g0,c_3
        umul    a_5,b_7,t_1     !mul_add_c(a[5],b[7],c1,c2,c3);
        addcc   c_1,t_1,c_1
        rd      %y,t_2
        addxcc  c_2,t_2,c_2
        addx    c_3,%g0,c_3
        st      c_1,[%i0+48]    !r[12]=c1;

        umul    a_6,b_7,t_1     !mul_add_c(a[6],b[7],c2,c3,c1);
        addcc   c_2,t_1,c_2
        rd      %y,t_2
        addxcc  c_3,t_2,c_3
        addx    %g0,%g0,c_1
        umul    a_7,b_6,t_1     !mul_add_c(a[7],b[6],c2,c3,c1);
        addcc   c_2,t_1,c_2
        rd      %y,t_2
        addxcc  c_3,t_2,c_3
        addx    c_1,%g0,c_1
        st      c_2,[%i0+52]    !r[13]=c2;

        umul    a_7,b_7,t_1     !mul_add_c(a[7],b[7],c3,c1,c2);
        addcc   c_3,t_1,c_3
        rd      %y,t_2
        addxcc  c_1,t_2,c_1
        st      c_3,[%i0+56]    !r[14]=c3;
        st      c_1,[%i0+60]    !r[15]=c1;

        ret
        restore %g0,%g0,%o0

.type   bn_mul_comba8,2
.size   bn_mul_comba8,(.-bn_mul_comba8)

.align  32

.global bn_mul_comba4
/*
 * void bn_mul_comba4(r,a,b)
 * BN_ULONG *r,*a,*b;
 *
 * Fully unrolled Comba (column-oriented) multiplication:
 * r[0..7] = a[0..3] * b[0..3].  All partial products contributing to
 * one result word are accumulated before that word is stored, so only
 * three rotating accumulators (c_1,c_2,c_3) are needed.
 *
 * Operand names (a_0, a_0_, b_0, c_1, t_1, FRAME_SIZE, ...) are macros
 * defined earlier in this file -- presumably a_N_/b_N_ expand to the
 * memory operands for a[N]/b[N] off %i1/%i2; r is %i0 (see the stores
 * below).  TODO(review): confirm against the macro definitions above.
 *
 * The recurring 5-instruction group implements mul_add_c(a,b,cX,cY,cZ):
 *      umul    a,b,t_1         ! t_1 = low 32 bits of a*b, %y = high 32
 *      addcc   cX,t_1,cX       ! cX += lo, set carry
 *      rd      %y,t_2          ! fetch high half (rd leaves icc intact)
 *      addxcc  cY,t_2,cY       ! cY += hi + carry-in, set carry
 *      addx    cZ,%g0,cZ       ! fold final carry into cZ
 * Loads of upcoming a[]/b[] words are interleaved into the chain to
 * hide load latency; they do not disturb the condition codes.
 */
bn_mul_comba4:
        save    %sp,FRAME_SIZE,%sp      ! new register window; %i0=r,%i1=a,%i2=b
        ld      a_0_,a_0
        ld      b_0_,b_0
        umul    a_0,b_0,c_1     !mul_add_c(a[0],b[0],c1,c2,c3);
        ld      b_1_,b_1
        rd      %y,c_2
        st      c_1,[%i0]       !r[0]=c1;

        umul    a_0,b_1,t_1     !mul_add_c(a[0],b[1],c2,c3,c1);
        ld      a_1_,a_1
        addcc   c_2,t_1,c_2
        rd      %y,t_2
        addxcc  %g0,t_2,c_3     ! c3 starts from zero this column
        addx    %g0,%g0,c_1     ! c1 restarts from the carry alone
        umul    a_1,b_0,t_1     !mul_add_c(a[1],b[0],c2,c3,c1);
        ld      a_2_,a_2
        addcc   c_2,t_1,c_2
        rd      %y,t_2
        addxcc  c_3,t_2,c_3
        addx    c_1,%g0,c_1
        st      c_2,[%i0+4]     !r[1]=c2;

        umul    a_2,b_0,t_1     !mul_add_c(a[2],b[0],c3,c1,c2);
        addcc   c_3,t_1,c_3
        rd      %y,t_2
        addxcc  c_1,t_2,c_1
        addx    %g0,%g0,c_2
        umul    a_1,b_1,t_1     !mul_add_c(a[1],b[1],c3,c1,c2);
        ld      b_2_,b_2
        addcc   c_3,t_1,c_3
        rd      %y,t_2
        addxcc  c_1,t_2,c_1
        addx    c_2,%g0,c_2
        umul    a_0,b_2,t_1     !mul_add_c(a[0],b[2],c3,c1,c2);
        ld      b_3_,b_3
        addcc   c_3,t_1,c_3
        rd      %y,t_2
        addxcc  c_1,t_2,c_1
        addx    c_2,%g0,c_2
        st      c_3,[%i0+8]     !r[2]=c3;

        umul    a_0,b_3,t_1     !mul_add_c(a[0],b[3],c1,c2,c3);
        addcc   c_1,t_1,c_1
        rd      %y,t_2
        addxcc  c_2,t_2,c_2
        addx    %g0,%g0,c_3
        umul    a_1,b_2,t_1     !mul_add_c(a[1],b[2],c1,c2,c3);
        addcc   c_1,t_1,c_1
        rd      %y,t_2
        addxcc  c_2,t_2,c_2
        addx    c_3,%g0,c_3
        umul    a_2,b_1,t_1     !mul_add_c(a[2],b[1],c1,c2,c3);
        ld      a_3_,a_3
        addcc   c_1,t_1,c_1
        rd      %y,t_2
        addxcc  c_2,t_2,c_2
        addx    c_3,%g0,c_3
        umul    a_3,b_0,t_1     !mul_add_c(a[3],b[0],c1,c2,c3);
        addcc   c_1,t_1,c_1
        rd      %y,t_2
        addxcc  c_2,t_2,c_2
        addx    c_3,%g0,c_3
        st      c_1,[%i0+12]    !r[3]=c1;

        umul    a_3,b_1,t_1     !mul_add_c(a[3],b[1],c2,c3,c1);
        addcc   c_2,t_1,c_2
        rd      %y,t_2
        addxcc  c_3,t_2,c_3
        addx    %g0,%g0,c_1
        umul    a_2,b_2,t_1     !mul_add_c(a[2],b[2],c2,c3,c1);
        addcc   c_2,t_1,c_2
        rd      %y,t_2
        addxcc  c_3,t_2,c_3
        addx    c_1,%g0,c_1
        umul    a_1,b_3,t_1     !mul_add_c(a[1],b[3],c2,c3,c1);
        addcc   c_2,t_1,c_2
        rd      %y,t_2
        addxcc  c_3,t_2,c_3
        addx    c_1,%g0,c_1
        st      c_2,[%i0+16]    !r[4]=c2;

        umul    a_2,b_3,t_1     !mul_add_c(a[2],b[3],c3,c1,c2);
        addcc   c_3,t_1,c_3
        rd      %y,t_2
        addxcc  c_1,t_2,c_1
        addx    %g0,%g0,c_2
        umul    a_3,b_2,t_1     !mul_add_c(a[3],b[2],c3,c1,c2);
        addcc   c_3,t_1,c_3
        rd      %y,t_2
        addxcc  c_1,t_2,c_1
        addx    c_2,%g0,c_2
        st      c_3,[%i0+20]    !r[5]=c3;

        umul    a_3,b_3,t_1     !mul_add_c(a[3],b[3],c1,c2,c3);
        addcc   c_1,t_1,c_1
        rd      %y,t_2
        addxcc  c_2,t_2,c_2     ! top column: no further carry possible, no addx needed
        st      c_1,[%i0+24]    !r[6]=c1;
        st      c_2,[%i0+28]    !r[7]=c2;
        
        ret
        restore %g0,%g0,%o0     ! pop register window; caller sees %o0 = 0

.type   bn_mul_comba4,2         ! 2 = STT_FUNC for Solaris/ELF assemblers
.size   bn_mul_comba4,(.-bn_mul_comba4)

.align  32

.global bn_sqr_comba8
/*
 * void bn_sqr_comba8(r,a)
 * BN_ULONG *r,*a;
 *
 * Fully unrolled Comba squaring: r[0..15] = a[0..7]^2.
 * Two accumulation patterns appear below:
 *
 * sqr_add_c(a,i,...)  -- diagonal term a[i]*a[i], added once:
 *      umul    a_i,a_i,t_1     ! t_1 = lo, %y = hi
 *      addcc   cX,t_1,cX
 *      rd      %y,t_2
 *      addxcc  cY,t_2,cY
 *      addx    cZ,%g0,cZ
 *
 * sqr_add_c2(a,i,j,...) -- cross term a[i]*a[j] (i!=j), which occurs
 * twice in the square, so the same t_1/t_2 pair is added a second time
 * (second addcc/addxcc/addx triple) instead of doubling the product.
 *
 * %y is only written by umul and rd does not touch the condition codes,
 * so loads of upcoming a[] words can be interleaved freely.  Operand
 * names (a_0, a_0_, c_1, t_1, FRAME_SIZE) are macros defined earlier in
 * this file; r is %i0 (see the stores).
 */
bn_sqr_comba8:
        save    %sp,FRAME_SIZE,%sp      ! new register window; %i0=r,%i1=a
        ld      a_0_,a_0
        umul    a_0,a_0,c_1     !sqr_add_c(a,0,c1,c2,c3);
        ld      a_1_,a_1
        rd      %y,c_2
        st      c_1,[%i0]       !r[0]=c1;

        umul    a_0,a_1,t_1     !sqr_add_c2(a,1,0,c2,c3,c1);
        ld      a_2_,a_2
        addcc   c_2,t_1,c_2
        rd      %y,t_2
        addxcc  %g0,t_2,c_3
        addx    %g0,%g0,c_1
        addcc   c_2,t_1,c_2     ! add the cross product a second time
        addxcc  c_3,t_2,c_3
        addx    c_1,%g0,c_1
        st      c_2,[%i0+4]     !r[1]=c2;

        umul    a_2,a_0,t_1     !sqr_add_c2(a,2,0,c3,c1,c2);
        addcc   c_3,t_1,c_3
        rd      %y,t_2
        addxcc  c_1,t_2,c_1
        addx    %g0,%g0,c_2
        addcc   c_3,t_1,c_3
        addxcc  c_1,t_2,c_1
        addx    c_2,%g0,c_2
        umul    a_1,a_1,t_1     !sqr_add_c(a,1,c3,c1,c2);
        ld      a_3_,a_3
        addcc   c_3,t_1,c_3
        rd      %y,t_2
        addxcc  c_1,t_2,c_1
        addx    c_2,%g0,c_2
        st      c_3,[%i0+8]     !r[2]=c3;

        umul    a_0,a_3,t_1     !sqr_add_c2(a,3,0,c1,c2,c3);
        addcc   c_1,t_1,c_1
        rd      %y,t_2
        addxcc  c_2,t_2,c_2
        addx    %g0,%g0,c_3
        addcc   c_1,t_1,c_1
        addxcc  c_2,t_2,c_2
        addx    c_3,%g0,c_3
        umul    a_1,a_2,t_1     !sqr_add_c2(a,2,1,c1,c2,c3);
        ld      a_4_,a_4
        addcc   c_1,t_1,c_1
        rd      %y,t_2
        addxcc  c_2,t_2,c_2
        addx    c_3,%g0,c_3
        addcc   c_1,t_1,c_1
        addxcc  c_2,t_2,c_2
        addx    c_3,%g0,c_3
        st      c_1,[%i0+12]    !r[3]=c1;

        umul    a_4,a_0,t_1     !sqr_add_c2(a,4,0,c2,c3,c1);
        addcc   c_2,t_1,c_2
        rd      %y,t_2
        addxcc  c_3,t_2,c_3
        addx    %g0,%g0,c_1
        addcc   c_2,t_1,c_2
        addxcc  c_3,t_2,c_3
        addx    c_1,%g0,c_1
        umul    a_3,a_1,t_1     !sqr_add_c2(a,3,1,c2,c3,c1);
        addcc   c_2,t_1,c_2
        rd      %y,t_2
        addxcc  c_3,t_2,c_3
        addx    c_1,%g0,c_1
        addcc   c_2,t_1,c_2
        addxcc  c_3,t_2,c_3
        addx    c_1,%g0,c_1
        umul    a_2,a_2,t_1     !sqr_add_c(a,2,c2,c3,c1);
        ld      a_5_,a_5
        addcc   c_2,t_1,c_2
        rd      %y,t_2
        addxcc  c_3,t_2,c_3
        addx    c_1,%g0,c_1
        st      c_2,[%i0+16]    !r[4]=c2;

        umul    a_0,a_5,t_1     !sqr_add_c2(a,5,0,c3,c1,c2);
        addcc   c_3,t_1,c_3
        rd      %y,t_2
        addxcc  c_1,t_2,c_1
        addx    %g0,%g0,c_2
        addcc   c_3,t_1,c_3
        addxcc  c_1,t_2,c_1
        addx    c_2,%g0,c_2
        umul    a_1,a_4,t_1     !sqr_add_c2(a,4,1,c3,c1,c2);
        addcc   c_3,t_1,c_3
        rd      %y,t_2
        addxcc  c_1,t_2,c_1
        addx    c_2,%g0,c_2
        addcc   c_3,t_1,c_3
        addxcc  c_1,t_2,c_1
        addx    c_2,%g0,c_2
        umul    a_2,a_3,t_1     !sqr_add_c2(a,3,2,c3,c1,c2);
        ld      a_6_,a_6
        addcc   c_3,t_1,c_3
        rd      %y,t_2
        addxcc  c_1,t_2,c_1
        addx    c_2,%g0,c_2
        addcc   c_3,t_1,c_3
        addxcc  c_1,t_2,c_1
        addx    c_2,%g0,c_2
        st      c_3,[%i0+20]    !r[5]=c3;

        umul    a_6,a_0,t_1     !sqr_add_c2(a,6,0,c1,c2,c3);
        addcc   c_1,t_1,c_1
        rd      %y,t_2
        addxcc  c_2,t_2,c_2
        addx    %g0,%g0,c_3
        addcc   c_1,t_1,c_1
        addxcc  c_2,t_2,c_2
        addx    c_3,%g0,c_3
        umul    a_5,a_1,t_1     !sqr_add_c2(a,5,1,c1,c2,c3);
        addcc   c_1,t_1,c_1
        rd      %y,t_2
        addxcc  c_2,t_2,c_2
        addx    c_3,%g0,c_3
        addcc   c_1,t_1,c_1
        addxcc  c_2,t_2,c_2
        addx    c_3,%g0,c_3
        umul    a_4,a_2,t_1     !sqr_add_c2(a,4,2,c1,c2,c3);
        addcc   c_1,t_1,c_1
        rd      %y,t_2
        addxcc  c_2,t_2,c_2
        addx    c_3,%g0,c_3
        addcc   c_1,t_1,c_1
        addxcc  c_2,t_2,c_2
        addx    c_3,%g0,c_3
        umul    a_3,a_3,t_1     !sqr_add_c(a,3,c1,c2,c3);
        ld      a_7_,a_7
        addcc   c_1,t_1,c_1
        rd      %y,t_2
        addxcc  c_2,t_2,c_2
        addx    c_3,%g0,c_3
        st      c_1,[%i0+24]    !r[6]=c1;

        umul    a_0,a_7,t_1     !sqr_add_c2(a,7,0,c2,c3,c1);
        addcc   c_2,t_1,c_2
        rd      %y,t_2
        addxcc  c_3,t_2,c_3
        addx    %g0,%g0,c_1
        addcc   c_2,t_1,c_2
        addxcc  c_3,t_2,c_3
        addx    c_1,%g0,c_1
        umul    a_1,a_6,t_1     !sqr_add_c2(a,6,1,c2,c3,c1);
        addcc   c_2,t_1,c_2
        rd      %y,t_2
        addxcc  c_3,t_2,c_3
        addx    c_1,%g0,c_1
        addcc   c_2,t_1,c_2
        addxcc  c_3,t_2,c_3
        addx    c_1,%g0,c_1
        umul    a_2,a_5,t_1     !sqr_add_c2(a,5,2,c2,c3,c1);
        addcc   c_2,t_1,c_2
        rd      %y,t_2
        addxcc  c_3,t_2,c_3
        addx    c_1,%g0,c_1
        addcc   c_2,t_1,c_2
        addxcc  c_3,t_2,c_3
        addx    c_1,%g0,c_1
        umul    a_3,a_4,t_1     !sqr_add_c2(a,4,3,c2,c3,c1);
        addcc   c_2,t_1,c_2
        rd      %y,t_2
        addxcc  c_3,t_2,c_3
        addx    c_1,%g0,c_1
        addcc   c_2,t_1,c_2
        addxcc  c_3,t_2,c_3
        addx    c_1,%g0,c_1
        st      c_2,[%i0+28]    !r[7]=c2;

        umul    a_7,a_1,t_1     !sqr_add_c2(a,7,1,c3,c1,c2);
        addcc   c_3,t_1,c_3
        rd      %y,t_2
        addxcc  c_1,t_2,c_1
        addx    %g0,%g0,c_2
        addcc   c_3,t_1,c_3
        addxcc  c_1,t_2,c_1
        addx    c_2,%g0,c_2
        umul    a_6,a_2,t_1     !sqr_add_c2(a,6,2,c3,c1,c2);
        addcc   c_3,t_1,c_3
        rd      %y,t_2
        addxcc  c_1,t_2,c_1
        addx    c_2,%g0,c_2
        addcc   c_3,t_1,c_3
        addxcc  c_1,t_2,c_1
        addx    c_2,%g0,c_2
        umul    a_5,a_3,t_1     !sqr_add_c2(a,5,3,c3,c1,c2);
        addcc   c_3,t_1,c_3
        rd      %y,t_2
        addxcc  c_1,t_2,c_1
        addx    c_2,%g0,c_2
        addcc   c_3,t_1,c_3
        addxcc  c_1,t_2,c_1
        addx    c_2,%g0,c_2
        umul    a_4,a_4,t_1     !sqr_add_c(a,4,c3,c1,c2);
        addcc   c_3,t_1,c_3
        rd      %y,t_2
        addxcc  c_1,t_2,c_1
        addx    c_2,%g0,c_2
        st      c_3,[%i0+32]    !r[8]=c3;

        umul    a_2,a_7,t_1     !sqr_add_c2(a,7,2,c1,c2,c3);
        addcc   c_1,t_1,c_1
        rd      %y,t_2
        addxcc  c_2,t_2,c_2
        addx    %g0,%g0,c_3
        addcc   c_1,t_1,c_1
        addxcc  c_2,t_2,c_2
        addx    c_3,%g0,c_3
        umul    a_3,a_6,t_1     !sqr_add_c2(a,6,3,c1,c2,c3);
        addcc   c_1,t_1,c_1
        rd      %y,t_2
        addxcc  c_2,t_2,c_2
        addx    c_3,%g0,c_3
        addcc   c_1,t_1,c_1
        addxcc  c_2,t_2,c_2
        addx    c_3,%g0,c_3
        umul    a_4,a_5,t_1     !sqr_add_c2(a,5,4,c1,c2,c3);
        addcc   c_1,t_1,c_1
        rd      %y,t_2
        addxcc  c_2,t_2,c_2
        addx    c_3,%g0,c_3
        addcc   c_1,t_1,c_1
        addxcc  c_2,t_2,c_2
        addx    c_3,%g0,c_3
        st      c_1,[%i0+36]    !r[9]=c1;

        umul    a_7,a_3,t_1     !sqr_add_c2(a,7,3,c2,c3,c1);
        addcc   c_2,t_1,c_2
        rd      %y,t_2
        addxcc  c_3,t_2,c_3
        addx    %g0,%g0,c_1
        addcc   c_2,t_1,c_2
        addxcc  c_3,t_2,c_3
        addx    c_1,%g0,c_1
        umul    a_6,a_4,t_1     !sqr_add_c2(a,6,4,c2,c3,c1);
        addcc   c_2,t_1,c_2
        rd      %y,t_2
        addxcc  c_3,t_2,c_3
        addx    c_1,%g0,c_1
        addcc   c_2,t_1,c_2
        addxcc  c_3,t_2,c_3
        addx    c_1,%g0,c_1
        umul    a_5,a_5,t_1     !sqr_add_c(a,5,c2,c3,c1);
        addcc   c_2,t_1,c_2
        rd      %y,t_2
        addxcc  c_3,t_2,c_3
        addx    c_1,%g0,c_1
        st      c_2,[%i0+40]    !r[10]=c2;

        umul    a_4,a_7,t_1     !sqr_add_c2(a,7,4,c3,c1,c2);
        addcc   c_3,t_1,c_3
        rd      %y,t_2
        addxcc  c_1,t_2,c_1
        addx    %g0,%g0,c_2
        addcc   c_3,t_1,c_3
        addxcc  c_1,t_2,c_1
        addx    c_2,%g0,c_2
        umul    a_5,a_6,t_1     !sqr_add_c2(a,6,5,c3,c1,c2);
        addcc   c_3,t_1,c_3
        rd      %y,t_2
        addxcc  c_1,t_2,c_1
        addx    c_2,%g0,c_2
        addcc   c_3,t_1,c_3
        addxcc  c_1,t_2,c_1
        addx    c_2,%g0,c_2
        st      c_3,[%i0+44]    !r[11]=c3;

        umul    a_7,a_5,t_1     !sqr_add_c2(a,7,5,c1,c2,c3);
        addcc   c_1,t_1,c_1
        rd      %y,t_2
        addxcc  c_2,t_2,c_2
        addx    %g0,%g0,c_3
        addcc   c_1,t_1,c_1
        addxcc  c_2,t_2,c_2
        addx    c_3,%g0,c_3
        umul    a_6,a_6,t_1     !sqr_add_c(a,6,c1,c2,c3);
        addcc   c_1,t_1,c_1
        rd      %y,t_2
        addxcc  c_2,t_2,c_2
        addx    c_3,%g0,c_3
        st      c_1,[%i0+48]    !r[12]=c1;

        umul    a_6,a_7,t_1     !sqr_add_c2(a,7,6,c2,c3,c1);
        addcc   c_2,t_1,c_2
        rd      %y,t_2
        addxcc  c_3,t_2,c_3
        addx    %g0,%g0,c_1
        addcc   c_2,t_1,c_2
        rd      %y,t_2          ! NOTE(review): redundant -- %y is unchanged since the umul above and rd does not touch icc; every other sqr_add_c2 group omits this second rd
        addxcc  c_3,t_2,c_3
        addx    c_1,%g0,c_1
        st      c_2,[%i0+52]    !r[13]=c2;

        umul    a_7,a_7,t_1     !sqr_add_c(a,7,c3,c1,c2);
        addcc   c_3,t_1,c_3
        rd      %y,t_2
        addxcc  c_1,t_2,c_1     ! top column: no further carry possible, no addx needed
        st      c_3,[%i0+56]    !r[14]=c3;
        st      c_1,[%i0+60]    !r[15]=c1;

        ret
        restore %g0,%g0,%o0     ! pop register window; caller sees %o0 = 0

.type   bn_sqr_comba8,2         ! 2 = STT_FUNC for Solaris/ELF assemblers
.size   bn_sqr_comba8,(.-bn_sqr_comba8)

.align  32

.global bn_sqr_comba4
/*
 * void bn_sqr_comba4(r,a)
 * BN_ULONG *r,*a;
 *
 * Fully unrolled Comba squaring: r[0..7] = a[0..3]^2.
 * sqr_add_c adds a diagonal product a[i]*a[i] once; sqr_add_c2 adds a
 * cross product a[i]*a[j] (i!=j) twice by repeating the
 * addcc/addxcc/addx triple on the same t_1 (low half) / t_2 (high
 * half, fetched from %y) pair.  rd %y does not disturb the condition
 * codes, so loads are interleaved into the carry chain to hide latency.
 * Operand names (a_0, a_0_, c_1, t_1, FRAME_SIZE) are macros defined
 * earlier in this file; r is %i0 (see the stores).
 */
bn_sqr_comba4:
        save    %sp,FRAME_SIZE,%sp      ! new register window; %i0=r,%i1=a
        ld      a_0_,a_0
        umul    a_0,a_0,c_1     !sqr_add_c(a,0,c1,c2,c3);
        ld      a_1_,a_1
        rd      %y,c_2
        st      c_1,[%i0]       !r[0]=c1;

        umul    a_0,a_1,t_1     !sqr_add_c2(a,1,0,c2,c3,c1);
        ld      a_1_,a_1        ! NOTE(review): redundant reload -- a_1 is already live from above and a[1] has not been stored to; harmless but wastes a slot
        addcc   c_2,t_1,c_2
        rd      %y,t_2
        addxcc  %g0,t_2,c_3
        addx    %g0,%g0,c_1
        ld      a_2_,a_2
        addcc   c_2,t_1,c_2     ! add the cross product a second time
        addxcc  c_3,t_2,c_3
        addx    c_1,%g0,c_1
        st      c_2,[%i0+4]     !r[1]=c2;

        umul    a_2,a_0,t_1     !sqr_add_c2(a,2,0,c3,c1,c2);
        addcc   c_3,t_1,c_3
        rd      %y,t_2
        addxcc  c_1,t_2,c_1
        addx    %g0,%g0,c_2
        addcc   c_3,t_1,c_3
        addxcc  c_1,t_2,c_1
        addx    c_2,%g0,c_2
        umul    a_1,a_1,t_1     !sqr_add_c(a,1,c3,c1,c2);
        ld      a_3_,a_3
        addcc   c_3,t_1,c_3
        rd      %y,t_2
        addxcc  c_1,t_2,c_1
        addx    c_2,%g0,c_2
        st      c_3,[%i0+8]     !r[2]=c3;

        umul    a_0,a_3,t_1     !sqr_add_c2(a,3,0,c1,c2,c3);
        addcc   c_1,t_1,c_1
        rd      %y,t_2
        addxcc  c_2,t_2,c_2
        addx    %g0,%g0,c_3
        addcc   c_1,t_1,c_1
        addxcc  c_2,t_2,c_2
        addx    c_3,%g0,c_3
        umul    a_1,a_2,t_1     !sqr_add_c2(a,2,1,c1,c2,c3);
        addcc   c_1,t_1,c_1
        rd      %y,t_2
        addxcc  c_2,t_2,c_2
        addx    c_3,%g0,c_3
        addcc   c_1,t_1,c_1
        addxcc  c_2,t_2,c_2
        addx    c_3,%g0,c_3
        st      c_1,[%i0+12]    !r[3]=c1;

        umul    a_3,a_1,t_1     !sqr_add_c2(a,3,1,c2,c3,c1);
        addcc   c_2,t_1,c_2
        rd      %y,t_2
        addxcc  c_3,t_2,c_3
        addx    %g0,%g0,c_1
        addcc   c_2,t_1,c_2
        addxcc  c_3,t_2,c_3
        addx    c_1,%g0,c_1
        umul    a_2,a_2,t_1     !sqr_add_c(a,2,c2,c3,c1);
        addcc   c_2,t_1,c_2
        rd      %y,t_2
        addxcc  c_3,t_2,c_3
        addx    c_1,%g0,c_1
        st      c_2,[%i0+16]    !r[4]=c2;

        umul    a_2,a_3,t_1     !sqr_add_c2(a,3,2,c3,c1,c2);
        addcc   c_3,t_1,c_3
        rd      %y,t_2
        addxcc  c_1,t_2,c_1
        addx    %g0,%g0,c_2
        addcc   c_3,t_1,c_3
        addxcc  c_1,t_2,c_1
        addx    c_2,%g0,c_2
        st      c_3,[%i0+20]    !r[5]=c3;

        umul    a_3,a_3,t_1     !sqr_add_c(a,3,c1,c2,c3);
        addcc   c_1,t_1,c_1
        rd      %y,t_2
        addxcc  c_2,t_2,c_2     ! top column: no further carry possible, no addx needed
        st      c_1,[%i0+24]    !r[6]=c1;
        st      c_2,[%i0+28]    !r[7]=c2;
        
        ret
        restore %g0,%g0,%o0     ! pop register window; caller sees %o0 = 0

.type   bn_sqr_comba4,2         ! 2 = STT_FUNC for Solaris/ELF assemblers
.size   bn_sqr_comba4,(.-bn_sqr_comba4)

Reply via email to