Hi again! > > Bottom line. Expect version 1.1 implemention after this weekend:-) Find version 1.2 attached. There're *no* machine code changes, but some gas (GNU assembler) specific clean-ups, documentation concerning v9 (i.e. Solaris 7 in 64 bit mode) updates and minor preprocessing rearrangements in bn_*_comba[48] functions. Do read comments in the beginning of v8plus file... I just would like to comment new performance comparison chart found in queez-n-answers section: > * v8plus module on U10/300MHz against bn_asm.c compiled with: > * > * cc-5.0 -xarch=v8plus -xO5 -xdepend +7-12% > * cc-4.2 -xarch=v8plus -xO5 -xdepend +25-35% > * egcs-1.1.2 -mcpu=ultrasparc -O3 +35-45% > * > * v8 module on SS10/60MHz against bn_asm.c compiled with: > * > * cc-5.0 -xarch=v8 -xO5 -xdepend +7-10% > * cc-4.2 -xarch=v8 -xO5 -xdepend +10% > * egcs-1.1.2 -mv8 -O3 +35-45% > * I just want to add one (preliminary) number for Solaris 7 in 64 bits mode. cc-5.0 -xarch=v9 -x05 -xdepend (i.e. pure/real 64 bits) generates code that performs about 5-7% faster than one generated with cc-5.0 -xarch=v8plus -xO5 -xdepend (i.e. under 32 bits kernel). Kind of expected... As I already mentioned earlier, there is no *computational* difference between v9 and v8plus. But in v9 model CPU sucks data in 64 bits chunks, while in v8plus model it has to fetch data in 32 bits words, which explains minor performance boost. And finally about v9 assembler implementation. As you can see from the table it's damn hard to compete with new Sun C compiler and I don't see any sign that it would be possible to handcode anything that would perform at least 15% better. For this reason I do *not* see any burning need for v9 assembler implementation at this point. Well, because SC5.0 is the only compiler capable of generating 64 bit code for Solaris 7 anyway... Well, UltraLinux people might be interested in it... But I in turn don't have any UltraLinux boxed and don't plan to get any... So let's put it this way. If there actually *are* UltraLinux people out there *really* interested in this, contact me and we'll see... Cheers. Andy.
.ident "bn_asm.sparc.v8plus.S, Version 1.2" .ident "SPARC v8plus ISA artwork by Andy Polyakov <[EMAIL PROTECTED]>" /* * ==================================================================== * Copyright (c) 1999 Andy Polyakov <[EMAIL PROTECTED]>. * * Rights for redistribution and usage in source and binary forms are * granted as long as above copyright notices are retained. Warranty * of any kind is (of course:-) disclaimed. * ==================================================================== */ /* * This is my modest contributon to OpenSSL project (see * http://www.openssl.org/ for more information about it) and is * a drop-in UltraSPARC ISA replacement for crypto/bn/bn_asm.c * module. For updates see http://fy.chalmers.se/~appro/hpe/. * * Questions-n-answers. * * Q. How to compile? * A. With SC4.x/SC5.x: * * cc -xarch=v8plus -c bn_asm.sparc.v8plus.S -o bn_asm.o * * and with gcc: * * gcc -mcpu=ultrasparc -c bn_asm.sparc.v8plus.S -o bn_asm.o * * or if above fails (it does if you have gas installed): * * gcc -E bn_asm.sparc.v8plus.S | as -xarch=v8plus /dev/fd/0 -o bn_asm.o * * Quick-n-dirty way to fuse the module into the library. * Provided that the library is already configured and built * (in 0.9.2 case with no_asm option): * * # cd crypto/bn * # cp /some/place/bn_asm.sparc.v8plus.S . * # cc -xarch=v8plus -c bn_asm.sparc.v8plus.S -o bn_asm.o * # make * # cd ../.. * # make; make test * * Quick-n-dirty way to get rid of it: * * # cd crypto/bn * # touch bn_asm.c * # make * # cd ../.. * # make; make test * * Q. What is v8plus exactly for architecture? * A. Well, it's rather a programming model than architecture... * It's actually v9-compliant, i.e. *any* UltraSPARC, CPU under * special conditions, namely when kernel doesn't preserve upper * 32 bits of otherwise 64-bit registers during a context switch. * * Q. Why just UltraSPARC? What about SuperSPARC? * A. Original release did target UltraSPARC only. Now SuperSPARC * version is provided along. Both version share bn_*comba[48] * implementations (see comment later in code for explanation). * But what's so special about this UltraSPARC implementation? * Why didn't I let compiler do the job? Trouble is that most of * available compilers (well, SC5.0 is the only exception) don't * attempt to take advantage of UltraSPARC's 64-bitness under * 32-bit kernels even though it's perfectly possible (see next * question). * * Q. 64-bit registers under 32-bit kernels? Didn't you just say it * doesn't work? * A. You can't adress *all* registers as 64-bit wide:-( The catch is * that you actually may rely upon %o0-%o5 and %g1-%g4 being fully * preserved if you're in a leaf function, i.e. such never calling * any other functions. All functions in this module are leaf and * 10 registers is a handful. And as a matter of fact none-"comba" * routines don't require even that much and I could even afford to * not allocate own stack frame for 'em:-) * * Q. What about 64-bit kernels? * A. What about 'em? Just kidding:-) Pure 64-bits version is currently * under evaluation and development... * * Q. What about sharable libraries? * A. What about 'em? Kidding again:-) Code does *not* contain any * code position dependencies and it's safe to include it into * sharable library as is. * * Q. How much faster does it get? * A. Do you have a good benchmark? In either case below is what I * experience with crypto/bn/expspeed.c test program: * * v8plus module on U10/300MHz against bn_asm.c compiled with: * * cc-5.0 -xarch=v8plus -xO5 -xdepend +7-12% * cc-4.2 -xarch=v8plus -xO5 -xdepend +25-35% * egcs-1.1.2 -mcpu=ultrasparc -O3 +35-45% * * v8 module on SS10/60MHz against bn_asm.c compiled with: * * cc-5.0 -xarch=v8 -xO5 -xdepend +7-10% * cc-4.2 -xarch=v8 -xO5 -xdepend +10% * egcs-1.1.2 -mv8 -O3 +35-45% * * As you can see it's damn hard to beat the new Sun C compiler * and it's in first hand GNU C users who will appreciate this * assembler implementation:-) */ /* * Revision history. * * 1.0 - initial release; * 1.1 - new loop unrolling model(*); * - some more fine tuning; * 1.2 - made gas friendly; * - updates to documentation concerning v9; * - new performance comparison matrix; * * (*) Originally unrolled loop looked like this: * for (;;) { * op(p+0); if (--n==0) break; * op(p+1); if (--n==0) break; * op(p+2); if (--n==0) break; * op(p+3); if (--n==0) break; * p+=4; * } * I unroll according to following: * while (n&~3) { * op(p+0); op(p+1); op(p+2); op(p+3); * p+=4; n=-4; * } * if (n) { * op(p+0); if (--n==0) return; * op(p+2); if (--n==0) return; * op(p+3); return; * } */ .section ".text",#alloc,#execinstr .file "bn_asm.sparc.v8plus.S" .align 32 .global bn_mul_add_words /* * BN_ULONG bn_mul_add_words(rp,ap,num,w) * BN_ULONG *rp,*ap; * int num; * BN_ULONG w; */ bn_mul_add_words: brgz,a %o2,.L_bn_mul_add_words_proceed lduw [%o1],%g2 retl clr %o0 .L_bn_mul_add_words_proceed: clruw %o3 andcc %o2,-4,%g0 bz,pn %icc,.L_bn_mul_add_words_tail clr %o5 .L_bn_mul_add_words_loop: ! wow! 32 aligned! lduw [%o0],%o4 mulx %o3,%g2,%g2 add %o4,%o5,%o4 add %o4,%g2,%o4 lduw [%o1+4],%g3 nop stuw %o4,[%o0] srlx %o4,32,%o5 lduw [%o0+4],%o4 mulx %o3,%g3,%g3 dec 4,%o2 add %o4,%o5,%o4 lduw [%o1+8],%g2 add %o4,%g3,%o4 stuw %o4,[%o0+4] srlx %o4,32,%o5 lduw [%o0+8],%o4 mulx %o3,%g2,%g2 inc 16,%o1 add %o4,%o5,%o4 lduw [%o1-4],%g3 add %o4,%g2,%o4 stuw %o4,[%o0+8] srlx %o4,32,%o5 lduw [%o0+12],%o4 mulx %o3,%g3,%g3 add %o4,%o5,%o4 inc 16,%o0 add %o4,%g3,%o4 srlx %o4,32,%o5 stuw %o4,[%o0-4] andcc %o2,-4,%g0 bnz,a,pt %icc,.L_bn_mul_add_words_loop lduw [%o1],%g2 brnz,a,pn %o2,.L_bn_mul_add_words_tail lduw [%o1],%g2 .L_bn_mul_add_words_return: retl mov %o5,%o0 .L_bn_mul_add_words_tail: lduw [%o0],%o4 mulx %o3,%g2,%g2 add %o4,%o5,%o4 dec %o2 add %o4,%g2,%o4 srlx %o4,32,%o5 brz,pt %o2,.L_bn_mul_add_words_return stuw %o4,[%o0] lduw [%o1+4],%g2 mulx %o3,%g2,%g2 lduw [%o0+4],%o4 add %o4,%o5,%o4 dec %o2 add %o4,%g2,%o4 srlx %o4,32,%o5 brz,pt %o2,.L_bn_mul_add_words_return stuw %o4,[%o0+4] lduw [%o1+8],%g2 mulx %o3,%g2,%g2 lduw [%o0+8],%o4 add %o4,%o5,%o4 add %o4,%g2,%o4 stuw %o4,[%o0+8] retl srlx %o4,32,%o0 .type bn_mul_add_words,#function .size bn_mul_add_words,(.-bn_mul_add_words) .align 32 .global bn_mul_words /* * BN_ULONG bn_mul_words(rp,ap,num,w) * BN_ULONG *rp,*ap; * int num; * BN_ULONG w; */ bn_mul_words: brgz,a %o2,.L_bn_mul_words_proceeed lduw [%o1],%g2 retl clr %o0 .L_bn_mul_words_proceeed: clruw %o3 andcc %o2,-4,%g0 bz,pn %icc,.L_bn_mul_words_tail clr %o5 .L_bn_mul_words_loop: ! wow! 32 aligned! lduw [%o1+4],%g3 mulx %o3,%g2,%g2 add %g2,%o5,%g2 nop srlx %g2,32,%o5 stuw %g2,[%o0] lduw [%o1+8],%g2 mulx %o3,%g3,%g3 add %g3,%o5,%g3 dec 4,%o2 stuw %g3,[%o0+4] srlx %g3,32,%o5 lduw [%o1+12],%g3 mulx %o3,%g2,%g2 add %g2,%o5,%g2 inc 16,%o1 stuw %g2,[%o0+8] srlx %g2,32,%o5 mulx %o3,%g3,%g3 inc 16,%o0 add %g3,%o5,%g3 nop stuw %g3,[%o0-4] srlx %g3,32,%o5 andcc %o2,-4,%g0 bnz,a,pt %icc,.L_bn_mul_words_loop lduw [%o1],%g2 nop brnz,a,pn %o2,.L_bn_mul_words_tail lduw [%o1],%g2 .L_bn_mul_words_return: retl mov %o5,%o0 .L_bn_mul_words_tail: mulx %o3,%g2,%g2 add %g2,%o5,%g2 dec %o2 srlx %g2,32,%o5 brz,pt %o2,.L_bn_mul_words_return stuw %g2,[%o0] lduw [%o1+4],%g2 mulx %o3,%g2,%g2 add %g2,%o5,%g2 dec %o2 srlx %g2,32,%o5 brz,pt %o2,.L_bn_mul_words_return stuw %g2,[%o0+4] lduw [%o1+8],%g2 mulx %o3,%g2,%g2 add %g2,%o5,%g2 stuw %g2,[%o0+8] retl srlx %g2,32,%o0 .type bn_mul_words,#function .size bn_mul_words,(.-bn_mul_words) .align 32 .global bn_sqr_words /* * void bn_sqr_words(r,a,n) * BN_ULONG *r,*a; * int n; */ bn_sqr_words: brgz,a %o2,.L_bn_sqr_words_proceeed lduw [%o1],%g2 retl clr %o0 .L_bn_sqr_words_proceeed: andcc %o2,-4,%g0 nop bz,pn %icc,.L_bn_sqr_words_tail nop .L_bn_sqr_words_loop: ! wow! 32 aligned! lduw [%o1+4],%g3 mulx %g2,%g2,%o4 stuw %o4,[%o0] srlx %o4,32,%o5 stuw %o5,[%o0+4] nop lduw [%o1+8],%g2 mulx %g3,%g3,%o4 dec 4,%o2 stuw %o4,[%o0+8] srlx %o4,32,%o5 nop stuw %o5,[%o0+12] lduw [%o1+12],%g3 mulx %g2,%g2,%o4 srlx %o4,32,%o5 stuw %o4,[%o0+16] inc 16,%o1 stuw %o5,[%o0+20] mulx %g3,%g3,%o4 inc 32,%o0 stuw %o4,[%o0-8] srlx %o4,32,%o5 andcc %o2,-4,%g2 stuw %o5,[%o0-4] bnz,a,pt %icc,.L_bn_sqr_words_loop lduw [%o1],%g2 nop brnz,a,pn %o2,.L_bn_sqr_words_tail lduw [%o1],%g2 .L_bn_sqr_words_return: retl clr %o0 .L_bn_sqr_words_tail: mulx %g2,%g2,%o4 dec %o2 stuw %o4,[%o0] srlx %o4,32,%o5 brz,pt %o2,.L_bn_sqr_words_return stuw %o5,[%o0+4] lduw [%o1+4],%g2 mulx %g2,%g2,%o4 dec %o2 stuw %o4,[%o0+8] srlx %o4,32,%o5 brz,pt %o2,.L_bn_sqr_words_return stuw %o5,[%o0+12] lduw [%o1+8],%g2 mulx %g2,%g2,%o4 srlx %o4,32,%o5 stuw %o4,[%o0+16] stuw %o5,[%o0+20] retl clr %o0 .type bn_sqr_words,#function .size bn_sqr_words,(.-bn_sqr_words) .align 32 .global bn_div_words /* * BN_ULONG bn_div_words(h,l,d) * BN_ULONG h,l,d; */ bn_div_words: sllx %o0,32,%o0 or %o0,%o1,%o0 udivx %o0,%o2,%o0 retl clruw %o0 .type bn_div_words,#function .size bn_div_words,(.-bn_div_words) .align 32 .global bn_add_words /* * BN_ULONG bn_add_words(rp,ap,bp,n) * BN_ULONG *rp,*ap,*bp; * int n; */ bn_add_words: brgz,a %o3,.L_bn_add_words_proceed lduw [%o1],%o4 retl clr %o0 .L_bn_add_words_proceed: andcc %o3,-4,%g0 bz,pn %icc,.L_bn_add_words_tail addcc %g0,0,%g0 ! clear carry flag nop lduw [%o2],%o5 dec 4,%o3 addcc %o5,%o4,%o5 nop stuw %o5,[%o0] ba .L_bn_add_words_warm_loop lduw [%o1+4],%o4 nop .L_bn_add_words_loop: ! wow! 32 aligned! dec 4,%o3 lduw [%o2],%o5 nop addccc %o5,%o4,%o5 stuw %o5,[%o0] lduw [%o1+4],%o4 .L_bn_add_words_warm_loop: inc 16,%o1 nop lduw [%o2+4],%o5 addccc %o5,%o4,%o5 stuw %o5,[%o0+4] nop lduw [%o1-8],%o4 inc 16,%o2 lduw [%o2-8],%o5 addccc %o5,%o4,%o5 stuw %o5,[%o0+8] lduw [%o1-4],%o4 inc 16,%o0 nop lduw [%o2-4],%o5 addccc %o5,%o4,%o5 stuw %o5,[%o0-4] and %o3,-4,%g1 brnz,a,pt %g1,.L_bn_add_words_loop lduw [%o1],%o4 brnz,a,pn %o3,.L_bn_add_words_tail lduw [%o1],%o4 .L_bn_add_words_return: clr %o0 retl movcs %icc,1,%o0 nop .L_bn_add_words_tail: ! wow! 32 aligned! lduw [%o2],%o5 dec %o3 addccc %o5,%o4,%o5 brz,pt %o3,.L_bn_add_words_return stuw %o5,[%o0] lduw [%o1+4],%o4 lduw [%o2+4],%o5 addccc %o5,%o4,%o5 dec %o3 brz,pt %o3,.L_bn_add_words_return stuw %o5,[%o0+4] nop lduw [%o1+8],%o4 lduw [%o2+8],%o5 addccc %o5,%o4,%o5 nop stuw %o5,[%o0+8] clr %o0 retl movcs %icc,1,%o0 .type bn_add_words,#function .size bn_add_words,(.-bn_add_words) .global bn_sub_words /* * BN_ULONG bn_sub_words(rp,ap,bp,n) * BN_ULONG *rp,*ap,*bp; * int n; */ bn_sub_words: brgz,a %o3,.L_bn_sub_words_proceed lduw [%o1],%o4 retl clr %o0 .L_bn_sub_words_proceed: andcc %o3,-4,%g0 bz,pn %icc,.L_bn_sub_words_tail addcc %g0,0,%g0 ! clear carry flag nop lduw [%o2],%o5 dec 4,%o3 subcc %o4,%o5,%o5 nop stuw %o5,[%o0] ba .L_bn_sub_words_warm_loop lduw [%o1+4],%o4 nop .L_bn_sub_words_loop: ! wow! 32 aligned! dec 4,%o3 lduw [%o2],%o5 nop subccc %o4,%o5,%o5 stuw %o5,[%o0] lduw [%o1+4],%o4 .L_bn_sub_words_warm_loop: inc 16,%o1 nop lduw [%o2+4],%o5 subccc %o4,%o5,%o5 stuw %o5,[%o0+4] nop lduw [%o1-8],%o4 inc 16,%o2 lduw [%o2-8],%o5 subccc %o4,%o5,%o5 stuw %o5,[%o0+8] lduw [%o1-4],%o4 inc 16,%o0 nop lduw [%o2-4],%o5 subccc %o4,%o5,%o5 stuw %o5,[%o0-4] and %o3,-4,%g1 brnz,a,pt %g1,.L_bn_sub_words_loop lduw [%o1],%o4 brnz,a,pn %o3,.L_bn_sub_words_tail lduw [%o1],%o4 .L_bn_sub_words_return: clr %o0 retl movcs %icc,1,%o0 nop .L_bn_sub_words_tail: ! wow! 32 aligned! lduw [%o2],%o5 dec %o3 subccc %o4,%o5,%o5 brz,pt %o3,.L_bn_sub_words_return stuw %o5,[%o0] lduw [%o1+4],%o4 lduw [%o2+4],%o5 subccc %o4,%o5,%o5 dec %o3 brz,pt %o3,.L_bn_sub_words_return stuw %o5,[%o0+4] nop lduw [%o1+8],%o4 lduw [%o2+8],%o5 subccc %o4,%o5,%o5 nop stuw %o5,[%o0+8] clr %o0 retl movcs %icc,1,%o0 .type bn_sub_words,#function .size bn_sub_words,(.-bn_sub_words) /* * Following code is pure SPARC V8! Trouble is that it's not feasible * to implement the mumbo-jumbo in less "V9" instructions:-( At least not * under 32-bit kernel. The reason is that you'd have to shuffle registers * all the time as only few (well, 10:-) are fully (i.e. all 64 bits) * preserved by kernel during context switch. But even under 64-bit kernel * you won't gain much because in the lack of "add with extended carry" * instruction you'd have to issue 'clr %rx; movcs %xcc,1,%rx; * add %rd,%rx,%rd' sequence in place of 'addxcc %rx,%ry,%rx; * addx %rz,%g0,%rz' pair in 32-bit case. Well, 'bpcs,a %xcc,.+8; inc %rd' * is another alternative... * * Andy. */ #define FRAME_SIZE -96 /* * Here is register usage map for *all* routines below. */ #define t_1 %o0 #define t_2 %o1 #define c_1 %o2 #define c_2 %o3 #define c_3 %o4 #define a(I) [%i1+4*I] #define b(I) [%i2+4*I] #define r(I) [%i0+4*I] #define a_0 %l0 #define a_1 %l1 #define a_2 %l2 #define a_3 %l3 #define a_4 %l4 #define a_5 %l5 #define a_6 %l6 #define a_7 %l7 #define b_0 %i3 #define b_1 %i4 #define b_2 %i5 #define b_3 %o5 #define b_4 %g1 #define b_5 %g2 #define b_6 %g3 #define b_7 %g4 .align 32 .global bn_mul_comba8 /* * void bn_mul_comba8(r,a,b) * BN_ULONG *r,*a,*b; */ bn_mul_comba8: save %sp,FRAME_SIZE,%sp ld a(0),a_0 ld b(0),b_0 umul a_0,b_0,c_1 !=!mul_add_c(a[0],b[0],c1,c2,c3); ld b(1),b_1 rd %y,c_2 st c_1,r(0) !r[0]=c1; umul a_0,b_1,t_1 !=!mul_add_c(a[0],b[1],c2,c3,c1); ld a(1),a_1 addcc c_2,t_1,c_2 rd %y,t_2 addxcc %g0,t_2,c_3 != addx %g0,%g0,c_1 ld a(2),a_2 umul a_1,b_0,t_1 !mul_add_c(a[1],b[0],c2,c3,c1); addcc c_2,t_1,c_2 != rd %y,t_2 addxcc c_3,t_2,c_3 st c_2,r(1) !r[1]=c2; addx c_1,%g0,c_1 != umul a_2,b_0,t_1 !mul_add_c(a[2],b[0],c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 addxcc c_1,t_2,c_1 != addx %g0,%g0,c_2 ld b(2),b_2 umul a_1,b_1,t_1 !mul_add_c(a[1],b[1],c3,c1,c2); addcc c_3,t_1,c_3 != rd %y,t_2 addxcc c_1,t_2,c_1 ld b(3),b_3 addx c_2,%g0,c_2 != umul a_0,b_2,t_1 !mul_add_c(a[0],b[2],c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 addxcc c_1,t_2,c_1 != addx c_2,%g0,c_2 st c_3,r(2) !r[2]=c3; umul a_0,b_3,t_1 !mul_add_c(a[0],b[3],c1,c2,c3); addcc c_1,t_1,c_1 != rd %y,t_2 addxcc c_2,t_2,c_2 addx %g0,%g0,c_3 umul a_1,b_2,t_1 !=!mul_add_c(a[1],b[2],c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 != ld a(3),a_3 umul a_2,b_1,t_1 !mul_add_c(a[2],b[1],c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 != addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 ld a(4),a_4 umul a_3,b_0,t_1 !mul_add_c(a[3],b[0],c1,c2,c3);!= addcc c_1,t_1,c_1 rd %y,t_2 addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 != st c_1,r(3) !r[3]=c1; umul a_4,b_0,t_1 !mul_add_c(a[4],b[0],c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 != addxcc c_3,t_2,c_3 addx %g0,%g0,c_1 umul a_3,b_1,t_1 !mul_add_c(a[3],b[1],c2,c3,c1); addcc c_2,t_1,c_2 != rd %y,t_2 addxcc c_3,t_2,c_3 addx c_1,%g0,c_1 umul a_2,b_2,t_1 !=!mul_add_c(a[2],b[2],c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 addxcc c_3,t_2,c_3 addx c_1,%g0,c_1 != ld b(4),b_4 umul a_1,b_3,t_1 !mul_add_c(a[1],b[3],c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 != addxcc c_3,t_2,c_3 addx c_1,%g0,c_1 ld b(5),b_5 umul a_0,b_4,t_1 !=!mul_add_c(a[0],b[4],c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 addxcc c_3,t_2,c_3 addx c_1,%g0,c_1 != st c_2,r(4) !r[4]=c2; umul a_0,b_5,t_1 !mul_add_c(a[0],b[5],c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 != addxcc c_1,t_2,c_1 addx %g0,%g0,c_2 umul a_1,b_4,t_1 !mul_add_c(a[1],b[4],c3,c1,c2); addcc c_3,t_1,c_3 != rd %y,t_2 addxcc c_1,t_2,c_1 addx c_2,%g0,c_2 umul a_2,b_3,t_1 !=!mul_add_c(a[2],b[3],c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 addxcc c_1,t_2,c_1 addx c_2,%g0,c_2 != umul a_3,b_2,t_1 !mul_add_c(a[3],b[2],c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 addxcc c_1,t_2,c_1 != addx c_2,%g0,c_2 ld a(5),a_5 umul a_4,b_1,t_1 !mul_add_c(a[4],b[1],c3,c1,c2); addcc c_3,t_1,c_3 != rd %y,t_2 addxcc c_1,t_2,c_1 ld a(6),a_6 addx c_2,%g0,c_2 != umul a_5,b_0,t_1 !mul_add_c(a[5],b[0],c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 addxcc c_1,t_2,c_1 != addx c_2,%g0,c_2 st c_3,r(5) !r[5]=c3; umul a_6,b_0,t_1 !mul_add_c(a[6],b[0],c1,c2,c3); addcc c_1,t_1,c_1 != rd %y,t_2 addxcc c_2,t_2,c_2 addx %g0,%g0,c_3 umul a_5,b_1,t_1 !=!mul_add_c(a[5],b[1],c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 != umul a_4,b_2,t_1 !mul_add_c(a[4],b[2],c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 addxcc c_2,t_2,c_2 != addx c_3,%g0,c_3 umul a_3,b_3,t_1 !mul_add_c(a[3],b[3],c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 != addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 umul a_2,b_4,t_1 !mul_add_c(a[2],b[4],c1,c2,c3); addcc c_1,t_1,c_1 != rd %y,t_2 addxcc c_2,t_2,c_2 ld b(6),b_6 addx c_3,%g0,c_3 != umul a_1,b_5,t_1 !mul_add_c(a[1],b[5],c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 addxcc c_2,t_2,c_2 != addx c_3,%g0,c_3 ld b(7),b_7 umul a_0,b_6,t_1 !mul_add_c(a[0],b[6],c1,c2,c3); addcc c_1,t_1,c_1 != rd %y,t_2 addxcc c_2,t_2,c_2 st c_1,r(6) !r[6]=c1; addx c_3,%g0,c_3 != umul a_0,b_7,t_1 !mul_add_c(a[0],b[7],c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 addxcc c_3,t_2,c_3 != addx %g0,%g0,c_1 umul a_1,b_6,t_1 !mul_add_c(a[1],b[6],c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 != addxcc c_3,t_2,c_3 addx c_1,%g0,c_1 umul a_2,b_5,t_1 !mul_add_c(a[2],b[5],c2,c3,c1); addcc c_2,t_1,c_2 != rd %y,t_2 addxcc c_3,t_2,c_3 addx c_1,%g0,c_1 umul a_3,b_4,t_1 !=!mul_add_c(a[3],b[4],c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 addxcc c_3,t_2,c_3 addx c_1,%g0,c_1 != umul a_4,b_3,t_1 !mul_add_c(a[4],b[3],c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 addxcc c_3,t_2,c_3 != addx c_1,%g0,c_1 umul a_5,b_2,t_1 !mul_add_c(a[5],b[2],c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 != addxcc c_3,t_2,c_3 addx c_1,%g0,c_1 ld a(7),a_7 umul a_6,b_1,t_1 !=!mul_add_c(a[6],b[1],c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 addxcc c_3,t_2,c_3 addx c_1,%g0,c_1 != umul a_7,b_0,t_1 !mul_add_c(a[7],b[0],c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 addxcc c_3,t_2,c_3 != addx c_1,%g0,c_1 st c_2,r(7) !r[7]=c2; umul a_7,b_1,t_1 !mul_add_c(a[7],b[1],c3,c1,c2); addcc c_3,t_1,c_3 != rd %y,t_2 addxcc c_1,t_2,c_1 addx %g0,%g0,c_2 umul a_6,b_2,t_1 !=!mul_add_c(a[6],b[2],c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 addxcc c_1,t_2,c_1 addx c_2,%g0,c_2 != umul a_5,b_3,t_1 !mul_add_c(a[5],b[3],c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 addxcc c_1,t_2,c_1 != addx c_2,%g0,c_2 umul a_4,b_4,t_1 !mul_add_c(a[4],b[4],c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 != addxcc c_1,t_2,c_1 addx c_2,%g0,c_2 umul a_3,b_5,t_1 !mul_add_c(a[3],b[5],c3,c1,c2); addcc c_3,t_1,c_3 != rd %y,t_2 addxcc c_1,t_2,c_1 addx c_2,%g0,c_2 umul a_2,b_6,t_1 !=!mul_add_c(a[2],b[6],c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 addxcc c_1,t_2,c_1 addx c_2,%g0,c_2 != umul a_1,b_7,t_1 !mul_add_c(a[1],b[7],c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 addxcc c_1,t_2,c_1 ! addx c_2,%g0,c_2 st c_3,r(8) !r[8]=c3; umul a_2,b_7,t_1 !mul_add_c(a[2],b[7],c1,c2,c3); addcc c_1,t_1,c_1 != rd %y,t_2 addxcc c_2,t_2,c_2 addx %g0,%g0,c_3 umul a_3,b_6,t_1 !=!mul_add_c(a[3],b[6],c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 != umul a_4,b_5,t_1 !mul_add_c(a[4],b[5],c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 addxcc c_2,t_2,c_2 != addx c_3,%g0,c_3 umul a_5,b_4,t_1 !mul_add_c(a[5],b[4],c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 != addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 umul a_6,b_3,t_1 !mul_add_c(a[6],b[3],c1,c2,c3); addcc c_1,t_1,c_1 != rd %y,t_2 addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 umul a_7,b_2,t_1 !=!mul_add_c(a[7],b[2],c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 != st c_1,r(9) !r[9]=c1; umul a_7,b_3,t_1 !mul_add_c(a[7],b[3],c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 != addxcc c_3,t_2,c_3 addx %g0,%g0,c_1 umul a_6,b_4,t_1 !mul_add_c(a[6],b[4],c2,c3,c1); addcc c_2,t_1,c_2 != rd %y,t_2 addxcc c_3,t_2,c_3 addx c_1,%g0,c_1 umul a_5,b_5,t_1 !=!mul_add_c(a[5],b[5],c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 addxcc c_3,t_2,c_3 addx c_1,%g0,c_1 != umul a_4,b_6,t_1 !mul_add_c(a[4],b[6],c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 addxcc c_3,t_2,c_3 != addx c_1,%g0,c_1 umul a_3,b_7,t_1 !mul_add_c(a[3],b[7],c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 != addxcc c_3,t_2,c_3 addx c_1,%g0,c_1 st c_2,r(10) !r[10]=c2; umul a_4,b_7,t_1 !=!mul_add_c(a[4],b[7],c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 addxcc c_1,t_2,c_1 addx %g0,%g0,c_2 != umul a_5,b_6,t_1 !mul_add_c(a[5],b[6],c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 addxcc c_1,t_2,c_1 != addx c_2,%g0,c_2 umul a_6,b_5,t_1 !mul_add_c(a[6],b[5],c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 != addxcc c_1,t_2,c_1 addx c_2,%g0,c_2 umul a_7,b_4,t_1 !mul_add_c(a[7],b[4],c3,c1,c2); addcc c_3,t_1,c_3 != rd %y,t_2 addxcc c_1,t_2,c_1 st c_3,r(11) !r[11]=c3; addx c_2,%g0,c_2 != umul a_7,b_5,t_1 !mul_add_c(a[7],b[5],c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 addxcc c_2,t_2,c_2 != addx %g0,%g0,c_3 umul a_6,b_6,t_1 !mul_add_c(a[6],b[6],c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 != addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 umul a_5,b_7,t_1 !mul_add_c(a[5],b[7],c1,c2,c3); addcc c_1,t_1,c_1 != rd %y,t_2 addxcc c_2,t_2,c_2 st c_1,r(12) !r[12]=c1; addx c_3,%g0,c_3 != umul a_6,b_7,t_1 !mul_add_c(a[6],b[7],c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 addxcc c_3,t_2,c_3 != addx %g0,%g0,c_1 umul a_7,b_6,t_1 !mul_add_c(a[7],b[6],c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 != addxcc c_3,t_2,c_3 addx c_1,%g0,c_1 st c_2,r(13) !r[13]=c2; umul a_7,b_7,t_1 !=!mul_add_c(a[7],b[7],c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 addxcc c_1,t_2,c_1 nop != st c_3,r(14) !r[14]=c3; st c_1,r(15) !r[15]=c1; ret restore %g0,%g0,%o0 .type bn_mul_comba8,#function .size bn_mul_comba8,(.-bn_mul_comba8) .align 32 .global bn_mul_comba4 /* * void bn_mul_comba4(r,a,b) * BN_ULONG *r,*a,*b; */ bn_mul_comba4: save %sp,FRAME_SIZE,%sp ld a(0),a_0 ld b(0),b_0 umul a_0,b_0,c_1 !=!mul_add_c(a[0],b[0],c1,c2,c3); ld b(1),b_1 rd %y,c_2 st c_1,r(0) !r[0]=c1; umul a_0,b_1,t_1 !=!mul_add_c(a[0],b[1],c2,c3,c1); ld a(1),a_1 addcc c_2,t_1,c_2 rd %y,t_2 != addxcc %g0,t_2,c_3 addx %g0,%g0,c_1 ld a(2),a_2 umul a_1,b_0,t_1 !=!mul_add_c(a[1],b[0],c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 addxcc c_3,t_2,c_3 addx c_1,%g0,c_1 != st c_2,r(1) !r[1]=c2; umul a_2,b_0,t_1 !mul_add_c(a[2],b[0],c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 != addxcc c_1,t_2,c_1 addx %g0,%g0,c_2 ld b(2),b_2 umul a_1,b_1,t_1 !=!mul_add_c(a[1],b[1],c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 addxcc c_1,t_2,c_1 addx c_2,%g0,c_2 != ld b(3),b_3 umul a_0,b_2,t_1 !mul_add_c(a[0],b[2],c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 != addxcc c_1,t_2,c_1 addx c_2,%g0,c_2 st c_3,r(2) !r[2]=c3; umul a_0,b_3,t_1 !=!mul_add_c(a[0],b[3],c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 addxcc c_2,t_2,c_2 addx %g0,%g0,c_3 != umul a_1,b_2,t_1 !mul_add_c(a[1],b[2],c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 addxcc c_2,t_2,c_2 != addx c_3,%g0,c_3 ld a(3),a_3 umul a_2,b_1,t_1 !mul_add_c(a[2],b[1],c1,c2,c3); addcc c_1,t_1,c_1 != rd %y,t_2 addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 umul a_3,b_0,t_1 !=!mul_add_c(a[3],b[0],c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 != st c_1,r(3) !r[3]=c1; umul a_3,b_1,t_1 !mul_add_c(a[3],b[1],c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 != addxcc c_3,t_2,c_3 addx %g0,%g0,c_1 umul a_2,b_2,t_1 !mul_add_c(a[2],b[2],c2,c3,c1); addcc c_2,t_1,c_2 != rd %y,t_2 addxcc c_3,t_2,c_3 addx c_1,%g0,c_1 umul a_1,b_3,t_1 !=!mul_add_c(a[1],b[3],c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 addxcc c_3,t_2,c_3 addx c_1,%g0,c_1 != st c_2,r(4) !r[4]=c2; umul a_2,b_3,t_1 !mul_add_c(a[2],b[3],c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 != addxcc c_1,t_2,c_1 addx %g0,%g0,c_2 umul a_3,b_2,t_1 !mul_add_c(a[3],b[2],c3,c1,c2); addcc c_3,t_1,c_3 != rd %y,t_2 addxcc c_1,t_2,c_1 st c_3,r(5) !r[5]=c3; addx c_2,%g0,c_2 != umul a_3,b_3,t_1 !mul_add_c(a[3],b[3],c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 addxcc c_2,t_2,c_2 != st c_1,r(6) !r[6]=c1; st c_2,r(7) !r[7]=c2; ret restore %g0,%g0,%o0 .type bn_mul_comba4,#function .size bn_mul_comba4,(.-bn_mul_comba4) .align 32 .global bn_sqr_comba8 bn_sqr_comba8: save %sp,FRAME_SIZE,%sp ld a(0),a_0 ld a(1),a_1 umul a_0,a_0,c_1 !=!sqr_add_c(a,0,c1,c2,c3); rd %y,c_2 st c_1,r(0) !r[0]=c1; ld a(2),a_2 umul a_0,a_1,t_1 !=!sqr_add_c2(a,1,0,c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 addxcc %g0,t_2,c_3 addx %g0,%g0,c_1 != addcc c_2,t_1,c_2 addxcc c_3,t_2,c_3 st c_2,r(1) !r[1]=c2; addx c_1,%g0,c_1 != umul a_2,a_0,t_1 !sqr_add_c2(a,2,0,c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 addxcc c_1,t_2,c_1 != addx %g0,%g0,c_2 addcc c_3,t_1,c_3 addxcc c_1,t_2,c_1 addx c_2,%g0,c_2 != ld a(3),a_3 umul a_1,a_1,t_1 !sqr_add_c(a,1,c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 != addxcc c_1,t_2,c_1 addx c_2,%g0,c_2 st c_3,r(2) !r[2]=c3; umul a_0,a_3,t_1 !=!sqr_add_c2(a,3,0,c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 addxcc c_2,t_2,c_2 addx %g0,%g0,c_3 != addcc c_1,t_1,c_1 addxcc c_2,t_2,c_2 ld a(4),a_4 addx c_3,%g0,c_3 != umul a_1,a_2,t_1 !sqr_add_c2(a,2,1,c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 addxcc c_2,t_2,c_2 != addx c_3,%g0,c_3 addcc c_1,t_1,c_1 addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 != st c_1,r(3) !r[3]=c1; umul a_4,a_0,t_1 !sqr_add_c2(a,4,0,c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 != addxcc c_3,t_2,c_3 addx %g0,%g0,c_1 addcc c_2,t_1,c_2 addxcc c_3,t_2,c_3 != addx c_1,%g0,c_1 umul a_3,a_1,t_1 !sqr_add_c2(a,3,1,c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 != addxcc c_3,t_2,c_3 addx c_1,%g0,c_1 addcc c_2,t_1,c_2 addxcc c_3,t_2,c_3 != addx c_1,%g0,c_1 ld a(5),a_5 umul a_2,a_2,t_1 !sqr_add_c(a,2,c2,c3,c1); addcc c_2,t_1,c_2 != rd %y,t_2 addxcc c_3,t_2,c_3 st c_2,r(4) !r[4]=c2; addx c_1,%g0,c_1 != umul a_0,a_5,t_1 !sqr_add_c2(a,5,0,c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 addxcc c_1,t_2,c_1 != addx %g0,%g0,c_2 addcc c_3,t_1,c_3 addxcc c_1,t_2,c_1 addx c_2,%g0,c_2 != umul a_1,a_4,t_1 !sqr_add_c2(a,4,1,c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 addxcc c_1,t_2,c_1 != addx c_2,%g0,c_2 addcc c_3,t_1,c_3 addxcc c_1,t_2,c_1 addx c_2,%g0,c_2 != ld a(6),a_6 umul a_2,a_3,t_1 !sqr_add_c2(a,3,2,c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 != addxcc c_1,t_2,c_1 addx c_2,%g0,c_2 addcc c_3,t_1,c_3 addxcc c_1,t_2,c_1 != addx c_2,%g0,c_2 st c_3,r(5) !r[5]=c3; umul a_6,a_0,t_1 !sqr_add_c2(a,6,0,c1,c2,c3); addcc c_1,t_1,c_1 != rd %y,t_2 addxcc c_2,t_2,c_2 addx %g0,%g0,c_3 addcc c_1,t_1,c_1 != addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 umul a_5,a_1,t_1 !sqr_add_c2(a,5,1,c1,c2,c3); addcc c_1,t_1,c_1 != rd %y,t_2 addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 addcc c_1,t_1,c_1 != addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 umul a_4,a_2,t_1 !sqr_add_c2(a,4,2,c1,c2,c3); addcc c_1,t_1,c_1 != rd %y,t_2 addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 addcc c_1,t_1,c_1 != addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 ld a(7),a_7 umul a_3,a_3,t_1 !=!sqr_add_c(a,3,c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 != st c_1,r(6) !r[6]=c1; umul a_0,a_7,t_1 !sqr_add_c2(a,7,0,c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 != addxcc c_3,t_2,c_3 addx %g0,%g0,c_1 addcc c_2,t_1,c_2 addxcc c_3,t_2,c_3 != addx c_1,%g0,c_1 umul a_1,a_6,t_1 !sqr_add_c2(a,6,1,c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 != addxcc c_3,t_2,c_3 addx c_1,%g0,c_1 addcc c_2,t_1,c_2 addxcc c_3,t_2,c_3 != addx c_1,%g0,c_1 umul a_2,a_5,t_1 !sqr_add_c2(a,5,2,c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 != addxcc c_3,t_2,c_3 addx c_1,%g0,c_1 addcc c_2,t_1,c_2 addxcc c_3,t_2,c_3 != addx c_1,%g0,c_1 umul a_3,a_4,t_1 !sqr_add_c2(a,4,3,c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 != addxcc c_3,t_2,c_3 addx c_1,%g0,c_1 addcc c_2,t_1,c_2 addxcc c_3,t_2,c_3 != addx c_1,%g0,c_1 st c_2,r(7) !r[7]=c2; umul a_7,a_1,t_1 !sqr_add_c2(a,7,1,c3,c1,c2); addcc c_3,t_1,c_3 != rd %y,t_2 addxcc c_1,t_2,c_1 addx %g0,%g0,c_2 addcc c_3,t_1,c_3 != addxcc c_1,t_2,c_1 addx c_2,%g0,c_2 umul a_6,a_2,t_1 !sqr_add_c2(a,6,2,c3,c1,c2); addcc c_3,t_1,c_3 != rd %y,t_2 addxcc c_1,t_2,c_1 addx c_2,%g0,c_2 addcc c_3,t_1,c_3 != addxcc c_1,t_2,c_1 addx c_2,%g0,c_2 umul a_5,a_3,t_1 !sqr_add_c2(a,5,3,c3,c1,c2); addcc c_3,t_1,c_3 != rd %y,t_2 addxcc c_1,t_2,c_1 addx c_2,%g0,c_2 addcc c_3,t_1,c_3 != addxcc c_1,t_2,c_1 addx c_2,%g0,c_2 umul a_4,a_4,t_1 !sqr_add_c(a,4,c3,c1,c2); addcc c_3,t_1,c_3 != rd %y,t_2 addxcc c_1,t_2,c_1 st c_3,r(8) !r[8]=c3; addx c_2,%g0,c_2 != umul a_2,a_7,t_1 !sqr_add_c2(a,7,2,c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 addxcc c_2,t_2,c_2 != addx %g0,%g0,c_3 addcc c_1,t_1,c_1 addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 != umul a_3,a_6,t_1 !sqr_add_c2(a,6,3,c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 addxcc c_2,t_2,c_2 != addx c_3,%g0,c_3 addcc c_1,t_1,c_1 addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 != umul a_4,a_5,t_1 !sqr_add_c2(a,5,4,c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 addxcc c_2,t_2,c_2 != addx c_3,%g0,c_3 addcc c_1,t_1,c_1 addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 != st c_1,r(9) !r[9]=c1; umul a_7,a_3,t_1 !sqr_add_c2(a,7,3,c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 != addxcc c_3,t_2,c_3 addx %g0,%g0,c_1 addcc c_2,t_1,c_2 addxcc c_3,t_2,c_3 != addx c_1,%g0,c_1 umul a_6,a_4,t_1 !sqr_add_c2(a,6,4,c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 != addxcc c_3,t_2,c_3 addx c_1,%g0,c_1 addcc c_2,t_1,c_2 addxcc c_3,t_2,c_3 != addx c_1,%g0,c_1 umul a_5,a_5,t_1 !sqr_add_c(a,5,c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 != addxcc c_3,t_2,c_3 addx c_1,%g0,c_1 st c_2,r(10) !r[10]=c2; umul a_4,a_7,t_1 !=!sqr_add_c2(a,7,4,c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 addxcc c_1,t_2,c_1 addx %g0,%g0,c_2 != addcc c_3,t_1,c_3 addxcc c_1,t_2,c_1 addx c_2,%g0,c_2 umul a_5,a_6,t_1 !=!sqr_add_c2(a,6,5,c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 addxcc c_1,t_2,c_1 addx c_2,%g0,c_2 != addcc c_3,t_1,c_3 addxcc c_1,t_2,c_1 st c_3,r(11) !r[11]=c3; addx c_2,%g0,c_2 != umul a_7,a_5,t_1 !sqr_add_c2(a,7,5,c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 addxcc c_2,t_2,c_2 != addx %g0,%g0,c_3 addcc c_1,t_1,c_1 addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 != umul a_6,a_6,t_1 !sqr_add_c(a,6,c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 addxcc c_2,t_2,c_2 != addx c_3,%g0,c_3 st c_1,r(12) !r[12]=c1; umul a_6,a_7,t_1 !sqr_add_c2(a,7,6,c2,c3,c1); addcc c_2,t_1,c_2 != rd %y,t_2 addxcc c_3,t_2,c_3 addx %g0,%g0,c_1 addcc c_2,t_1,c_2 != rd %y,t_2 addxcc c_3,t_2,c_3 st c_2,r(13) !r[13]=c2; addx c_1,%g0,c_1 != umul a_7,a_7,t_1 !sqr_add_c(a,7,c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 addxcc c_1,t_2,c_1 != st c_3,r(14) !r[14]=c3; st c_1,r(15) !r[15]=c1; ret restore %g0,%g0,%o0 .type bn_sqr_comba8,#function .size bn_sqr_comba8,(.-bn_sqr_comba8) .align 32 .global bn_sqr_comba4 /* * void bn_sqr_comba4(r,a) * BN_ULONG *r,*a; */ bn_sqr_comba4: save %sp,FRAME_SIZE,%sp ld a(0),a_0 umul a_0,a_0,c_1 !sqr_add_c(a,0,c1,c2,c3); ld a(1),a_1 != rd %y,c_2 st c_1,r(0) !r[0]=c1; ld a(1),a_1 umul a_0,a_1,t_1 !=!sqr_add_c2(a,1,0,c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 addxcc %g0,t_2,c_3 addx %g0,%g0,c_1 != ld a(2),a_2 addcc c_2,t_1,c_2 addxcc c_3,t_2,c_3 addx c_1,%g0,c_1 != st c_2,r(1) !r[1]=c2; umul a_2,a_0,t_1 !sqr_add_c2(a,2,0,c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 != addxcc c_1,t_2,c_1 addx %g0,%g0,c_2 addcc c_3,t_1,c_3 addxcc c_1,t_2,c_1 != addx c_2,%g0,c_2 ld a(3),a_3 umul a_1,a_1,t_1 !sqr_add_c(a,1,c3,c1,c2); addcc c_3,t_1,c_3 != rd %y,t_2 addxcc c_1,t_2,c_1 st c_3,r(2) !r[2]=c3; addx c_2,%g0,c_2 != umul a_0,a_3,t_1 !sqr_add_c2(a,3,0,c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 addxcc c_2,t_2,c_2 != addx %g0,%g0,c_3 addcc c_1,t_1,c_1 addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 != umul a_1,a_2,t_1 !sqr_add_c2(a,2,1,c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 addxcc c_2,t_2,c_2 != addx c_3,%g0,c_3 addcc c_1,t_1,c_1 addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 != st c_1,r(3) !r[3]=c1; umul a_3,a_1,t_1 !sqr_add_c2(a,3,1,c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 != addxcc c_3,t_2,c_3 addx %g0,%g0,c_1 addcc c_2,t_1,c_2 addxcc c_3,t_2,c_3 != addx c_1,%g0,c_1 umul a_2,a_2,t_1 !sqr_add_c(a,2,c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 != addxcc c_3,t_2,c_3 addx c_1,%g0,c_1 st c_2,r(4) !r[4]=c2; umul a_2,a_3,t_1 !=!sqr_add_c2(a,3,2,c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 addxcc c_1,t_2,c_1 addx %g0,%g0,c_2 != addcc c_3,t_1,c_3 addxcc c_1,t_2,c_1 st c_3,r(5) !r[5]=c3; addx c_2,%g0,c_2 != umul a_3,a_3,t_1 !sqr_add_c(a,3,c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 addxcc c_2,t_2,c_2 != st c_1,r(6) !r[6]=c1; st c_2,r(7) !r[7]=c2; ret restore %g0,%g0,%o0 .type bn_sqr_comba4,#function .size bn_sqr_comba4,(.-bn_sqr_comba4) .align 32
.ident "bn_asm.sparc.v8.S, Version 1.2" .ident "SPARC v8 ISA artwork by Andy Polyakov <[EMAIL PROTECTED]>" /* * ==================================================================== * Copyright (c) 1999 Andy Polyakov <[EMAIL PROTECTED]>. * * Rights for redistribution and usage in source and binary forms are * granted as long as above copyright notices are retained. Warranty * of any kind is (of course:-) disclaimed. * ==================================================================== */ /* * This is my modest contributon to OpenSSL project (see * http://www.openssl.org/ for more information about it) and is * a drop-in SuperSPARC ISA replacement for crypto/bn/bn_asm.c * module. For updates see http://fy.chalmers.se/~appro/hpe/. * * See bn_asm.sparc.v8plus.S for more details. */ /* * Revision history. * * 1.1 - new loop unrolling model(*); * 1.2 - made gas friendly; * * (*) see bn_asm.sparc.v8plus.S for details */ .section ".text",#alloc,#execinstr .file "bn_asm.sparc.v8.S" .align 32 .global bn_mul_add_words /* * BN_ULONG bn_mul_add_words(rp,ap,num,w) * BN_ULONG *rp,*ap; * int num; * BN_ULONG w; */ bn_mul_add_words: cmp %o2,0 bg,a .L_bn_mul_add_words_proceed ld [%o1],%g2 retl clr %o0 .L_bn_mul_add_words_proceed: andcc %o2,-4,%g0 bz .L_bn_mul_add_words_tail clr %o5 umul %o3,%g2,%g2 ld [%o0],%o4 rd %y,%g1 addcc %o4,%g2,%o4 ld [%o1+4],%g3 addx %g1,0,%o5 ba .L_bn_mul_add_words_warm_loop st %o4,[%o0] .L_bn_mul_add_words_loop: ld [%o0],%o4 umul %o3,%g2,%g2 rd %y,%g1 addcc %o4,%o5,%o4 ld [%o1+4],%g3 addx %g1,0,%g1 addcc %o4,%g2,%o4 nop addx %g1,0,%o5 st %o4,[%o0] .L_bn_mul_add_words_warm_loop: ld [%o0+4],%o4 umul %o3,%g3,%g3 dec 4,%o2 rd %y,%g1 addcc %o4,%o5,%o4 ld [%o1+8],%g2 addx %g1,0,%g1 addcc %o4,%g3,%o4 addx %g1,0,%o5 st %o4,[%o0+4] ld [%o0+8],%o4 umul %o3,%g2,%g2 inc 16,%o1 rd %y,%g1 addcc %o4,%o5,%o4 ld [%o1-4],%g3 addx %g1,0,%g1 addcc %o4,%g2,%o4 addx %g1,0,%o5 st %o4,[%o0+8] ld [%o0+12],%o4 umul %o3,%g3,%g3 inc 16,%o0 rd %y,%g1 addcc %o4,%o5,%o4 addx %g1,0,%g1 addcc %o4,%g3,%o4 addx %g1,0,%o5 st %o4,[%o0-4] andcc %o2,-4,%g0 bnz,a .L_bn_mul_add_words_loop ld [%o1],%g2 tst %o2 bnz,a .L_bn_mul_add_words_tail ld [%o1],%g2 .L_bn_mul_add_words_return: retl mov %o5,%o0 nop .L_bn_mul_add_words_tail: ld [%o0],%o4 umul %o3,%g2,%g2 addcc %o4,%o5,%o4 rd %y,%g1 addx %g1,0,%g1 addcc %o4,%g2,%o4 addx %g1,0,%o5 deccc %o2 bz .L_bn_mul_add_words_return st %o4,[%o0] ld [%o1+4],%g2 umul %o3,%g2,%g2 ld [%o0+4],%o4 rd %y,%g1 addcc %o4,%o5,%o4 nop addx %g1,0,%g1 addcc %o4,%g2,%o4 addx %g1,0,%o5 deccc %o2 bz .L_bn_mul_add_words_return st %o4,[%o0+4] ld [%o1+8],%g2 umul %o3,%g2,%g2 ld [%o0+8],%o4 rd %y,%g1 addcc %o4,%o5,%o4 addx %g1,0,%g1 addcc %o4,%g2,%o4 st %o4,[%o0+8] retl addx %g1,0,%o0 .type bn_mul_add_words,#function .size bn_mul_add_words,(.-bn_mul_add_words) .align 32 .global bn_mul_words /* * BN_ULONG bn_mul_words(rp,ap,num,w) * BN_ULONG *rp,*ap; * int num; * BN_ULONG w; */ bn_mul_words: cmp %o2,0 bg,a .L_bn_mul_words_proceeed ld [%o1],%g2 retl clr %o0 .L_bn_mul_words_proceeed: andcc %o2,-4,%g0 bz .L_bn_mul_words_tail clr %o5 .L_bn_mul_words_loop: ld [%o1+4],%g3 umul %o3,%g2,%g2 addcc %g2,%o5,%g2 rd %y,%g1 addx %g1,0,%o5 st %g2,[%o0] ld [%o1+8],%g2 umul %o3,%g3,%g3 addcc %g3,%o5,%g3 rd %y,%g1 dec 4,%o2 addx %g1,0,%o5 st %g3,[%o0+4] ld [%o1+12],%g3 umul %o3,%g2,%g2 addcc %g2,%o5,%g2 rd %y,%g1 inc 16,%o1 st %g2,[%o0+8] addx %g1,0,%o5 umul %o3,%g3,%g3 addcc %g3,%o5,%g3 rd %y,%g1 inc 16,%o0 addx %g1,0,%o5 st %g3,[%o0-4] andcc %o2,-4,%g0 nop bnz,a .L_bn_mul_words_loop ld [%o1],%g2 tst %o2 bnz,a .L_bn_mul_words_tail ld [%o1],%g2 .L_bn_mul_words_return: retl mov %o5,%o0 nop .L_bn_mul_words_tail: umul %o3,%g2,%g2 addcc %g2,%o5,%g2 rd %y,%g1 addx %g1,0,%o5 deccc %o2 bz .L_bn_mul_words_return st %g2,[%o0] nop ld [%o1+4],%g2 umul %o3,%g2,%g2 addcc %g2,%o5,%g2 rd %y,%g1 addx %g1,0,%o5 deccc %o2 bz .L_bn_mul_words_return st %g2,[%o0+4] ld [%o1+8],%g2 umul %o3,%g2,%g2 addcc %g2,%o5,%g2 rd %y,%g1 st %g2,[%o0+8] retl addx %g1,0,%o0 .type bn_mul_words,#function .size bn_mul_words,(.-bn_mul_words) .align 32 .global bn_sqr_words /* * void bn_sqr_words(r,a,n) * BN_ULONG *r,*a; * int n; */ bn_sqr_words: cmp %o2,0 bg,a .L_bn_sqr_words_proceeed ld [%o1],%g2 retl clr %o0 .L_bn_sqr_words_proceeed: andcc %o2,-4,%g0 bz .L_bn_sqr_words_tail clr %o5 .L_bn_sqr_words_loop: ld [%o1+4],%g3 umul %g2,%g2,%o4 st %o4,[%o0] rd %y,%o5 st %o5,[%o0+4] ld [%o1+8],%g2 umul %g3,%g3,%o4 dec 4,%o2 st %o4,[%o0+8] rd %y,%o5 st %o5,[%o0+12] nop ld [%o1+12],%g3 umul %g2,%g2,%o4 st %o4,[%o0+16] rd %y,%o5 inc 16,%o1 st %o5,[%o0+20] umul %g3,%g3,%o4 inc 32,%o0 st %o4,[%o0-8] rd %y,%o5 st %o5,[%o0-4] andcc %o2,-4,%g2 bnz,a .L_bn_sqr_words_loop ld [%o1],%g2 tst %o2 nop bnz,a .L_bn_sqr_words_tail ld [%o1],%g2 .L_bn_sqr_words_return: retl clr %o0 .L_bn_sqr_words_tail: umul %g2,%g2,%o4 st %o4,[%o0] deccc %o2 rd %y,%o5 bz .L_bn_sqr_words_return st %o5,[%o0+4] ld [%o1+4],%g2 umul %g2,%g2,%o4 st %o4,[%o0+8] deccc %o2 rd %y,%o5 nop bz .L_bn_sqr_words_return st %o5,[%o0+12] ld [%o1+8],%g2 umul %g2,%g2,%o4 st %o4,[%o0+16] rd %y,%o5 st %o5,[%o0+20] retl clr %o0 .type bn_sqr_words,#function .size bn_sqr_words,(.-bn_sqr_words) .align 32 .global bn_div_words /* * BN_ULONG bn_div_words(h,l,d) * BN_ULONG h,l,d; */ bn_div_words: wr %o0,%y udiv %o1,%o2,%o0 retl nop .type bn_div_words,#function .size bn_div_words,(.-bn_div_words) .align 32 .global bn_add_words /* * BN_ULONG bn_add_words(rp,ap,bp,n) * BN_ULONG *rp,*ap,*bp; * int n; */ bn_add_words: cmp %o3,0 bg,a .L_bn_add_words_proceed ld [%o1],%o4 retl clr %o0 .L_bn_add_words_proceed: andcc %o3,-4,%g0 bz .L_bn_add_words_tail clr %g1 ld [%o2],%o5 dec 4,%o3 addcc %o5,%o4,%o5 nop st %o5,[%o0] ba .L_bn_add_words_warm_loop ld [%o1+4],%o4 nop .L_bn_add_words_loop: ld [%o1],%o4 dec 4,%o3 ld [%o2],%o5 addxcc %o5,%o4,%o5 st %o5,[%o0] ld [%o1+4],%o4 .L_bn_add_words_warm_loop: inc 16,%o1 ld [%o2+4],%o5 addxcc %o5,%o4,%o5 st %o5,[%o0+4] ld [%o1-8],%o4 inc 16,%o2 ld [%o2-8],%o5 addxcc %o5,%o4,%o5 st %o5,[%o0+8] ld [%o1-4],%o4 inc 16,%o0 ld [%o2-4],%o5 addxcc %o5,%o4,%o5 st %o5,[%o0-4] addx %g0,0,%g1 andcc %o3,-4,%g0 bnz,a .L_bn_add_words_loop addcc %g1,-1,%g0 tst %o3 nop bnz,a .L_bn_add_words_tail ld [%o1],%o4 .L_bn_add_words_return: retl mov %g1,%o0 .L_bn_add_words_tail: addcc %g1,-1,%g0 ld [%o2],%o5 addxcc %o5,%o4,%o5 addx %g0,0,%g1 deccc %o3 bz .L_bn_add_words_return st %o5,[%o0] nop ld [%o1+4],%o4 addcc %g1,-1,%g0 ld [%o2+4],%o5 addxcc %o5,%o4,%o5 addx %g0,0,%g1 deccc %o3 bz .L_bn_add_words_return st %o5,[%o0+4] ld [%o1+8],%o4 addcc %g1,-1,%g0 ld [%o2+8],%o5 addxcc %o5,%o4,%o5 st %o5,[%o0+8] retl addx %g0,0,%o0 .type bn_add_words,#function .size bn_add_words,(.-bn_add_words) .align 32 .global bn_sub_words /* * BN_ULONG bn_sub_words(rp,ap,bp,n) * BN_ULONG *rp,*ap,*bp; * int n; */ bn_sub_words: cmp %o3,0 bg,a .L_bn_sub_words_proceed ld [%o1],%o4 retl clr %o0 .L_bn_sub_words_proceed: andcc %o3,-4,%g0 bz .L_bn_sub_words_tail clr %g1 ld [%o2],%o5 dec 4,%o3 subcc %o4,%o5,%o5 nop st %o5,[%o0] ba .L_bn_sub_words_warm_loop ld [%o1+4],%o4 nop .L_bn_sub_words_loop: ld [%o1],%o4 dec 4,%o3 ld [%o2],%o5 subxcc %o4,%o5,%o5 st %o5,[%o0] ld [%o1+4],%o4 .L_bn_sub_words_warm_loop: inc 16,%o1 ld [%o2+4],%o5 subxcc %o4,%o5,%o5 st %o5,[%o0+4] ld [%o1-8],%o4 inc 16,%o2 ld [%o2-8],%o5 subxcc %o4,%o5,%o5 st %o5,[%o0+8] ld [%o1-4],%o4 inc 16,%o0 ld [%o2-4],%o5 subxcc %o4,%o5,%o5 st %o5,[%o0-4] addx %g0,0,%g1 andcc %o3,-4,%g0 bnz,a .L_bn_sub_words_loop addcc %g1,-1,%g0 tst %o3 nop bnz,a .L_bn_sub_words_tail ld [%o1],%o4 .L_bn_sub_words_return: retl mov %g1,%o0 .L_bn_sub_words_tail: addcc %g1,-1,%g0 ld [%o2],%o5 subxcc %o4,%o5,%o5 addx %g0,0,%g1 deccc %o3 bz .L_bn_sub_words_return st %o5,[%o0] nop ld [%o1+4],%o4 addcc %g1,-1,%g0 ld [%o2+4],%o5 subxcc %o4,%o5,%o5 addx %g0,0,%g1 deccc %o3 bz .L_bn_sub_words_return st %o5,[%o0+4] ld [%o1+8],%o4 addcc %g1,-1,%g0 ld [%o2+8],%o5 subxcc %o4,%o5,%o5 st %o5,[%o0+8] retl addx %g0,0,%o0 .type bn_sub_words,#function .size bn_sub_words,(.-bn_sub_words) #define FRAME_SIZE -96 /* * Here is register usage map for *all* routines below. */ #define t_1 %o0 #define t_2 %o1 #define c_1 %o2 #define c_2 %o3 #define c_3 %o4 #define a(I) [%i1+4*I] #define b(I) [%i2+4*I] #define r(I) [%i0+4*I] #define a_0 %l0 #define a_1 %l1 #define a_2 %l2 #define a_3 %l3 #define a_4 %l4 #define a_5 %l5 #define a_6 %l6 #define a_7 %l7 #define b_0 %i3 #define b_1 %i4 #define b_2 %i5 #define b_3 %o5 #define b_4 %g1 #define b_5 %g2 #define b_6 %g3 #define b_7 %g4 .align 32 .global bn_mul_comba8 /* * void bn_mul_comba8(r,a,b) * BN_ULONG *r,*a,*b; */ bn_mul_comba8: save %sp,FRAME_SIZE,%sp ld a(0),a_0 ld b(0),b_0 umul a_0,b_0,c_1 !=!mul_add_c(a[0],b[0],c1,c2,c3); ld b(1),b_1 rd %y,c_2 st c_1,r(0) !r[0]=c1; umul a_0,b_1,t_1 !=!mul_add_c(a[0],b[1],c2,c3,c1); ld a(1),a_1 addcc c_2,t_1,c_2 rd %y,t_2 addxcc %g0,t_2,c_3 != addx %g0,%g0,c_1 ld a(2),a_2 umul a_1,b_0,t_1 !mul_add_c(a[1],b[0],c2,c3,c1); addcc c_2,t_1,c_2 != rd %y,t_2 addxcc c_3,t_2,c_3 st c_2,r(1) !r[1]=c2; addx c_1,%g0,c_1 != umul a_2,b_0,t_1 !mul_add_c(a[2],b[0],c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 addxcc c_1,t_2,c_1 != addx %g0,%g0,c_2 ld b(2),b_2 umul a_1,b_1,t_1 !mul_add_c(a[1],b[1],c3,c1,c2); addcc c_3,t_1,c_3 != rd %y,t_2 addxcc c_1,t_2,c_1 ld b(3),b_3 addx c_2,%g0,c_2 != umul a_0,b_2,t_1 !mul_add_c(a[0],b[2],c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 addxcc c_1,t_2,c_1 != addx c_2,%g0,c_2 st c_3,r(2) !r[2]=c3; umul a_0,b_3,t_1 !mul_add_c(a[0],b[3],c1,c2,c3); addcc c_1,t_1,c_1 != rd %y,t_2 addxcc c_2,t_2,c_2 addx %g0,%g0,c_3 umul a_1,b_2,t_1 !=!mul_add_c(a[1],b[2],c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 != ld a(3),a_3 umul a_2,b_1,t_1 !mul_add_c(a[2],b[1],c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 != addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 ld a(4),a_4 umul a_3,b_0,t_1 !mul_add_c(a[3],b[0],c1,c2,c3);!= addcc c_1,t_1,c_1 rd %y,t_2 addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 != st c_1,r(3) !r[3]=c1; umul a_4,b_0,t_1 !mul_add_c(a[4],b[0],c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 != addxcc c_3,t_2,c_3 addx %g0,%g0,c_1 umul a_3,b_1,t_1 !mul_add_c(a[3],b[1],c2,c3,c1); addcc c_2,t_1,c_2 != rd %y,t_2 addxcc c_3,t_2,c_3 addx c_1,%g0,c_1 umul a_2,b_2,t_1 !=!mul_add_c(a[2],b[2],c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 addxcc c_3,t_2,c_3 addx c_1,%g0,c_1 != ld b(4),b_4 umul a_1,b_3,t_1 !mul_add_c(a[1],b[3],c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 != addxcc c_3,t_2,c_3 addx c_1,%g0,c_1 ld b(5),b_5 umul a_0,b_4,t_1 !=!mul_add_c(a[0],b[4],c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 addxcc c_3,t_2,c_3 addx c_1,%g0,c_1 != st c_2,r(4) !r[4]=c2; umul a_0,b_5,t_1 !mul_add_c(a[0],b[5],c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 != addxcc c_1,t_2,c_1 addx %g0,%g0,c_2 umul a_1,b_4,t_1 !mul_add_c(a[1],b[4],c3,c1,c2); addcc c_3,t_1,c_3 != rd %y,t_2 addxcc c_1,t_2,c_1 addx c_2,%g0,c_2 umul a_2,b_3,t_1 !=!mul_add_c(a[2],b[3],c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 addxcc c_1,t_2,c_1 addx c_2,%g0,c_2 != umul a_3,b_2,t_1 !mul_add_c(a[3],b[2],c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 addxcc c_1,t_2,c_1 != addx c_2,%g0,c_2 ld a(5),a_5 umul a_4,b_1,t_1 !mul_add_c(a[4],b[1],c3,c1,c2); addcc c_3,t_1,c_3 != rd %y,t_2 addxcc c_1,t_2,c_1 ld a(6),a_6 addx c_2,%g0,c_2 != umul a_5,b_0,t_1 !mul_add_c(a[5],b[0],c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 addxcc c_1,t_2,c_1 != addx c_2,%g0,c_2 st c_3,r(5) !r[5]=c3; umul a_6,b_0,t_1 !mul_add_c(a[6],b[0],c1,c2,c3); addcc c_1,t_1,c_1 != rd %y,t_2 addxcc c_2,t_2,c_2 addx %g0,%g0,c_3 umul a_5,b_1,t_1 !=!mul_add_c(a[5],b[1],c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 != umul a_4,b_2,t_1 !mul_add_c(a[4],b[2],c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 addxcc c_2,t_2,c_2 != addx c_3,%g0,c_3 umul a_3,b_3,t_1 !mul_add_c(a[3],b[3],c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 != addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 umul a_2,b_4,t_1 !mul_add_c(a[2],b[4],c1,c2,c3); addcc c_1,t_1,c_1 != rd %y,t_2 addxcc c_2,t_2,c_2 ld b(6),b_6 addx c_3,%g0,c_3 != umul a_1,b_5,t_1 !mul_add_c(a[1],b[5],c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 addxcc c_2,t_2,c_2 != addx c_3,%g0,c_3 ld b(7),b_7 umul a_0,b_6,t_1 !mul_add_c(a[0],b[6],c1,c2,c3); addcc c_1,t_1,c_1 != rd %y,t_2 addxcc c_2,t_2,c_2 st c_1,r(6) !r[6]=c1; addx c_3,%g0,c_3 != umul a_0,b_7,t_1 !mul_add_c(a[0],b[7],c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 addxcc c_3,t_2,c_3 != addx %g0,%g0,c_1 umul a_1,b_6,t_1 !mul_add_c(a[1],b[6],c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 != addxcc c_3,t_2,c_3 addx c_1,%g0,c_1 umul a_2,b_5,t_1 !mul_add_c(a[2],b[5],c2,c3,c1); addcc c_2,t_1,c_2 != rd %y,t_2 addxcc c_3,t_2,c_3 addx c_1,%g0,c_1 umul a_3,b_4,t_1 !=!mul_add_c(a[3],b[4],c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 addxcc c_3,t_2,c_3 addx c_1,%g0,c_1 != umul a_4,b_3,t_1 !mul_add_c(a[4],b[3],c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 addxcc c_3,t_2,c_3 != addx c_1,%g0,c_1 umul a_5,b_2,t_1 !mul_add_c(a[5],b[2],c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 != addxcc c_3,t_2,c_3 addx c_1,%g0,c_1 ld a(7),a_7 umul a_6,b_1,t_1 !=!mul_add_c(a[6],b[1],c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 addxcc c_3,t_2,c_3 addx c_1,%g0,c_1 != umul a_7,b_0,t_1 !mul_add_c(a[7],b[0],c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 addxcc c_3,t_2,c_3 != addx c_1,%g0,c_1 st c_2,r(7) !r[7]=c2; umul a_7,b_1,t_1 !mul_add_c(a[7],b[1],c3,c1,c2); addcc c_3,t_1,c_3 != rd %y,t_2 addxcc c_1,t_2,c_1 addx %g0,%g0,c_2 umul a_6,b_2,t_1 !=!mul_add_c(a[6],b[2],c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 addxcc c_1,t_2,c_1 addx c_2,%g0,c_2 != umul a_5,b_3,t_1 !mul_add_c(a[5],b[3],c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 addxcc c_1,t_2,c_1 != addx c_2,%g0,c_2 umul a_4,b_4,t_1 !mul_add_c(a[4],b[4],c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 != addxcc c_1,t_2,c_1 addx c_2,%g0,c_2 umul a_3,b_5,t_1 !mul_add_c(a[3],b[5],c3,c1,c2); addcc c_3,t_1,c_3 != rd %y,t_2 addxcc c_1,t_2,c_1 addx c_2,%g0,c_2 umul a_2,b_6,t_1 !=!mul_add_c(a[2],b[6],c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 addxcc c_1,t_2,c_1 addx c_2,%g0,c_2 != umul a_1,b_7,t_1 !mul_add_c(a[1],b[7],c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 addxcc c_1,t_2,c_1 ! addx c_2,%g0,c_2 st c_3,r(8) !r[8]=c3; umul a_2,b_7,t_1 !mul_add_c(a[2],b[7],c1,c2,c3); addcc c_1,t_1,c_1 != rd %y,t_2 addxcc c_2,t_2,c_2 addx %g0,%g0,c_3 umul a_3,b_6,t_1 !=!mul_add_c(a[3],b[6],c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 != umul a_4,b_5,t_1 !mul_add_c(a[4],b[5],c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 addxcc c_2,t_2,c_2 != addx c_3,%g0,c_3 umul a_5,b_4,t_1 !mul_add_c(a[5],b[4],c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 != addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 umul a_6,b_3,t_1 !mul_add_c(a[6],b[3],c1,c2,c3); addcc c_1,t_1,c_1 != rd %y,t_2 addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 umul a_7,b_2,t_1 !=!mul_add_c(a[7],b[2],c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 != st c_1,r(9) !r[9]=c1; umul a_7,b_3,t_1 !mul_add_c(a[7],b[3],c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 != addxcc c_3,t_2,c_3 addx %g0,%g0,c_1 umul a_6,b_4,t_1 !mul_add_c(a[6],b[4],c2,c3,c1); addcc c_2,t_1,c_2 != rd %y,t_2 addxcc c_3,t_2,c_3 addx c_1,%g0,c_1 umul a_5,b_5,t_1 !=!mul_add_c(a[5],b[5],c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 addxcc c_3,t_2,c_3 addx c_1,%g0,c_1 != umul a_4,b_6,t_1 !mul_add_c(a[4],b[6],c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 addxcc c_3,t_2,c_3 != addx c_1,%g0,c_1 umul a_3,b_7,t_1 !mul_add_c(a[3],b[7],c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 != addxcc c_3,t_2,c_3 addx c_1,%g0,c_1 st c_2,r(10) !r[10]=c2; umul a_4,b_7,t_1 !=!mul_add_c(a[4],b[7],c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 addxcc c_1,t_2,c_1 addx %g0,%g0,c_2 != umul a_5,b_6,t_1 !mul_add_c(a[5],b[6],c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 addxcc c_1,t_2,c_1 != addx c_2,%g0,c_2 umul a_6,b_5,t_1 !mul_add_c(a[6],b[5],c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 != addxcc c_1,t_2,c_1 addx c_2,%g0,c_2 umul a_7,b_4,t_1 !mul_add_c(a[7],b[4],c3,c1,c2); addcc c_3,t_1,c_3 != rd %y,t_2 addxcc c_1,t_2,c_1 st c_3,r(11) !r[11]=c3; addx c_2,%g0,c_2 != umul a_7,b_5,t_1 !mul_add_c(a[7],b[5],c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 addxcc c_2,t_2,c_2 != addx %g0,%g0,c_3 umul a_6,b_6,t_1 !mul_add_c(a[6],b[6],c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 != addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 umul a_5,b_7,t_1 !mul_add_c(a[5],b[7],c1,c2,c3); addcc c_1,t_1,c_1 != rd %y,t_2 addxcc c_2,t_2,c_2 st c_1,r(12) !r[12]=c1; addx c_3,%g0,c_3 != umul a_6,b_7,t_1 !mul_add_c(a[6],b[7],c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 addxcc c_3,t_2,c_3 != addx %g0,%g0,c_1 umul a_7,b_6,t_1 !mul_add_c(a[7],b[6],c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 != addxcc c_3,t_2,c_3 addx c_1,%g0,c_1 st c_2,r(13) !r[13]=c2; umul a_7,b_7,t_1 !=!mul_add_c(a[7],b[7],c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 addxcc c_1,t_2,c_1 nop != st c_3,r(14) !r[14]=c3; st c_1,r(15) !r[15]=c1; ret restore %g0,%g0,%o0 .type bn_mul_comba8,#function .size bn_mul_comba8,(.-bn_mul_comba8) .align 32 .global bn_mul_comba4 /* * void bn_mul_comba4(r,a,b) * BN_ULONG *r,*a,*b; */ bn_mul_comba4: save %sp,FRAME_SIZE,%sp ld a(0),a_0 ld b(0),b_0 umul a_0,b_0,c_1 !=!mul_add_c(a[0],b[0],c1,c2,c3); ld b(1),b_1 rd %y,c_2 st c_1,r(0) !r[0]=c1; umul a_0,b_1,t_1 !=!mul_add_c(a[0],b[1],c2,c3,c1); ld a(1),a_1 addcc c_2,t_1,c_2 rd %y,t_2 != addxcc %g0,t_2,c_3 addx %g0,%g0,c_1 ld a(2),a_2 umul a_1,b_0,t_1 !=!mul_add_c(a[1],b[0],c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 addxcc c_3,t_2,c_3 addx c_1,%g0,c_1 != st c_2,r(1) !r[1]=c2; umul a_2,b_0,t_1 !mul_add_c(a[2],b[0],c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 != addxcc c_1,t_2,c_1 addx %g0,%g0,c_2 ld b(2),b_2 umul a_1,b_1,t_1 !=!mul_add_c(a[1],b[1],c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 addxcc c_1,t_2,c_1 addx c_2,%g0,c_2 != ld b(3),b_3 umul a_0,b_2,t_1 !mul_add_c(a[0],b[2],c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 != addxcc c_1,t_2,c_1 addx c_2,%g0,c_2 st c_3,r(2) !r[2]=c3; umul a_0,b_3,t_1 !=!mul_add_c(a[0],b[3],c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 addxcc c_2,t_2,c_2 addx %g0,%g0,c_3 != umul a_1,b_2,t_1 !mul_add_c(a[1],b[2],c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 addxcc c_2,t_2,c_2 != addx c_3,%g0,c_3 ld a(3),a_3 umul a_2,b_1,t_1 !mul_add_c(a[2],b[1],c1,c2,c3); addcc c_1,t_1,c_1 != rd %y,t_2 addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 umul a_3,b_0,t_1 !=!mul_add_c(a[3],b[0],c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 != st c_1,r(3) !r[3]=c1; umul a_3,b_1,t_1 !mul_add_c(a[3],b[1],c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 != addxcc c_3,t_2,c_3 addx %g0,%g0,c_1 umul a_2,b_2,t_1 !mul_add_c(a[2],b[2],c2,c3,c1); addcc c_2,t_1,c_2 != rd %y,t_2 addxcc c_3,t_2,c_3 addx c_1,%g0,c_1 umul a_1,b_3,t_1 !=!mul_add_c(a[1],b[3],c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 addxcc c_3,t_2,c_3 addx c_1,%g0,c_1 != st c_2,r(4) !r[4]=c2; umul a_2,b_3,t_1 !mul_add_c(a[2],b[3],c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 != addxcc c_1,t_2,c_1 addx %g0,%g0,c_2 umul a_3,b_2,t_1 !mul_add_c(a[3],b[2],c3,c1,c2); addcc c_3,t_1,c_3 != rd %y,t_2 addxcc c_1,t_2,c_1 st c_3,r(5) !r[5]=c3; addx c_2,%g0,c_2 != umul a_3,b_3,t_1 !mul_add_c(a[3],b[3],c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 addxcc c_2,t_2,c_2 != st c_1,r(6) !r[6]=c1; st c_2,r(7) !r[7]=c2; ret restore %g0,%g0,%o0 .type bn_mul_comba4,#function .size bn_mul_comba4,(.-bn_mul_comba4) .align 32 .global bn_sqr_comba8 bn_sqr_comba8: save %sp,FRAME_SIZE,%sp ld a(0),a_0 ld a(1),a_1 umul a_0,a_0,c_1 !=!sqr_add_c(a,0,c1,c2,c3); rd %y,c_2 st c_1,r(0) !r[0]=c1; ld a(2),a_2 umul a_0,a_1,t_1 !=!sqr_add_c2(a,1,0,c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 addxcc %g0,t_2,c_3 addx %g0,%g0,c_1 != addcc c_2,t_1,c_2 addxcc c_3,t_2,c_3 st c_2,r(1) !r[1]=c2; addx c_1,%g0,c_1 != umul a_2,a_0,t_1 !sqr_add_c2(a,2,0,c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 addxcc c_1,t_2,c_1 != addx %g0,%g0,c_2 addcc c_3,t_1,c_3 addxcc c_1,t_2,c_1 addx c_2,%g0,c_2 != ld a(3),a_3 umul a_1,a_1,t_1 !sqr_add_c(a,1,c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 != addxcc c_1,t_2,c_1 addx c_2,%g0,c_2 st c_3,r(2) !r[2]=c3; umul a_0,a_3,t_1 !=!sqr_add_c2(a,3,0,c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 addxcc c_2,t_2,c_2 addx %g0,%g0,c_3 != addcc c_1,t_1,c_1 addxcc c_2,t_2,c_2 ld a(4),a_4 addx c_3,%g0,c_3 != umul a_1,a_2,t_1 !sqr_add_c2(a,2,1,c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 addxcc c_2,t_2,c_2 != addx c_3,%g0,c_3 addcc c_1,t_1,c_1 addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 != st c_1,r(3) !r[3]=c1; umul a_4,a_0,t_1 !sqr_add_c2(a,4,0,c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 != addxcc c_3,t_2,c_3 addx %g0,%g0,c_1 addcc c_2,t_1,c_2 addxcc c_3,t_2,c_3 != addx c_1,%g0,c_1 umul a_3,a_1,t_1 !sqr_add_c2(a,3,1,c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 != addxcc c_3,t_2,c_3 addx c_1,%g0,c_1 addcc c_2,t_1,c_2 addxcc c_3,t_2,c_3 != addx c_1,%g0,c_1 ld a(5),a_5 umul a_2,a_2,t_1 !sqr_add_c(a,2,c2,c3,c1); addcc c_2,t_1,c_2 != rd %y,t_2 addxcc c_3,t_2,c_3 st c_2,r(4) !r[4]=c2; addx c_1,%g0,c_1 != umul a_0,a_5,t_1 !sqr_add_c2(a,5,0,c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 addxcc c_1,t_2,c_1 != addx %g0,%g0,c_2 addcc c_3,t_1,c_3 addxcc c_1,t_2,c_1 addx c_2,%g0,c_2 != umul a_1,a_4,t_1 !sqr_add_c2(a,4,1,c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 addxcc c_1,t_2,c_1 != addx c_2,%g0,c_2 addcc c_3,t_1,c_3 addxcc c_1,t_2,c_1 addx c_2,%g0,c_2 != ld a(6),a_6 umul a_2,a_3,t_1 !sqr_add_c2(a,3,2,c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 != addxcc c_1,t_2,c_1 addx c_2,%g0,c_2 addcc c_3,t_1,c_3 addxcc c_1,t_2,c_1 != addx c_2,%g0,c_2 st c_3,r(5) !r[5]=c3; umul a_6,a_0,t_1 !sqr_add_c2(a,6,0,c1,c2,c3); addcc c_1,t_1,c_1 != rd %y,t_2 addxcc c_2,t_2,c_2 addx %g0,%g0,c_3 addcc c_1,t_1,c_1 != addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 umul a_5,a_1,t_1 !sqr_add_c2(a,5,1,c1,c2,c3); addcc c_1,t_1,c_1 != rd %y,t_2 addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 addcc c_1,t_1,c_1 != addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 umul a_4,a_2,t_1 !sqr_add_c2(a,4,2,c1,c2,c3); addcc c_1,t_1,c_1 != rd %y,t_2 addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 addcc c_1,t_1,c_1 != addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 ld a(7),a_7 umul a_3,a_3,t_1 !=!sqr_add_c(a,3,c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 != st c_1,r(6) !r[6]=c1; umul a_0,a_7,t_1 !sqr_add_c2(a,7,0,c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 != addxcc c_3,t_2,c_3 addx %g0,%g0,c_1 addcc c_2,t_1,c_2 addxcc c_3,t_2,c_3 != addx c_1,%g0,c_1 umul a_1,a_6,t_1 !sqr_add_c2(a,6,1,c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 != addxcc c_3,t_2,c_3 addx c_1,%g0,c_1 addcc c_2,t_1,c_2 addxcc c_3,t_2,c_3 != addx c_1,%g0,c_1 umul a_2,a_5,t_1 !sqr_add_c2(a,5,2,c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 != addxcc c_3,t_2,c_3 addx c_1,%g0,c_1 addcc c_2,t_1,c_2 addxcc c_3,t_2,c_3 != addx c_1,%g0,c_1 umul a_3,a_4,t_1 !sqr_add_c2(a,4,3,c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 != addxcc c_3,t_2,c_3 addx c_1,%g0,c_1 addcc c_2,t_1,c_2 addxcc c_3,t_2,c_3 != addx c_1,%g0,c_1 st c_2,r(7) !r[7]=c2; umul a_7,a_1,t_1 !sqr_add_c2(a,7,1,c3,c1,c2); addcc c_3,t_1,c_3 != rd %y,t_2 addxcc c_1,t_2,c_1 addx %g0,%g0,c_2 addcc c_3,t_1,c_3 != addxcc c_1,t_2,c_1 addx c_2,%g0,c_2 umul a_6,a_2,t_1 !sqr_add_c2(a,6,2,c3,c1,c2); addcc c_3,t_1,c_3 != rd %y,t_2 addxcc c_1,t_2,c_1 addx c_2,%g0,c_2 addcc c_3,t_1,c_3 != addxcc c_1,t_2,c_1 addx c_2,%g0,c_2 umul a_5,a_3,t_1 !sqr_add_c2(a,5,3,c3,c1,c2); addcc c_3,t_1,c_3 != rd %y,t_2 addxcc c_1,t_2,c_1 addx c_2,%g0,c_2 addcc c_3,t_1,c_3 != addxcc c_1,t_2,c_1 addx c_2,%g0,c_2 umul a_4,a_4,t_1 !sqr_add_c(a,4,c3,c1,c2); addcc c_3,t_1,c_3 != rd %y,t_2 addxcc c_1,t_2,c_1 st c_3,r(8) !r[8]=c3; addx c_2,%g0,c_2 != umul a_2,a_7,t_1 !sqr_add_c2(a,7,2,c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 addxcc c_2,t_2,c_2 != addx %g0,%g0,c_3 addcc c_1,t_1,c_1 addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 != umul a_3,a_6,t_1 !sqr_add_c2(a,6,3,c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 addxcc c_2,t_2,c_2 != addx c_3,%g0,c_3 addcc c_1,t_1,c_1 addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 != umul a_4,a_5,t_1 !sqr_add_c2(a,5,4,c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 addxcc c_2,t_2,c_2 != addx c_3,%g0,c_3 addcc c_1,t_1,c_1 addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 != st c_1,r(9) !r[9]=c1; umul a_7,a_3,t_1 !sqr_add_c2(a,7,3,c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 != addxcc c_3,t_2,c_3 addx %g0,%g0,c_1 addcc c_2,t_1,c_2 addxcc c_3,t_2,c_3 != addx c_1,%g0,c_1 umul a_6,a_4,t_1 !sqr_add_c2(a,6,4,c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 != addxcc c_3,t_2,c_3 addx c_1,%g0,c_1 addcc c_2,t_1,c_2 addxcc c_3,t_2,c_3 != addx c_1,%g0,c_1 umul a_5,a_5,t_1 !sqr_add_c(a,5,c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 != addxcc c_3,t_2,c_3 addx c_1,%g0,c_1 st c_2,r(10) !r[10]=c2; umul a_4,a_7,t_1 !=!sqr_add_c2(a,7,4,c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 addxcc c_1,t_2,c_1 addx %g0,%g0,c_2 != addcc c_3,t_1,c_3 addxcc c_1,t_2,c_1 addx c_2,%g0,c_2 umul a_5,a_6,t_1 !=!sqr_add_c2(a,6,5,c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 addxcc c_1,t_2,c_1 addx c_2,%g0,c_2 != addcc c_3,t_1,c_3 addxcc c_1,t_2,c_1 st c_3,r(11) !r[11]=c3; addx c_2,%g0,c_2 != umul a_7,a_5,t_1 !sqr_add_c2(a,7,5,c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 addxcc c_2,t_2,c_2 != addx %g0,%g0,c_3 addcc c_1,t_1,c_1 addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 != umul a_6,a_6,t_1 !sqr_add_c(a,6,c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 addxcc c_2,t_2,c_2 != addx c_3,%g0,c_3 st c_1,r(12) !r[12]=c1; umul a_6,a_7,t_1 !sqr_add_c2(a,7,6,c2,c3,c1); addcc c_2,t_1,c_2 != rd %y,t_2 addxcc c_3,t_2,c_3 addx %g0,%g0,c_1 addcc c_2,t_1,c_2 != rd %y,t_2 addxcc c_3,t_2,c_3 st c_2,r(13) !r[13]=c2; addx c_1,%g0,c_1 != umul a_7,a_7,t_1 !sqr_add_c(a,7,c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 addxcc c_1,t_2,c_1 != st c_3,r(14) !r[14]=c3; st c_1,r(15) !r[15]=c1; ret restore %g0,%g0,%o0 .type bn_sqr_comba8,#function .size bn_sqr_comba8,(.-bn_sqr_comba8) .align 32 .global bn_sqr_comba4 /* * void bn_sqr_comba4(r,a) * BN_ULONG *r,*a; */ bn_sqr_comba4: save %sp,FRAME_SIZE,%sp ld a(0),a_0 umul a_0,a_0,c_1 !sqr_add_c(a,0,c1,c2,c3); ld a(1),a_1 != rd %y,c_2 st c_1,r(0) !r[0]=c1; ld a(1),a_1 umul a_0,a_1,t_1 !=!sqr_add_c2(a,1,0,c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 addxcc %g0,t_2,c_3 addx %g0,%g0,c_1 != ld a(2),a_2 addcc c_2,t_1,c_2 addxcc c_3,t_2,c_3 addx c_1,%g0,c_1 != st c_2,r(1) !r[1]=c2; umul a_2,a_0,t_1 !sqr_add_c2(a,2,0,c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 != addxcc c_1,t_2,c_1 addx %g0,%g0,c_2 addcc c_3,t_1,c_3 addxcc c_1,t_2,c_1 != addx c_2,%g0,c_2 ld a(3),a_3 umul a_1,a_1,t_1 !sqr_add_c(a,1,c3,c1,c2); addcc c_3,t_1,c_3 != rd %y,t_2 addxcc c_1,t_2,c_1 st c_3,r(2) !r[2]=c3; addx c_2,%g0,c_2 != umul a_0,a_3,t_1 !sqr_add_c2(a,3,0,c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 addxcc c_2,t_2,c_2 != addx %g0,%g0,c_3 addcc c_1,t_1,c_1 addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 != umul a_1,a_2,t_1 !sqr_add_c2(a,2,1,c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 addxcc c_2,t_2,c_2 != addx c_3,%g0,c_3 addcc c_1,t_1,c_1 addxcc c_2,t_2,c_2 addx c_3,%g0,c_3 != st c_1,r(3) !r[3]=c1; umul a_3,a_1,t_1 !sqr_add_c2(a,3,1,c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 != addxcc c_3,t_2,c_3 addx %g0,%g0,c_1 addcc c_2,t_1,c_2 addxcc c_3,t_2,c_3 != addx c_1,%g0,c_1 umul a_2,a_2,t_1 !sqr_add_c(a,2,c2,c3,c1); addcc c_2,t_1,c_2 rd %y,t_2 != addxcc c_3,t_2,c_3 addx c_1,%g0,c_1 st c_2,r(4) !r[4]=c2; umul a_2,a_3,t_1 !=!sqr_add_c2(a,3,2,c3,c1,c2); addcc c_3,t_1,c_3 rd %y,t_2 addxcc c_1,t_2,c_1 addx %g0,%g0,c_2 != addcc c_3,t_1,c_3 addxcc c_1,t_2,c_1 st c_3,r(5) !r[5]=c3; addx c_2,%g0,c_2 != umul a_3,a_3,t_1 !sqr_add_c(a,3,c1,c2,c3); addcc c_1,t_1,c_1 rd %y,t_2 addxcc c_2,t_2,c_2 != st c_1,r(6) !r[6]=c1; st c_2,r(7) !r[7]=c2; ret restore %g0,%g0,%o0 .type bn_sqr_comba4,#function .size bn_sqr_comba4,(.-bn_sqr_comba4) .align 32
