Hi again!
>
> And finally.
I slept on it and want you to disregard the following statement of
mine:
> ... It (*) doesn't make any difference to my UltraSPARC-specific
> implementation (as I exploit branches on register condition with
> prediction) ...
> (*) unrolling loops in below way
because it's wrong. Unrolling loops as follows:
> while (num&~3) {
>     mul_add(rp[0],ap[0],w,c1);
>     mul_add(rp[1],ap[1],w,c1);
>     mul_add(rp[2],ap[2],w,c1);
>     mul_add(rp[3],ap[3],w,c1);
>     ap+=4; rp+=4; num-=4;
> }
> while (num) {
>     mul_add(rp[0],ap[0],w,c1);
>     if (--num == 0) break;
>     mul_add(rp[1],ap[1],w,c1);
>     if (--num == 0) break;
>     mul_add(rp[2],ap[2],w,c1);
>     if (--num == 0) break;
>     mul_add(rp[3],ap[3],w,c1);
>     if (--num == 0) break;
> }
would do extra good even on v9. Indeed! Examine the following snippet,
corresponding to 'mul_add(rp[3],ap[3],w,c1); if (--num == 0) break;':
> lduw [%o1+8],%o5
> mulx %o3,%o5,%o5
> lduw [%o0+8],%o4
> add %o5,%g1,%o5
> dec %o2
> add %o4,%o5,%o4
> srlx %o4,32,%g1
> brz,pn %o2,.L_bn_mul_add_words_ret
> stuw %o4,[%o0+8]
Look at the first two lines! A register is loaded and then used in the
very *next* instruction. The way the code is right now, it's not
possible to move lduw "higher" in the loop, because then you risk
causing a SEGV when the very end of the bn coincides with the data
segment edge set by brk(2). Now if I unroll the loop in the above way,
I'll be able to *safely* reschedule lduw (inside the four-way body all
four words are known to lie within the bn) and avoid stalls caused by
the data not yet being available in the register.
Bottom line: expect the version 1.1 implementation after this
weekend :-) And OK, I can cut-n-paste together a v8 version as well if
you want me to...
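
For anyone who wants to try the unrolled loop outside the assembler
module, here is a minimal portable C sketch of it. Several things in it
are assumptions and not taken from this message: the mul_add macro body
(32-bit words with a 64-bit intermediate, mirroring the
lduw/mulx/srlx sequence quoted above), the function names
mul_add_words_unrolled and mul_add_words_simple, and the assert on num.
It only shows the loop structure; the actual win comes from
rescheduling the loads in assembler, which C cannot express directly.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed definition: r = low 32 bits of (w*a + r + c), c = high 32 bits.
 * The sum cannot overflow 64 bits: (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1. */
#define mul_add(r, a, w, c) do {                                   \
        uint64_t t_ = (uint64_t)(w) * (a) + (r) + (c);             \
        (r) = (uint32_t)t_;                                        \
        (c) = (uint32_t)(t_ >> 32);                                \
    } while (0)

/* Hypothetical name. Unrolled as in the quoted loop: inside the first
 * loop all four words are known to exist, so their loads could be
 * issued early without touching memory past the end of the array. */
uint32_t mul_add_words_unrolled(uint32_t *rp, const uint32_t *ap,
                                int num, uint32_t w)
{
    uint32_t c1 = 0;

    assert(num >= 0);               /* precondition added for this sketch */

    while (num & ~3) {
        mul_add(rp[0], ap[0], w, c1);
        mul_add(rp[1], ap[1], w, c1);
        mul_add(rp[2], ap[2], w, c1);
        mul_add(rp[3], ap[3], w, c1);
        ap += 4; rp += 4; num -= 4;
    }
    /* Tail: at most three words remain, each guarded by its own check. */
    while (num) {
        mul_add(rp[0], ap[0], w, c1);
        if (--num == 0) break;
        mul_add(rp[1], ap[1], w, c1);
        if (--num == 0) break;
        mul_add(rp[2], ap[2], w, c1);
        if (--num == 0) break;
        mul_add(rp[3], ap[3], w, c1);
        if (--num == 0) break;
    }
    return c1;
}

/* Hypothetical reference: one word per iteration, for comparison. */
uint32_t mul_add_words_simple(uint32_t *rp, const uint32_t *ap,
                              int num, uint32_t w)
{
    uint32_t c1 = 0;

    while (num-- > 0) {
        mul_add(*rp, *ap, w, c1);
        rp++; ap++;
    }
    return c1;
}
```

Both functions should return the same carry and leave the same words in
rp for any num >= 0, which makes the simple one a handy cross-check for
the unrolled one.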
Cheers. Andy.
______________________________________________________________________
OpenSSL Project http://www.openssl.org
Development Mailing List [EMAIL PROTECTED]
Automated List Manager [EMAIL PROTECTED]