On Wed, 2021-12-01 at 22:58 +0100, Niels Möller wrote: > Amitay Isaacs <[email protected]> writes: > > > --- /dev/null > > +++ b/powerpc64/ecc-secp256r1-redc.asm > > @@ -0,0 +1,144 @@ > > +C powerpc64/ecc-secp256r1-redc.asm > > +ifelse(` > > + Copyright (C) 2021 Amitay Isaacs & Martin Schwenke, IBM > > Corporation > > + > > + Based on x86_64/ecc-secp256r1-redc.asm > > Looks good, and it seems method follows the x86_64 version closely. I > just checked in a correction and a clarification to the comments to > the > x86_64 version. > > A few comments below. > > > +C Register usage: > > + > > +define(`SP', `r1') > > + > > +define(`RP', `r4') > > +define(`XP', `r5') > > + > > +define(`F0', `r3') > > +define(`F1', `r6') > > +define(`F2', `r7') > > +define(`F3', `r8') > > + > > +define(`U0', `r9') > > +define(`U1', `r10') > > +define(`U2', `r11') > > +define(`U3', `r12') > > +define(`U4', `r14') > > +define(`U5', `r15') > > +define(`U6', `r16') > > +define(`U7', `r17') > > One could save one register by letting U7 and XP overlap, since XP > isn't > used after loading U7. > > > + .file "ecc-secp256r1-redc.asm" > > + > > +C FOLD(x), sets (F3,F2,F1,F0) <-- [(x << 224) - (x << 192) - (x > > << 96)] >> 64 > > +define(`FOLD', ` > > + sldi F2, $1, 32 > > + srdi F3, $1, 32 > > + li F0, 0 > > + li F1, 0 > > + subfc F0, F2, F0 > > + subfe F1, F3, F1 > > I think the > > li F0, 0 > li F1, 0 > subfc F0, F2, F0 > subfe F1, F3, F1 > > could be replaced with > > subfic F0, F2, 0 C "negate with borrow" > subfze F1, F3 > > If that is measurably faster, I can't say.
You are quick to find the exactly fitting instruction. Yes, it definitely does the same job with two less instructions and gives about 1% speedup for only reduction code. > > Another option: Since powerpc, like arm, seems to use the proper > two's > complement convention that "borrow" is not carry, maybe we don't need > to > negate to F0 and F1 at all, and instead change the later subtraction, > replacing > > subfc U1, F0, U1 > subfe U2, F1, U2 > subfe U3, F2, U3 > subfe U0, F3, U0 > > with > > addc U1, F0, U1 > adde U2, F1, U2 > subfe U3, F2, U3 > subfe U0, F3, U0 > > I haven't thought that through, but it does make some sense to me. I > think the arm code propagates carry through a mix of add and sub > instructions in a some places. Maybe F2 needs to be incremented > somewhere for this to work, but probably still cheaper. If this > works, > FOLD would turn into something like > > sldi F0, $1, 32 > srdi F1, $1, 32 > subfc F2, $1, F0 > addme F3, F1 > > (If you want to investigate this later on, that's fine too, I could > merge > the code with the current folding logic). > > > + C If carry, we need to add in > > + C 2^256 - p = <0xfffffffe, 0xff..ff, 0xffffffff00000000, 1> > > + li F0, 0 > > + addze F0, F0 > > + neg F2, F0 > > + sldi F1, F2, 32 > > + srdi F3, F2, 32 > > + li U7, -2 > > + and F3, F3, U7 > > I think the three instructions to set F3 could be replaced with > > srdi F3, F2, 31 > sldi F3, F3, 1 > > Or maybe the and operation is faster than shift? > > Regards, > /Niels I will continue to investigate the suggestions you have made. Amitay. -- There are two times in a man's life when he should not speculate: when he can't afford it, and when he can. - Mark Twain _______________________________________________ nettle-bugs mailing list [email protected] http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
