On Wed, 2021-12-01 at 22:58 +0100, Niels Möller wrote:
> Amitay Isaacs <[email protected]> writes:
> 
> > --- /dev/null
> > +++ b/powerpc64/ecc-secp256r1-redc.asm
> > @@ -0,0 +1,144 @@
> > +C powerpc64/ecc-secp256r1-redc.asm
> > +ifelse(`
> > +   Copyright (C) 2021 Amitay Isaacs & Martin Schwenke, IBM Corporation
> > +
> > +   Based on x86_64/ecc-secp256r1-redc.asm
> 
> Looks good, and the method seems to follow the x86_64 version closely.
> I just checked in a correction and a clarification to the comments in
> the x86_64 version.
> 
> A few comments below.
> 
> > +C Register usage:
> > +
> > +define(`SP', `r1')
> > +
> > +define(`RP', `r4')
> > +define(`XP', `r5')
> > +
> > +define(`F0', `r3')
> > +define(`F1', `r6')
> > +define(`F2', `r7')
> > +define(`F3', `r8')
> > +
> > +define(`U0', `r9')
> > +define(`U1', `r10')
> > +define(`U2', `r11')
> > +define(`U3', `r12')
> > +define(`U4', `r14')
> > +define(`U5', `r15')
> > +define(`U6', `r16')
> > +define(`U7', `r17')
> 
> One could save one register by letting U7 and XP overlap, since XP
> isn't used after loading U7.
> 
> > +       .file "ecc-secp256r1-redc.asm"
> > +
> > +C FOLD(x), sets (F3,F2,F1,F0) <-- [(x << 224) - (x << 192) - (x << 96)] >> 64
> > +define(`FOLD', `
> > +       sldi    F2, $1, 32
> > +       srdi    F3, $1, 32
> > +       li      F0, 0
> > +       li      F1, 0
> > +       subfc   F0, F2, F0
> > +       subfe   F1, F3, F1
> 
> I think the 
> 
>         li      F0, 0
>         li      F1, 0
>         subfc   F0, F2, F0
>         subfe   F1, F3, F1
> 
> could be replaced with 
> 
>         subfic  F0, F2, 0    C "negate with borrow"
>         subfze  F1, F3 
> 
> Whether that is measurably faster, I can't say.

You are quick to find exactly the right instruction.  Yes, it
definitely does the same job with two fewer instructions and gives
about a 1% speedup for the reduction code alone.

> 
> Another option: Since powerpc, like arm, seems to use the proper two's
> complement convention that "borrow" is not carry, maybe we don't need
> to negate into F0 and F1 at all, and could instead change the later
> subtraction, replacing
> 
>         subfc   U1, F0, U1
>         subfe   U2, F1, U2
>         subfe   U3, F2, U3
>         subfe   U0, F3, U0
> 
> with
> 
>         addc    U1, F0, U1
>         adde    U2, F1, U2
>         subfe   U3, F2, U3
>         subfe   U0, F3, U0
> 
> I haven't thought that through, but it does make some sense to me. I
> think the arm code propagates carry through a mix of add and sub
> instructions in some places. Maybe F2 needs to be incremented
> somewhere for this to work, but it would probably still be cheaper.
> If this works, FOLD would turn into something like
> 
>         sldi    F0, $1, 32
>         srdi    F1, $1, 32
>         subfc   F2, $1, F0
>         addme   F3, F1
> 
> (If you want to investigate this later on, that's fine too; I could
> merge the code with the current folding logic.)
> 
> > +       C If carry, we need to add in
> > +       C 2^256 - p = <0xfffffffe, 0xff..ff, 0xffffffff00000000, 1>
> > +       li      F0, 0
> > +       addze   F0, F0
> > +       neg     F2, F0
> > +       sldi    F1, F2, 32
> > +       srdi    F3, F2, 32
> > +       li      U7, -2
> > +       and     F3, F3, U7
> 
> I think the three instructions to set F3 could be replaced with
> 
>         srdi    F3, F2, 33
>         sldi    F3, F3, 1
> 
> Or maybe the and operation is faster than shift?
> 
> Regards,
> /Niels

I will continue to investigate the suggestions you have made.

Amitay.
-- 

There are two times in a man's life when he should not speculate: when
he can't afford it, and when he can. - Mark Twain
_______________________________________________
nettle-bugs mailing list
[email protected]
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
