[email protected] (Niels Möller) writes:
> If this works,
> FOLD would turn into something like
>
> sldi F0, $1, 32
> srdi F1, $1, 32
> subfc F2, $1, F0
> addme F3, F1
I'm looking at a different approach (experimenting on ARM64, which is
quite similar to powerpc, but I don't yet have working code). To
understand what the redc code is doing we need to keep in mind that what
one folding step does is to compute
<U4,U3,U2,U1,U0> + U0*p
which cancels the low limb, since p = -1 (mod 2^64). So since the low
limb always cancels, what we need is
<U4,U3,U2,U1> + U0*((p+1)/2^64)
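As a sanity check of the two expressions above, here is a small Python
sketch. It assumes p is the secp256r1 prime 2^256 - 2^224 + 2^192 + 2^96 - 1
(inferred from the context, not stated explicitly), and the names fold_full
and fold_short are made up for illustration only.

```python
# Assumption: p is the secp256r1 prime; B is the limb base 2^64.
P = 2**256 - 2**224 + 2**192 + 2**96 - 1
B = 2**64

def fold_full(u):
    """<U4,...,U0> + U0*p, kept at full width; the low limb becomes zero."""
    u0 = u % B
    return u + u0 * P

def fold_short(u):
    """<U4,U3,U2,U1> + U0*((p+1)/2^64), i.e. the same value with the
    zero low limb already dropped."""
    u0 = u % B
    return (u >> 64) + u0 * ((P + 1) // B)

# Arbitrary 320-bit example input (five 64-bit limbs).
u = 0x0123456789abcdef_fedcba9876543210_0f1e2d3c4b5a6978_8796a5b4c3d2e1f0_1122334455667788
assert fold_full(u) % B == 0                 # low limb cancels
assert fold_full(u) == fold_short(u) * B     # both forms agree
```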
The x86_64 code does this by splitting U0*p into 2^{256} U0 - (2^{256} -
p) * U0, subtracting in the folding step, and adding in the high part
later. But one doesn't have to do it that way. One could instead use a
FOLD macro that computes
(2^{192} - 2^{160} + 2^{128} + 2^{32}) U0
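A rough Python model of such a FOLD, again assuming p is the secp256r1
prime (which is what the multiplier above corresponds to). It only checks
the arithmetic with big integers; it says nothing about register
allocation or carry handling, and the names M, fold and redc are
hypothetical.

```python
# Assumption: p is the secp256r1 prime; B is the limb base 2^64.
P = 2**256 - 2**224 + 2**192 + 2**96 - 1
B = 2**64

# The multiplier (p+1)/2^64, written as a sum of shifts.
M = 2**192 - 2**160 + 2**128 + 2**32

def fold(u):
    """One folding step: drop the low limb U0 and add U0*M, built from
    shifts of U0 only (no multiplication needed)."""
    u0 = u % B
    return (u >> 64) + (u0 << 192) - (u0 << 160) + (u0 << 128) + (u0 << 32)

def redc(u):
    """Four folding steps, then a final reduction; gives u * 2^-256 mod p."""
    for _ in range(4):
        u = fold(u)
    return u % P

assert M * B == P + 1    # the multiplier really is (p+1)/2^64
```

Each fold divides by 2^64 modulo p, so four of them give the usual
Montgomery reduction by 2^256; the final "% P" stands in for whatever
conditional subtraction the real code would do.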
I also wonder if there's some way to use carry out from one fold step
and apply it at the right place while preparing the F0,F1,F2,F3 for the next
step.
Regards,
/Niels
--
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.
_______________________________________________
nettle-bugs mailing list
[email protected]
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs