> On Jan 20, 2015, at 12:42 PM, Trevor Perrin <[email protected]> wrote:
> 
> On Mon, Jan 19, 2015 at 8:09 PM, Mike Hamburg <[email protected]> wrote:
>> 
>> The point of the Goldilocks prime is that Karatsuba is a win at that size,
>> and Karatsuba is a pain on full-radix.  So even though Ed448 is suitably
>> aligned for a full-radix implementation on 32 bits, that would take 14^2 =
>> 196 muls vs 16^2 * 3/4 = 192 for Karatsuba.  The extra adds for Karatsuba
>> cost less than the carry handling for full-radix unless you have some sort
>> of accelerator, and I don't think UMAAL is enough of an accelerator to make
>> up that difference. 64-bit is similar except the coefficients aren't
>> aligned.  So it's almost always a win to use Karatsuba and reduced-radix.
> 
> Thanks for the analysis.  It seems the crux of your argument is that
> Karatsuba (and other fancy multiplications) favors reduced-radix on
> many of the platforms where Longa was arguing for full-radix?

Well that, and also that the advantage of full-radix diminishes as the primes 
get larger.  So supporting full-radix is important for ~256-bit, and this 
strategy is worth trying even on Intel, but for larger primes it becomes less 
important. It’s still worth considering in design, though.

> If you have time it might be educational to explain why Karatsuba
> works better for reduced-radix.

You have to add the inputs to Karatsuba.  If you’re working with a full-radix 
number there will be carries.  In a straightforward, completely full-radix case 
the sum is necessarily longer by one word, which negates much of the advantage.

You could instead break eg a 380-bit number into two 190-bit halves, and while 
you’d need a carry chain to add those halves you could at least avoid overflow. 
 But then you'd still have to be careful with sign extensions when computing 
the product, because the intermediate subtractions in Karatsuba can borrow.  On 
the other hand, in the usual reduced-radix implementation the subtractions 
can’t borrow.  This is because if you don’t carry between digits, the operation 
is equivalent to one on polynomials, and when multiplying two polynomials with 
positive digits the result has positive digits.  This is still true if you 
include the reduction step, so long as your prime has all negative coefficients 
except for the leading one (eg, 2^big-small, or 2^448-2^224-1).

>> This analysis doesn't support Longa's point, since it shows cases where
>> reduced radix is very fast, even faster than full-radix on a
>> comparable-sized prime on a machine where that's fast.
> 
> But if a 448-ish prime was chosen to take into account full-radix
> (maybe not as close to a power of 2 as Goldilocks?), that might change
> this analysis?

No, I would expect it to outperform eg 2^444-17 (from Ben Harris’ Pareto list) 
almost everywhere.

> Even if Goldilocks has similar performance to other 448-bit primes on
> platforms that favor full-radix, that's a weakening of the Goldilocks
> argument that it's getting "extra" security margin over 384 very
> cheaply: it would be getting this "extra" margin, on these platforms,
> at full price.


Could be.  A back-of-the-envelope calculation suggests that 2^379-19 could 
match Goldilocks in efficiency (n^lg3-wise) on Tegra2, but the implementation 
strategies are very different so you’d need to test it.  It’s generally true 
that Goldilocks is especially good on platforms that already have fast 
reduced-radix arithmetic, like Haswell and ARM NEON.  Maybe if MSR would submit 
Ted37919 or Ed384-mers to SUPERCOP, we could find out how they compare on 
platforms which don’t favor reduced-radix like Atom and K10 and Bulldozer; and 
which disfavor it like Tegra2.

Overall, Goldilocks may not be most efficient prime on any one platform (except 
maybe NEON), but it has good performance across many platforms, and I have put 
in quite a bit of work to demonstrate this.  I appreciate Longa et al’s 
apparent point (I say “apparent” because I’m not them and I wasn’t at RWC) that 
primes should be chosen with this property.  Their proposed 2^379-19 may work 
well across platforms, but this needs to be tested.

Cheers,
— Mike

_______________________________________________
Curves mailing list
[email protected]
https://moderncrypto.org/mailman/listinfo/curves

Reply via email to