Thanks Trevor!
> On Jan 19, 2015, at 1:47 PM, Trevor Perrin <[email protected]> wrote:
>
> At Real World Crypto, Patrick Longa discussed the question of choosing
> efficient field primes for extra-strength curves. He suggested such
> primes should be chosen to strike a balance between efficiency in
> full-radix and reduced-radix implementations (i.e. "saturated" and
> "unsaturated" arithmetic):
>
> http://www.realworldcrypto.com/rwc2015/program-2/RWC-2015-Longa.pdf?attredirects=0
>
> 2^379 - 19 was given as an example.
>
> Most of the discussion of extra-strength primes I've seen assumes
> reduced-radix implementations, so this is an interesting twist. His
> argument for considering full-radix was:
>
> 1) Full-radix is more efficient on several platforms ("AMD, "Intel
> Atom, Intel Quark, ARM w/o Neon, microcontrollers"), whereas
> reduced-radix is most advantageous on more advanced processors ("Intel
> desktop/server", "ARM with NEON”).
This makes sense. Some points
* Intel’s desktop/server parts have a really fast MUL and slowish ADC compared
to everything else, so they favor reduced-radix computations which have more
MULs and fewer ADCs.
* Anything running on a vector unit (NEON) won’t have access to ADC at all, so
it will suck on full-radix.
* Medium-to-high-end ARM parts (M4 or higher) have an instruction UMAAL to
accelerate full-radix arithmetic, though it still isn’t as fast as NEON. They
also have an instruction UMLAL which accelerates reduced-radix arithmetic.
* Low-end ARMs have UMLAL but not UMAAL. Some might not even have that. Parts
with UMLAL and not UMAAL will favor reduced-radix.
* Other platforms may be in-between, if they don’t have acceleration for
either. A possible exception is RISC architectures with no carry flag and no
magic accum register (eg RISC-V), which tend to perform terribly on full-radix
code.
* Superscalar cores can multi-issue ADDs for the reduced-radix add/sub routine,
but can’t multi-issue ADC.
* The advantage of full-radix decreases as the number of limbs increases,
because it’s usually n^2 vs (n+1)^2. That’s why for Curve25519 favors
full-radix.
So the message that different processors favor different organizations is
correct, and it sort of correlates with CPU size, but not perfectly.
> 2) Full-radix may be safer and easier to implement, since
> reduced-radix requires "Bound analysis" to prevent inadvertent word
> spilling, thus is "error prone, errors are more difficult to catch”.
Maybe. With full-radix it can be annoying to get the carries right, though
it’s easier if the size is not exactly a multiple of 64 (eg 2^379 instead of
2^384) because then there’s headroom.
> I'm not qualified to assess (1). But if it's true that full-radix
> implementations are faster on more primitive platforms that seems
> significant, since optimizing for these is probably more important
> than achieving maximum speed on advanced processors.
I agree.
> Anyways, I'm curious what other people think of this approach, and of
> 2^379 - 19 in particular.
Seems fine to me. The prime 2^398 - 21 did bug me a little bit for this reason.
> The slides give performance numbers for a "Ted37919" curve based on
> this prime on Sandy Bridge, which I added to the spreadsheet here:
> https://docs.google.com/a/trevp.net/spreadsheet/ccc?key=0Aiexaz_YjIpddFJuWlNZaDBvVTRFSjVYZDdjakxoRkE&usp=sharing#gid=0
>
> <https://docs.google.com/a/trevp.net/spreadsheet/ccc?key=0Aiexaz_YjIpddFJuWlNZaDBvVTRFSjVYZDdjakxoRkE&usp=sharing#gid=0>
NB those numbers do not include point decompression.
Cheers,
— Mike
_______________________________________________
Curves mailing list
[email protected]
https://moderncrypto.org/mailman/listinfo/curves