I just checked in a (still experimental!) mod to the Goldilocks code which
enables p521. Compile with:
    make clean all test bench FIELD=p521 CC=clang XCFLAGS=-DGOLDI_FIELD_BITS=521 ARCH=arch_x86_64_r12
Probably requires AVX2.
Current Haswell cycle count:
keygen: 270kcy
ecdh: 803kcy
sign: 283kcy
verif: 907kcy
This is with the same high-level algorithms as the other Goldilocks benchmarks:
constant-time, no variable lookups, compressed points, and hashing included.
Hyper-Threading and Turbo Boost were off.
So E-521 can be almost as fast as Granger and Scott said (or maybe faster, who
knows), but it’s tricky to get there.
Cheers,
— Mike
> On Oct 23, 2014, at 6:38 PM, Michael Hamburg wrote:
>
>>
>> On Oct 23, 2014, at 10:05 AM, Trevor Perrin wrote:
>>
>> On Thu, Oct 23, 2014 at 5:04 AM, Samuel Neves wrote:
>>>
>>> The Haswell cycle counts mentioned in the paper do not take Turbo Boost
>>> into account, and therefore are lower than the
>>> real number; taking into account that the Core i7 4770 chip was used (3.4
>>> to 3.9 GHz overclocking), the Haswell cycle
>>> count should be ~893000. I have been able to get this slightly down to
>>> ~884000.
>>>
>>> On Sandy Bridge, I get somewhat better timings than reported by DJB:
>>> ~1030000 cycles.
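The Turbo Boost correction described above can be sketched as a simple rescaling. This is a minimal sketch, assuming the published count came from a counter ticking at the 3.4 GHz base clock while the core ran at 3.9 GHz for the whole measurement. The function names are hypothetical, and the back-computed "implied" figure is only what that assumption yields, not a number taken from the paper.

```python
BASE_GHZ = 3.4   # nominal clock of the Core i7 4770, as stated in the thread
TURBO_GHZ = 3.9  # Turbo Boost clock, as stated in the thread

def turbo_corrected(reported_cycles, base_ghz=BASE_GHZ, turbo_ghz=TURBO_GHZ):
    """Convert a count taken at the base rate into actual core cycles.

    Assumes the core sat at the turbo clock for the entire run.
    """
    return reported_cycles * turbo_ghz / base_ghz

def implied_reported(real_cycles, base_ghz=BASE_GHZ, turbo_ghz=TURBO_GHZ):
    """Inverse of turbo_corrected: the count a base-rate counter would show."""
    return real_cycles * base_ghz / turbo_ghz

scale = TURBO_GHZ / BASE_GHZ          # about 1.147, i.e. roughly 15% undercount
implied = implied_reported(893_000)   # what ~893000 real cycles would look like

print(f"scale factor: {scale:.3f}")
print(f"implied uncorrected count: {implied:.0f}")
```

If the clock only boosted for part of the run, the true factor would fall somewhere between 1.0 and the value above.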
>>
>> Thanks! Updated [1].
>>
>> By that scoring, Mike's Goldilocks implementation retains the
>> "relative efficiency" crown. But the E-521 numbers are without ASM
>> optimization. And their 9 limbs / 58-bit radix seems impressive
>> (Goldilocks uses 8 limbs / 56-bit radix).
>>
>> So this seems pretty close, I wonder what a better-optimized 521 could do...
>>
>>
>> Trevor
>>
>>
>> [1]
>> https://docs.google.com/a/trevp.net/spreadsheet/ccc?key=0Aiexaz_YjIpddFJuWlNZaDBvVTRFSjVYZDdjakxoRkE&usp=sharing#gid=0
>
> The Goldilocks code is almost ready to support E-521. As a warmup non-Ed448
> curve, I took preliminary benchmarks for Ed480-Ridinghood. From one
> benchmark run (not SUPERCOP, etc):
> Goldilocks: 178kcy keygen, 536kcy ecdh
> Ridinghood: 193kcy keygen, 617kcy ecdh
> Difference = +8%, +15%.
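As a quick check of the percentages above, here is a short sketch that recomputes them from the four quoted cycle counts. The helper name `percent_slower` is made up for illustration; the inputs are the kcy figures from the single benchmark run quoted above.

```python
def percent_slower(baseline_kcy, other_kcy):
    """Relative overhead of `other` over `baseline`, in percent."""
    return 100.0 * (other_kcy - baseline_kcy) / baseline_kcy

# Cycle counts in kilocycles, copied from the message above.
goldilocks = {"keygen": 178, "ecdh": 536}
ridinghood = {"keygen": 193, "ecdh": 617}

overhead = {op: percent_slower(goldilocks[op], ridinghood[op]) for op in goldilocks}

for op, pct in overhead.items():
    print(f"{op}: +{pct:.1f}%")
# prints:
# keygen: +8.4%
# ecdh: +15.1%
```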
>
> The +15% reflects some sections which aren’t optimized yet, along the lines
> of if (EDWARDS_D > 0) { do something slow; } or if (Mike hasn’t calculated
> the carry handling limits yet) { reduce just to be safe; }
>
> I also have a 521-bit multiplier which takes 145 Haswell cycles in
> preliminary benchmarks. Like Granger-Scott, it uses 9 limbs of 58 bits each.
> It’s still using 3-way Chung-Hasan, so it does more multiplies and fewer
> adds than the Granger-Scott technique. Its speed advantage, if it actually
> has one, is probably from tighter tuning. But if that’s accurate it might be
> comparably fast to what Granger and Scott quoted (but measured properly, with
> TurboBoost off).
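To illustrate the 9-limb, 58-bit representation mentioned above, here is a minimal sketch of multiplication modulo p = 2^521 - 1 in that radix. It uses plain schoolbook multiplication with Python big integers, not the 3-way Chung-Hasan or Granger-Scott techniques, and it skips carry propagation entirely. All function names are hypothetical and this is not the Goldilocks code.

```python
P = 2**521 - 1
LIMBS = 9
RADIX_BITS = 58
MASK = (1 << RADIX_BITS) - 1

def to_limbs(x):
    """Split 0 <= x < 2^522 into 9 limbs of 58 bits, least significant first."""
    return [(x >> (RADIX_BITS * i)) & MASK for i in range(LIMBS)]

def from_limbs(limbs):
    """Recombine limbs (possibly wider than 58 bits) into one integer."""
    return sum(limb << (RADIX_BITS * i) for i, limb in enumerate(limbs))

def mul_mod_p(a, b):
    """Schoolbook product of two limb vectors, folded back into 9 limbs.

    Limbs of the result are left uncarried, so they exceed 58 bits.
    """
    out = [0] * LIMBS
    for i in range(LIMBS):
        for j in range(LIMBS):
            k = i + j
            term = a[i] * b[j]
            if k >= LIMBS:
                # 2^(58*9) = 2^522 = 2 * 2^521, which is congruent to 2 mod P,
                # so a product that overflows the top limb wraps with weight 2.
                k -= LIMBS
                term *= 2
            out[k] += term
    return out

x = pow(3, 1000, P)
y = pow(5, 1000, P)
product = from_limbs(mul_mod_p(to_limbs(x), to_limbs(y))) % P
ok = product == (x * y) % P
print("limb product matches big-integer product:", ok)
```

The factor of 2 on wrapped terms comes from 9 x 58 = 522 being one bit more than 521. A real 64-bit implementation would also have to bound the accumulated limb sizes, which is the carry-handling analysis referred to earlier in the message.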
>
> — Mike
> _______________________________________________
> Curves mailing list
> [email protected]
> https://moderncrypto.org/mailman/listinfo/curves