I just checked in a (still experimental!) mod to the Goldilocks code which 
enables p521.  Compile with:

make clean all test bench FIELD=p521 CC=clang XCFLAGS=-DGOLDI_FIELD_BITS=521 ARCH=arch_x86_64_r12

Probably requires AVX2.
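
For anyone unsure whether their machine qualifies, here is a minimal sketch of an AVX2 check. It assumes Linux on x86, where /proc/cpuinfo has a "flags" line; the function names are made up for illustration and are not part of the Goldilocks code.

```python
def has_cpu_flag(cpuinfo_text, flag="avx2"):
    """Return True if `flag` is listed on a 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags" and flag in value.split():
            return True
    return False


def host_has_avx2(path="/proc/cpuinfo"):
    """Check the running host; returns None if the file cannot be read."""
    try:
        with open(path) as f:
            return has_cpu_flag(f.read())
    except OSError:
        return None  # not Linux, or /proc is unavailable


if __name__ == "__main__":
    print("AVX2 available:", host_has_avx2())
```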

Current Haswell cycle count:
        keygen: 270kcy
        ecdh: 803kcy
        sign: 283kcy
        verif: 907kcy

This is with the same high-level algorithms as the other Goldilocks benchmarks: 
constant-time, no variable lookups, compressed points, includes the hashing.  
HT and turbo off.
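
Why turbo matters: as noted in the quoted thread below, a count taken with Turbo Boost on understates the real number, because the timestamp counter ticks at the nominal clock while the core may run faster. A rough sketch of the correction follows. It assumes the core sat at the maximum turbo frequency for the whole measurement, which is a worst case, and the function names are only illustrative.

```python
def turbo_scale(base_ghz, turbo_ghz):
    """Ratio of real core cycles to counted cycles when the counter
    runs at base_ghz but the core runs at turbo_ghz (worst case)."""
    return turbo_ghz / base_ghz


def corrected_cycles(counted, base_ghz, turbo_ghz):
    """Scale a cycle count taken at the nominal clock up to core cycles."""
    return counted * turbo_scale(base_ghz, turbo_ghz)


if __name__ == "__main__":
    # Core i7 4770 figures from the thread: 3.4 GHz nominal, 3.9 GHz turbo.
    print("scale factor: %.3f" % turbo_scale(3.4, 3.9))
    print("1000 counted cycles ~ %.0f core cycles"
          % corrected_cycles(1000, 3.4, 3.9))
```

With those two frequencies the factor is about 1.147, so an uncorrected count can be low by nearly 15%.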

So E-521 can be almost as fast as Granger and Scott said (or maybe faster, who 
knows), but it’s tricky to get there.

Cheers,
— Mike



> On Oct 23, 2014, at 6:38 PM, Michael Hamburg <[email protected]> wrote:
> 
>> 
>> On Oct 23, 2014, at 10:05 AM, Trevor Perrin <[email protected]> wrote:
>> 
>> On Thu, Oct 23, 2014 at 5:04 AM, Samuel Neves <[email protected]> wrote:
>>> 
>>> The Haswell cycle counts mentioned in the paper do not take Turbo Boost 
>>> into account, and therefore are lower than the
>>> real number; taking into account that the Core i7 4770 chip was used (3.4 
>>> to 3.9 GHz overclocking), the Haswell cycle
>>> count should be ~893000.  I have been able to get this slightly down to 
>>> ~884000.
>>> 
>>> On Sandy Bridge, I get somewhat better timings than reported by DJB: 
>>> ~1030000 cycles.
>> 
>> Thanks!  Updated [1].
>> 
>> By that scoring, Mike's Goldilocks implementation retains the
>> "relative efficiency" crown.  But the E-521 numbers are without ASM
>> optimization.  And their 9 limbs / 58-bit radix seems impressive
>> (Goldilocks uses 8 limbs / 56-bit radix).
>> 
>> So this seems pretty close, I wonder what a better-optimized 521 could do...
>> 
>> 
>> Trevor
>> 
>> 
>> [1] 
>> https://docs.google.com/a/trevp.net/spreadsheet/ccc?key=0Aiexaz_YjIpddFJuWlNZaDBvVTRFSjVYZDdjakxoRkE&usp=sharing#gid=0
> 
> The Goldilocks code is almost ready to support E-521.  As a warmup non-Ed448 
> curve, I took preliminary benchmarks for Ed480-Ridinghood.  From one 
> benchmark run (not SUPERCOP, etc):
>        Goldilocks: 178kcy keygen, 536kcy ecdh
>        Ridinghood: 193kcy keygen, 617kcy ecdh
> Difference = +8%, +15%.
> 
> The +15% reflects some sections which aren’t optimized yet, along the lines 
> of if (EDWARDS_D > 0) { do something slow; } or if (Mike hasn’t calculated 
> the carry handling limits yet) { reduce just to be safe; }
> 
> I also have a 521-bit multiplier which takes 145 Haswell cycles in 
> preliminary benchmarks.  Like Granger-Scott, it uses 9 limbs of 58 bits each.
> It’s still using 3-way Chung-Hasan, so it does more multiplies and fewer 
> adds than the Granger-Scott technique.  Its speed advantage, if it actually 
> has one, is probably from tighter tuning.  But if that’s accurate it might be 
> comparably fast to what Granger and Scott quoted (but measured properly, with 
> TurboBoost off).
> 
> — Mike
> _______________________________________________
> Curves mailing list
> [email protected]
> https://moderncrypto.org/mailman/listinfo/curves
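
The limb counts quoted above work out as follows: 9 limbs of 58 bits cover 522 bits, one more than p = 2^521 - 1 needs, while 8 limbs of 56 bits cover exactly 448. Since 2^522 = 2 * 2^521 is congruent to 2 mod p, any partial product that lands past the ninth limb wraps back around with a factor of 2. Below is a minimal sketch of that representation. It uses plain schoolbook multiplication, not the 3-way Chung-Hasan or Granger-Scott techniques discussed, and Python big integers rather than 64-bit words. The names are made up for illustration and do not come from the Goldilocks code.

```python
P521 = (1 << 521) - 1
LIMBS, RADIX_BITS = 9, 58          # 9 * 58 = 522 bits
MASK = (1 << RADIX_BITS) - 1


def to_limbs(x):
    """Split 0 <= x < 2^522 into 9 little-endian limbs of 58 bits."""
    return [(x >> (RADIX_BITS * i)) & MASK for i in range(LIMBS)]


def from_limbs(limbs):
    """Recombine limbs (carried or not) into a single integer."""
    return sum(l << (RADIX_BITS * i) for i, l in enumerate(limbs))


def mul_limbs(a, b):
    """Schoolbook product mod 2^521 - 1; output limbs are left uncarried."""
    c = [0] * LIMBS
    for i in range(LIMBS):
        for j in range(LIMBS):
            k = i + j
            if k < LIMBS:
                c[k] += a[i] * b[j]
            else:
                # 2^(58*k) = 2^522 * 2^(58*(k-9)), and 2^522 = 2 (mod p)
                c[k - LIMBS] += 2 * a[i] * b[j]
    return c


if __name__ == "__main__":
    x, y = (1 << 520) + 12345, P521 - 2
    z = from_limbs(mul_limbs(to_limbs(x), to_limbs(y))) % P521
    print("matches big-integer product:", z == (x * y) % P521)
```

A real implementation also has to bound and propagate carries so the accumulators fit in machine words, which is where most of the tuning effort goes.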
