Could we expect a github? I'd love to see this! Trevor
On Tue, Feb 11, 2014 at 12:31 AM, Mike Hamburg <[email protected]> wrote: > Hello curves, > > I've been working on implementation for the new curves, and I'd like to > report status and some formulas and issues I found. > > I'm aiming for a fairly generic C/intrinsics implementation which should > support any curves with minimal extra effort, but I'm starting with > Ed448-Goldilocks because it's mine. I have Haswell and Sandy Bridge test > machines. I also have a vectorless Cortex A9, but it doesn't work yet > because I'm using 64x64->128-bit multiply intrinsics. Here's what I've > found so far. > > If you have any suggestions on the formulas or algorithms, I'd definitely > appreciate it. > > Field arithmetic: > * Karatsuba is beneficial for Ed448. > * Radix 2^56 in a 64-bit limb, 8 limbs. > * M ~ 153cy on Sandy Bridge, 125cy on Haswell > * square ~ 0.75M > * small fixed mul ~ 0.25M > * add/sub (unreduced) ~ 0.04M, a little cheaper on Haswell because of AVX > > I'm using the 1/sqrt(x) point encoding for now, basically because I already > have code for that from an earlier project. I'm not yet counting the time > to serialize and deserialize field elements, which is maybe 100 cyles at > most (counting the full reduce / checking that input is fully reduced). I'm > not yet counting hashing or RNG times. > > My earlier email about 1/sqrt(x) was slightly off: it encodes even points on > the curve, but odd points on the twist. > > I haven't tried blind+EGCD for inverses or Legendre symbol checks. It might > well be a win. One inverse square root is 56k Sandy cycles (I don't > remember the Haswell number). > > Full Montgomery ladder: > * Decompress. > * Constant-time ladder by 448-bit scalar. The scalar should be even for > security. It actually could be 447 bits. > * Recompress. Reject points on the twist. This is basically free, but > important because they can't be encoded with the 1/sqrt(x) encoding. > > This takes about 571kcy on Haswell, and 688kcy on Sandy, corrected for > TurboBoost. > > I'm using the formula from the thread on efficient laddering with the > isomorphic curve, but twisted. Let (xd,zd) be the point to de doubled, and > (xa,za) be the point to be added. > A = (xd+zd) > B = (xd-zd) > DA = (xa-za)*A > BC = (xa+za)*B > > oxa = (DA+BC)^2 > oza = (DA-BC)^2 * xbase > > AA = A^2 > BB = B^2 > AAod = AA*(1-d) > E = AA-BB > > oxd = AAod*BB > ozd = E*(AAod-E) > > return (oxd,ozd,oxa,oza) > > Except I'm actually using zbase instead of xbase, because of the 1/sqrt(x) > format. > > Twisted Edwards (a=-1) windowed algorithm: > * Assumes that cofactor is canceled somehow. > * Recode scalar in signed form, because it's easy and I'm lazy. > * Compute 8 odd multiples of P. > * Constant-time add/sub chain with a 4-bit window, 448 bits. Could be 446 > bits, except that 446 isn't divisible by 4. > * No compress or decompress. > > This takes slightly less time than the Montgomery ladder, some 530kcy on > Haswell and 636 kcy on Sandy. A 5-bit window makes things maybe 1-2% > faster, but uses extra complexity and memory so I didn't think it was > worthwhile. > > I'm using readdition coordinates: > "Projective half-niels" for the tables, ((y-x)/2 : (y+x)/2 : dxy : 1) * z. > "Lazy extended coordinates" for the accumulator, (x : y : z : t : u) where > xy = tuz. > > I might replace the lazy extended coordinates with Hisil et al's lookahead > extended-or-not coordinates, which use less memory but require more care. > > Full constant-time scalarmul using twisted Edwards: > * Decompress points, rejecting those on the twist. > * Isogenize to the twisted curve, canceling the cofactor. > * Above windowed algorithm. > * Isogenize back to the main curve, effectively multiplying by 4. > * Recompress. > > This takes slightly longer than the Montgomery ladder: something like 633kcy > on Haswell and 750kcy on Sandy. So Edwards or twisted Edwards is best for > points you've already got in projective form, and Montgomery is best for > ECDH. Unsurprising. > > The total executable code size to test and bench the arithmetic and curve > routines is currently around 41k under clang -O4 -fPIC. That'll get bigger > once there are precomputed tables. > > Formulas: > I'm making use of the "inverse square root trick": > > def trick(a,b,i): > # assumes p==3 mod 4; similar trick exists for 1 mod 4 > # returns sqrt(+-a/b), 1/i, is_square(a/b) > # assumes a,b,i are nonzero > ai = a*i > abi = b*ai > s = 1/sqrt(+- abi*i) # using a powering ladder > output sqrt(+-a/b) = s*ai > s2abi = s^2*abi > issquare = s2abi * i # = Legendre symbol > if you care about the result of 1/i when a/b is nonsquare: > output 1/i = s2abi*issquare > else: > output 1/i = s2abi > > You can tweak the trick to change the Legendre symbol of the output > according to some other variable as well; this depends on the residue of p > mod 8. > > The formula I'm using for point compression with Montgomery form is: > > Let P1 + P2 = P3 and (u1,v1) = P1 etc. Then > 4*v1*v2*u3 = (u1*u2-1)^2 - u3^2*(u1-u2)^2 > > To compute the numerator of the RHS, do: > sa = (z2*z1 - x2*x1) * z3 > sb = (x2*z1 - z2*x1) * x3 > numerator = (sa + sb) * (sa - sb) > This is good enough to get the Legendre symbol. It shouldn't be too hard to > convert this into a formula with some other sign bit using the inverse > square root trick. > > This is on an untwisted (B=1) curve, but the same "ought" to be true of > 4*B*v1*v2*u3 on a twisted one. > > To serialize an Edwards point, we have to deal with the fact that the > isomorphic curve you'd get from Wikipedia is twisted, because it sets B = > 4/(1-d) which isn't square, at least when p==3 mod 4. So I'm negating x to > get to the curve: > 4y^2/(d-1) = x^3 + 2(d+1)/(d-1) * x^2 + x > where you can then scale y by sqrt(4/(d-1)) to get the standard curve. > > To deserialize an Edwards point, compute > denominator = (u+1)^2 * (d-1) + 4u > x = 2 sqrt(u/denominator) > y = (1+u)/(1-u) > using the inverse square root trick. This lands you on E_(1,d), because it > scales the x-coordinate to get rid of the twisting that the obvious > decompression would give you. > > You have to check if u=0 or u=1. The latter isn't on the curve, but you > have to make sure it doesn't slip past the check due to the zero divide. > The former works in the 1/sqrt(x) encoding without any checks. > > To do: > I'm planning to use WNAF for variable time scalar mul, WNAF for signature > verification, and a precomputed signed comb for key generation and Schnorr > signing. > > I'm experimenting with the best way to implement Elligator. I currently > only have the map to the curve done, and I might change the signs. My > implementation maps directly to affine using the inverse square root trick. > I'll report the formula once I'm done messing around with it. > > And of course there's API packaging, testing on ARM, etc. > > Cheers, > -- Mike > > > > > _______________________________________________ > Curves mailing list > [email protected] > https://moderncrypto.org/mailman/listinfo/curves > _______________________________________________ Curves mailing list [email protected] https://moderncrypto.org/mailman/listinfo/curves
