Re: Fast constant-time gcd computation and modular inversion
ni...@lysator.liu.se (Niels Möller) writes:

> I've now implemented inverse too.

And now I've tried a different variant, using redc for the cofactor updates. Cofactors are now canonically reduced (which seems unexpectedly cheap). In the case that m is not normalized, so that 2m fits in n limbs, one could use a more relaxed range, 0 <= u, v < 2m, and get rid of a conditional adjustment when doing redc in the loop.

Using redc is a rather good fit, because in each iteration we shift f, g *almost* a limb (62 bits on a 64-bit machine), except for the last iteration where we do fewer bits, while cofactors are multiplied by 2^{-64} mod m in each iteration. So there's a discrepancy of a few bits per iteration, which one can handle using very few extra redc calls outside of the loop (it's O(n^2) work, but with a pretty small constant).

Speed is rather close to the previous version, but the only needed precomputation is binvert_limb for the redc. And the calls to mpn_sec_mul and mpn_sec_div_r are gone. I also added benchmarking of the existing mpn_sec_invert, for comparison. This is how it looks now:

invert size 1, ref 0.327 (us), old 2.771 (us), djb 1.087 (us)
invert size 2, ref 0.757 (us), old 5.366 (us), djb 1.571 (us)
invert size 3, ref 1.082 (us), old 8.110 (us), djb 2.336 (us)
[...]
invert size 17, ref 5.106 (us), old 163.962 (us), djb 18.082 (us)
invert size 18, ref 5.189 (us), old 170.565 (us), djb 18.608 (us)
invert size 19, ref 5.715 (us), old 185.143 (us), djb 20.794 (us)

One possible optimization would be to keep cofactors in signed (two's complement) representation for the first few iterations, when they are still small and don't need any reduction. And the cofactor update for the very first iteration could be much more efficient.
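The redc step discussed above can be illustrated on a single limb. This is a hedged sketch, not the GMP code: `binvert64` and `redc64` are illustrative stand-ins for `binvert_limb` and the mpn redc machinery. Each redc multiplies by 2^{-64} mod m, which is exactly the factor the cofactors pick up per iteration, and the final conditional subtraction is the adjustment that the relaxed range 0 <= u, v < 2m would avoid.

```c
#include <assert.h>
#include <stdint.h>

/* For odd m, compute m^{-1} mod 2^64 by Newton/Hensel iteration;
   each step doubles the number of correct low bits. */
static uint64_t
binvert64 (uint64_t m)
{
  uint64_t inv = m;             /* correct mod 2^3 for odd m */
  int i;
  for (i = 0; i < 5; i++)       /* 3 -> 6 -> 12 -> 24 -> 48 -> 96 bits */
    inv *= 2 - m * inv;
  return inv;
}

/* Given t < m * 2^64 and minv = -m^{-1} mod 2^64, return t * 2^{-64} mod m.
   q is chosen so that t + q*m is divisible by 2^64; the result before the
   conditional subtraction is < 2m, so one adjustment canonicalizes it. */
static uint64_t
redc64 (unsigned __int128 t, uint64_t m, uint64_t minv)
{
  uint64_t q = (uint64_t) t * minv;    /* q = t * (-m^{-1}) mod 2^64 */
  uint64_t r = (uint64_t) ((t + (unsigned __int128) q * m) >> 64);
  return r >= m ? r - m : r;           /* canonical reduction */
}
```

For x < m, feeding in t = x * 2^64 returns x again, since redc divides the Montgomery factor back out.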
Regards, /Niels

/* Side channel silent gcd (work in progress)

   Copyright 2022 Niels Möller

   This is free software; you can redistribute it and/or modify it
   under the terms of either:

     * the GNU Lesser General Public License as published by the Free
       Software Foundation; either version 3 of the License, or (at your
       option) any later version.

   or

     * the GNU General Public License as published by the Free Software
       Foundation; either version 2 of the License, or (at your option)
       any later version.

   or both in parallel, as here.

   The GNU MP Library is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   General Public License for more details.

   You should have received copies of the GNU General Public License and
   the GNU Lesser General Public License along with the GNU MP Library.
   If not, see https://www.gnu.org/licenses/. */

#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <gmp.h>

#if GMP_NUMB_BITS < GMP_LIMB_BITS
#error Nails not supported.
#endif

#define BITCNT_BITS (sizeof(mp_bitcnt_t) * CHAR_BIT)
#define STEPS (GMP_LIMB_BITS - 2)
#define MAX(a,b) ((a) >= (b) ? (a) : (b))

struct matrix {
  /* Matrix elements interpreted as signed two's complement. Absolute
     value of elements is at most 2^STEPS. */
  mp_limb_t a[2][2];
};

/* Conditionally set (a, b) <-- (b, -a) */
#define CND_NEG_SWAP_LIMB(cnd, a, b) do {	\
    mp_limb_t __cnd_sum = (a) + (b);		\
    mp_limb_t __cnd_diff = (a) - (b);		\
    (a) -= __cnd_diff & -(cnd);			\
    (b) -= __cnd_sum & -(cnd);			\
  } while (0)

/* Perform k elementary reduction steps on (delta, f, g). Only the least
   significant bits of f, g matter. Note that mp_bitcnt_t is an unsigned
   type, but is used as two's complement. */
static mp_bitcnt_t
steps (struct matrix *M, unsigned k, mp_bitcnt_t delta, mp_limb_t f, mp_limb_t g)
{
  mp_limb_t a00, a01, a10, a11;
  assert (f & 1);

  /* Identity matrix.
*/
  a00 = a11 = 1;
  a01 = a10 = 0;

  /* Preserve invariant (f ; g) = 2^{-i} M (f_orig; g_orig) */
  for (; k-- > 0; delta++)
    {
      mp_limb_t odd = g & 1;
      mp_limb_t swap = odd & ~(delta >> (BITCNT_BITS-1));

      /* Swap; (f'; g') = (g; -f) = (0,1;-1,0) (f;g) */
      CND_NEG_SWAP_LIMB(swap, f, g);
      CND_NEG_SWAP_LIMB(swap, a00, a10);
      CND_NEG_SWAP_LIMB(swap, a01, a11);

      /* Conditional negation. */
      delta = (delta ^ - (mp_bitcnt_t) swap) + swap;

      /* Cancel low bit and shift. */
      g += f & -odd;
      a10 += a00 & -odd;
      a11 += a01 & -odd;

      g >>= 1;
      a00 <<= 1;
      a01 <<= 1;
    }
  M->a[0][0] = a00; M->a[0][1] = a01;
  M->a[1][0] = a10; M->a[1][1] = a11;

  return delta;
}

/* Set R = (u * F + v * G), treating all numbers as two's complement.
   No overlap allowed. */
static mp_limb_t
add_add_mul (mp_ptr rp, const mp_limb_t *fp, const mp_limb_t *gp,
             mp_size_t n, mp_limb_t u, mp_limb_t v)
{
  mp_limb_t f_sign = fp[n-1] >> (GMP_LIMB_BITS - 1);
  mp_limb_t g_sign = gp[n-1] >> (GMP_LIMB_BITS - 1);
  mp_limb_t u_sign = u >> (GMP_LIMB_BITS - 1);
  mp_limb_t v_sign = v >> (GMP_LIMB_BITS - 1);
  mp_limb_t hi = mpn_mul_1
Re: Fast constant-time gcd computation and modular inversion
Marco Bodrato writes:

> We should start writing mpn_sec_binvert :-)

I think mpn_binvert is almost sec_ naturally. The exception is when sbpi1_bdiv_q.c or dbpi1_bdiv_q.c are invoked. The former has some < on data (for carry computations) and the latter has a mpn_incr_u which is very leaky.

-- 
Torbjörn
Please encrypt, key id 0xC8601622

___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: Fast constant-time gcd computation and modular inversion
On 2022-09-01 17:04, Torbjörn Granlund wrote:

>> /* FIXME: Using mpz_invert is cheating. Instead, first compute
>>    m' = m^-1 (mod 2^k) via Newton/Hensel. We can then get the inverse
>>    via 2^{-k} (mod m) = ((2^k - m') * m + 1)/2^k. */
>> mpz_invert (t, t, m);
>> mpn_copyi (info->ip, mpz_limbs_read (t), mpz_size (t));
>
> You might want to use mpn_binvert here.

We should start writing mpn_sec_binvert :-)

Or use the current mpn_sec_invert for the precomputation.

Ĝis, m
Re: Fast constant-time gcd computation and modular inversion
> /* FIXME: Using mpz_invert is cheating. Instead, first compute
>    m' = m^-1 (mod 2^k) via Newton/Hensel. We can then get the inverse
>    via 2^{-k} (mod m) = ((2^k - m') * m + 1)/2^k. */
> mpz_invert (t, t, m);
> mpn_copyi (info->ip, mpz_limbs_read (t), mpz_size (t));

You might want to use mpn_binvert here.

-- 
Torbjörn
Please encrypt, key id 0xC8601622
Re: Fast constant-time gcd computation and modular inversion
Torbjörn Granlund writes:

> Why do you use sec_invert when inverting mod the group order when that
> is of prime order? (Yes, this question will become moot I suppose with
> this new algorithm.)

No good reason, it's just that I implemented inverse-by-powering (with a hand-tuned addition chain) as a side effect of implementing square root, since in some cases they can share much of the addition chain, and that work touched field prime arithmetic only.

Sorry we're getting a bit off topic, we should take nettle discussion elsewhere.

Regards, /Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.
Re: Fast constant-time gcd computation and modular inversion
> Much more unclear to me how close it might be to the typical or average
> number of iterations needed.

That's perhaps not very interesting, as early exit is not an option here. (Unless this algorithm would beat plain, "leaky" inverse.)

> Currently uses exponentiation for the field inverse, and nettle's
> version of sec_invert for inverses mod the group order, as needed by
> ecdsa. Except for the somewhat obscure gost curves (Russian standards),
> where sec_invert is used for all inverses, and A/B for gost-gc512a is
> almost 30%.

Why do you use sec_invert when inverting mod the group order when that is of prime order? (Yes, this question will become moot I suppose with this new algorithm.)

-- 
Torbjörn
Please encrypt, key id 0xC8601622
Re: Fast constant-time gcd computation and modular inversion
Torbjörn Granlund writes:

> That count is a proven "upper bound" of the number of iterations. Does
> the paper discuss how tight it is, i.e., is it close to a "lower bound"
> of the required number of iterations?

As an upper bound, I think it's rather tight at least for smallish sizes, since the proof includes an exhaustive search up to some small limit on bit size, and the results from that guide the approach for the general proof. (It's rather complex, I don't get all the details.)

Much more unclear to me how close it might be to the typical or average number of iterations needed.

> Comparing to the existing side-channel-silent inversion code would
> impress more, I think.

That would be a very relevant benchmark, yes. Omitted in this version just because the interface is a bit different from what's used in the prototype.

> What percentage of Ec scalar mul is used for the final inversion? Not
> much, I suppose.

After a quick look at my benchmark numbers, if one compares the time for A. inversion modulo the field prime, and B. scalar multiplication using the fixed generator point, with output in projective or jacobian coordinates (i.e., inverse free), then the ratio A/B is in the range 10% (smaller curves) to 16% for ed448.

> Does Nettle use exponentiation there today, or
> mpn_sec_invert?

Currently uses exponentiation for the field inverse, and nettle's version of sec_invert for inverses mod the group order, as needed by ecdsa. Except for the somewhat obscure gost curves (Russian standards), where sec_invert is used for all inverses, and A/B for gost-gc512a is almost 30%.

Regards, /Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.
Re: Fast constant-time gcd computation and modular inversion
>>> count = (49 * bits + 57) / 17;
>>
>> Odd.
>
> For sure. This isn't based on local progress of the algorithm (there
> ain't no guaranteed progress for a short sequence of reduction steps),
> but based on rather complex analysis of the number of steps needed for
> the complete 2-adic quotient sequence.

That count is a proven "upper bound" of the number of iterations. Does the paper discuss how tight it is, i.e., is it close to a "lower bound" of the required number of iterations?

>> I think your measurements are a bit optimistic, though. My measurements
>> from slightly modified timing code suggest 4.5x slowdown, and more for
>> really small operands.
>
> Maybe, I tried to keep the timing really simple and portable. I simply
> up'ed the number of iterations (actually made them depend on the
> operand size).
>
> I've now implemented inverse too. See updated code below. When I run
> it, I get
>
> invert size 1, ref 0.293 (us), djb 1.145 (us)
> invert size 2, ref 0.721 (us), djb 1.659 (us)
> invert size 3, ref 0.994 (us), djb 2.375 (us)
> [...]
> invert size 17, ref 5.157 (us), djb 18.538 (us)
> invert size 18, ref 5.207 (us), djb 19.064 (us)
> invert size 19, ref 5.702 (us), djb 21.286 (us)
>
> so a slowdown of 3--4 compared to mpz_invert. This timing excludes the
> precomputation of a few needed per-modulo constants.

Very very nice! People not familiar with what we're trying to do here might be unimpressed with these results. "Look, I've written new code which is 4 times slower than what we have, HURRAY!" Comparing to the existing side-channel-silent inversion code would impress more, I think.
> I think the inverse flavor is still rather neat, main loop is
>
>   for (delta = 1; count > STEPS; count -= STEPS)
>     {
>       delta = steps (&M, STEPS, delta, fp[0], gp[0]);
>       matrix_vector_mul (&M, STEPS, n+1, fp, gp, tp);
>       matrix_vector_mul_mod (&M, Bp, n+2, up, vp, tp);
>     }

I can make it even neater:

  for (delta = 1; count > STEPS; count -= STEPS)
    {
      do_it_all (fp, gp, up, vp, n);
    }

:-)

> I'm thinking I'll try to implement and benchmark this for Nettle's ecc
> functions first, before attempting to update the GMP function. The
> reason is that (i) I really want to do the needed precomputations for
> all moduli of interest for Nettle at compile time, which would
> complicate the GMP interface if it is supported at all, and (ii) in
> some cases I want inversion to operate on numbers in montgomery
> representation, and then I'd like to fold any needed additional power
> of two into the precomputed constant.

What percentage of Ec scalar mul is used for the final inversion? Not much, I suppose. Does Nettle use exponentiation there today, or mpn_sec_invert?

-- 
Torbjörn
Please encrypt, key id 0xC8601622
Re: Fast constant-time gcd computation and modular inversion
Torbjörn Granlund writes:

> ni...@lysator.liu.se (Niels Möller) writes:
>
>> count = (49 * bits + 57) / 17;
>
> Odd.

For sure. This isn't based on local progress of the algorithm (there ain't no guaranteed progress for a short sequence of reduction steps), but based on rather complex analysis of the number of steps needed for the complete 2-adic quotient sequence.

> I think your measurements are a bit optimistic, though. My measurements
> from slightly modified timing code suggest 4.5x slowdown, and more for
> really small operands.

Maybe, I tried to keep the timing really simple and portable.

> This will still completely outclass the current sec code.

I've now implemented inverse too. See updated code below. When I run it, I get

invert size 1, ref 0.293 (us), djb 1.145 (us)
invert size 2, ref 0.721 (us), djb 1.659 (us)
invert size 3, ref 0.994 (us), djb 2.375 (us)
[...]
invert size 17, ref 5.157 (us), djb 18.538 (us)
invert size 18, ref 5.207 (us), djb 19.064 (us)
invert size 19, ref 5.702 (us), djb 21.286 (us)

so a slowdown of 3--4 compared to mpz_invert. This timing excludes the precomputation of a few needed per-modulo constants.

I haven't dug deeper into the performance, but I would expect that the "steps" function is significantly faster than hgcd2, since it's straight and simple code, with no unpredictable branches. But since the average progress for each computed transformation is less, we'll need a larger number of iterations in the outer loop (and for side-channel silence, we always go for the worst case, with no data-dependent condition to exit early as soon as g reaches 0). That implies that the constant for the O(n^2) part is larger, even if the O(n) part might be the same or smaller.
I think the inverse flavor is still rather neat, main loop is

  for (delta = 1; count > STEPS; count -= STEPS)
    {
      delta = steps (&M, STEPS, delta, fp[0], gp[0]);
      matrix_vector_mul (&M, STEPS, n+1, fp, gp, tp);
      matrix_vector_mul_mod (&M, Bp, n+2, up, vp, tp);
    }

I.e., process low limbs to get a (signed) transform matrix. Apply the matrix to numbers and cofactors. That's all. The case of large quotients, which has been messy in all previous variants, is handled incrementally by the "delta" counter, which will grow larger whenever we have to process a larger number of trailing zeros, and in that case, we'll get more progress sometime later.

This version represents the cofactors too as two's complement. They are reduced mod m for each matrix multiply, so they don't grow large, and I'm thinking that it might be simpler and/or more efficient to instead keep them in unsigned representation.

I'm thinking I'll try to implement and benchmark this for Nettle's ecc functions first, before attempting to update the GMP function. The reason is that (i) I really want to do the needed precomputations for all moduli of interest for Nettle at compile time, which would complicate the GMP interface if it is supported at all, and (ii) in some cases I want inversion to operate on numbers in montgomery representation, and then I'd like to fold any needed additional power of two into the precomputed constant.

Regards, /Niels

-8<---

/* Side channel silent gcd (work in progress)

   Copyright 2022 Niels Möller

   This is free software; you can redistribute it and/or modify it
   under the terms of either:

     * the GNU Lesser General Public License as published by the Free
       Software Foundation; either version 3 of the License, or (at your
       option) any later version.

   or

     * the GNU General Public License as published by the Free Software
       Foundation; either version 2 of the License, or (at your option)
       any later version.

   or both in parallel, as here.
   The GNU MP Library is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   General Public License for more details.

   You should have received copies of the GNU General Public License and
   the GNU Lesser General Public License along with the GNU MP Library.
   If not, see https://www.gnu.org/licenses/. */

#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <gmp.h>

#if GMP_NUMB_BITS < GMP_LIMB_BITS
#error Nails not supported.
#endif

#define BITCNT_BITS (sizeof(mp_bitcnt_t) * CHAR_BIT)
#define STEPS (GMP_LIMB_BITS - 2)
#define MAX(a,b) ((a) >= (b) ? (a) : (b))

struct matrix {
  /* Matrix elements interpreted as signed two's complement. Absolute
     value of elements is at most 2^STEPS. */
  mp_limb_t a[2][2];
};

/* Conditionally set (a, b) <-- (b, -a) */
#define CND_NEG_SWAP_LIMB(cnd, a, b) do {	\
    mp_limb_t __cnd_sum = (a) + (b);		\
    mp_limb_t __cnd_diff = (a) - (b);		\
    (a) -= __cnd_diff & -(cnd);			\
    (b) -= __cnd_sum & -(cnd);			\
  } while (0)

/* Perform STEPS elementary reduction steps on (delta, f, g). Only the
   least significant STEPS bits of f, g matter. Note that mp_bitcnt_t
   is an unsigned
Re: Fast constant-time gcd computation and modular inversion
ni...@lysator.liu.se (Niels Möller) writes:

> count = (49 * bits + 57) / 17;

Odd.

(For production code, one will need to cap the intermediates there, at least for 32-bit machines.) Perhaps something like this:

  count = (51 * bits - 2 * bits + 57) / 17
        = 3 * bits - (2 * bits - 57) / 17

> About a factor 3 slower than plain mpz_gcd on my machine. It will be
> interesting to see if slowdown for modular inversion is about the same.

Cool! I think your measurements are a bit optimistic, though. My measurements from slightly modified timing code suggest 4.5x slowdown, and more for really small operands.

This will still completely outclass the current sec code.

-- 
Torbjörn
Please encrypt, key id 0xC8601622
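The relation between the two forms of the bound can be checked exhaustively. The little test below is my own verification, not from the list: with truncating integer division, the rearranged form never undershoots the original, exceeds it by at most 1 (exactly when 17 divides 2*bits - 57), and keeps intermediates near 3*bits rather than 49*bits, so it cannot overflow a 32-bit mp_bitcnt_t for any sensible operand size.

```c
#include <assert.h>
#include <stdint.h>

/* Direct form of the iteration-count bound; the intermediate 49*bits
   overflows a 32-bit counter once bits exceeds about 87 million. */
static uint64_t
count_direct (uint64_t bits)
{
  return (49 * bits + 57) / 17;
}

/* Rearranged form with intermediates capped around 3*bits.
   Assumes bits >= 29, so that 2*bits - 57 is nonnegative. */
static uint64_t
count_rearranged (uint64_t bits)
{
  return 3 * bits - (2 * bits - 57) / 17;
}
```

Since 49*bits + 57 = 17*3*bits - (2*bits - 57), the two differ only in the rounding of the subtracted term, which is why the rearranged count can be one larger but never smaller.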
Re: Fast constant-time gcd computation and modular inversion
ni...@lysator.liu.se (Niels Möller) writes:

> If we require the largest matrix element, 2^k, to fit in a 64-bit
> signed, we can have at most k = 62.

I've implemented a first version doing GMP_LIMB_BITS-2 bits at a time. For a start, gcd only, not cofactors/inverse. Code appended below. It's rather neat. To be compared to the lehmer loop in mpn/generic/gcd.c, which involves an iteration calling mpn_hgcd2 and mpn_gcd_subdiv_step. The analogue of hgcd2 is the "steps" function below, a straight loop using masking tricks for the conditions, no unpredictable branches.

The most subtle part is the line

  count = (49 * bits + 57) / 17;

Proving that this is enough for the algorithm to terminate with the gcd is what the bulk of the proofs in the paper is about.

About a factor 3 slower than plain mpz_gcd on my machine. It will be interesting to see if the slowdown for modular inversion is about the same.

Regards, /Niels

--8<

/* Side channel silent gcd (work in progress)

   Copyright 2022 Niels Möller

   This is free software; you can redistribute it and/or modify it
   under the terms of either:

     * the GNU Lesser General Public License as published by the Free
       Software Foundation; either version 3 of the License, or (at your
       option) any later version.

   or

     * the GNU General Public License as published by the Free Software
       Foundation; either version 2 of the License, or (at your option)
       any later version.

   or both in parallel, as here.

   The GNU MP Library is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   General Public License for more details.

   You should have received copies of the GNU General Public License and
   the GNU Lesser General Public License along with the GNU MP Library.
   If not, see https://www.gnu.org/licenses/. */

#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <gmp.h>

#if GMP_NUMB_BITS < GMP_LIMB_BITS
#error Nails not supported.
#endif

#define BITCNT_BITS (sizeof(mp_bitcnt_t) * CHAR_BIT)
#define STEPS (GMP_LIMB_BITS - 2)
#define MAX(a,b) ((a) >= (b) ? (a) : (b))

struct matrix {
  /* Matrix elements interpreted as signed two's complement. Absolute
     value of elements is at most 2^STEPS. */
  mp_limb_t a[2][2];
};

/* Conditionally set (a, b) <-- (b, -a) */
#define CND_NEG_SWAP_LIMB(cnd, a, b) do {	\
    mp_limb_t __cnd_sum = (a) + (b);		\
    mp_limb_t __cnd_diff = (a) - (b);		\
    (a) -= __cnd_diff & -(cnd);			\
    (b) -= __cnd_sum & -(cnd);			\
  } while (0)

/* Perform STEPS elementary reduction steps on (delta, f, g). Only the
   least significant STEPS bits of f, g matter. Note that mp_bitcnt_t is
   an unsigned type, but is used as two's complement. */
static mp_bitcnt_t
steps (struct matrix *M, mp_bitcnt_t delta, mp_limb_t f, mp_limb_t g)
{
  mp_limb_t a00, a01, a10, a11;
  unsigned i;
  assert (f & 1);

  /* Identity matrix. */
  a00 = a11 = 1;
  a01 = a10 = 0;

  /* Preserve invariant (f ; g) = 2^{-i} M (f_orig; g_orig) */
  for (i = 0; i < STEPS; i++, delta++)
    {
      mp_limb_t odd = g & 1;
      mp_limb_t swap = odd & ~(delta >> (BITCNT_BITS-1));

      /* Swap; (f'; g') = (g; -f) = (0,1;-1,0) (f;g) */
      CND_NEG_SWAP_LIMB(swap, f, g);
      CND_NEG_SWAP_LIMB(swap, a00, a10);
      CND_NEG_SWAP_LIMB(swap, a01, a11);

      /* Conditional negation. */
      delta = (delta ^ - (mp_bitcnt_t) swap) + swap;

      /* Cancel low bit and shift. */
      g += f & -odd;
      a10 += a00 & -odd;
      a11 += a01 & -odd;

      g >>= 1;
      a00 <<= 1;
      a01 <<= 1;
    }
  M->a[0][0] = a00; M->a[0][1] = a01;
  M->a[1][0] = a10; M->a[1][1] = a11;

  return delta;
}

/* Set R = (u * F + v * G) >> STEPS, treating all numbers as two's
   complement. No overlap allowed.
*/
static void
add_add_mul_shift (mp_ptr rp, const mp_limb_t *gp, const mp_limb_t *fp,
                   mp_size_t n, mp_limb_t u, mp_limb_t v)
{
  mp_limb_t f_sign = fp[n-1] >> (GMP_LIMB_BITS - 1);
  mp_limb_t g_sign = gp[n-1] >> (GMP_LIMB_BITS - 1);
  mp_limb_t u_sign = u >> (GMP_LIMB_BITS - 1);
  mp_limb_t v_sign = v >> (GMP_LIMB_BITS - 1);
  mp_limb_t hi = mpn_mul_1 (rp, fp, n, u) - ((-f_sign) & u);
  mp_limb_t lo;

  hi -= ((-u_sign) & fp[n-1]) + mpn_cnd_sub_n (u_sign, rp + 1, rp + 1, fp, n-1);
  hi += mpn_addmul_1 (rp, gp, n, v) - ((-g_sign) & v);
  hi -= ((-v_sign) & gp[n-1]) + mpn_cnd_sub_n (v_sign, rp + 1, rp + 1, gp, n-1);

  lo = mpn_rshift (rp, rp, n, STEPS);
  assert (lo == 0);
  rp[n-1] += (hi << (GMP_LIMB_BITS - STEPS));
}

/* Update (f'; g') = M (f; g), where all numbers are two's complement. */
static void
matrix_vector_mul (const struct matrix *M, mp_size_t n,
                   mp_limb_t *fp, mp_limb_t *gp, mp_limb_t *tp)
{
  add_add_mul_shift (tp, gp, fp, n, M->a[0][0], M->a[0][1]);
  add_add_mul_shift (tp + n, gp, fp, n, M->a[1][0], M->a[1][1]);
  mpn_copyi (fp, tp, n);
  mpn_copyi (gp, tp + n, n);
}

static void
Re: Fast constant-time gcd computation and modular inversion
ni...@lysator.liu.se (Niels Möller) writes:

> Then 96 bits seems to be the maximum number of low-end bits that can be
> eliminated, under the constraint that elements of the corresponding
> transformation matrix should fit in 64-bit (signed) limbs.

I had another look into this (that is, the paper https://eprint.iacr.org/2019/266.pdf) today, thinking that some careful analysis could prove a bound like this. But I now think it's not possible, and taking the worst case into account means we can fill a 64-bit matrix with significantly fewer than 64 bits eliminated.

To review, we examine the least significant k bits of a, b (a odd), and construct a matrix (with integer elements, determinant ±2^k) that eliminates those least significant bits. So we can set

  (a'; b') = 2^{-k} M (a; b)

We then hope that the elements of M are smaller than k bits, so we actually make progress. The paper proves progress, by splitting the reduction into groups representing 2-adic divisions and deriving bounds for the norm in a rather complicated way. But we don't have any such bound locally. E.g., consider the simple case that it turns out that the low k bits of b are all zeros (I'm not yet sure this is the worst case, but I hope it can't get much worse). Then the best we can do is a' = a, b' = b/2^k, i.e.,

  (a'; b') = (a; b/2^k) = 2^{-k} (2^k, 0; 0, 1) (a; b)

If we require the largest matrix element, 2^k, to fit in a 64-bit signed, we can have at most k = 62.

The paper proves that we make sufficient progress on average, as the reductions continue, but it seems that doesn't translate to progress for shorter segments of reductions. And maybe not so surprising, if we compare to the current Lehmer algorithm: Split off the top two limbs, compute a matrix with single-limb elements. Apply the matrix, making roughly one limb of progress. Except when we encounter a really large quotient; then it can't be represented as a matrix with single-limb elements, and we instead need a large division, i.e., a larger step with larger progress.
The new algorithm can't magically make that large quotient case vanish; the trick is that it can handle it incrementally, without special cases to cause side-channel leakage. But then those incremental steps will not make any clear progress on their own, only when combined to correspond to a large quotient.

Regards, /Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.
Re: Fast constant-time gcd computation and modular inversion
Torbjörn Granlund writes:

> ni...@lysator.liu.se (Niels Möller) writes:
>
>> Extract least significant 96 bits of each number.
>
> Is that 3 32-bit limbs or 1.5 64-bit limbs?

I was thinking about 64-bit archs. Then 96 bits seems to be the maximum number of low-end bits that can be eliminated, under the constraint that elements of the corresponding transformation matrix should fit in 64-bit (signed) limbs. And there's no harm in extracting two full limbs (128 bits), it's just that the transformation matrix doesn't depend in any way on those extra bits.

For a 32-bit transform matrix, one could do less (roughly half, not sure if it's precisely 48 bits).

Regards, /Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.
Re: Fast constant-time gcd computation and modular inversion
ni...@lysator.liu.se (Niels Möller) writes:

> Extract least significant 96 bits of each number.

Is that 3 32-bit limbs or 1.5 64-bit limbs?

-- 
Torbjörn
Please encrypt, key id 0xC8601622
Re: Fast constant-time gcd computation and modular inversion
Torbjörn Granlund writes:

> Any algorithm with these properties would be a huge improvement compared
> to what we have today:
>
> 0. No side-channel leakage of anything but bit counts of input operands.
>    (I suppose there are usages where the mod argument is not sensitive,
>    such as for the elliptic curve usages). This is already true for the
>    present, slow code.
>
> 1. Computation of a 2 x 2 matrix in O(1) time. Not necessarily fast.
>
> 2. Guaranteed reduction of intermediate operand bit count by a full limb
>    each.
>
> 3. If the reduction in (2) happens to be greater than one limb, the
>    subsequent iteration of (1) must still work. I.e., it must allow the
>    bits it depends on to be zero and should not need to locate, say, the
>    least significant 1 bit of the full operands.

I think the new algorithm by djb and Bo-Yin Yang looks very promising. I think it could work something like this: Extract the least significant 96 bits of each number. Compute (in O(1)) a matrix that eliminates those bits. Matrix elements are 64-bit signed (it's not entirely clear from the proof in the paper that they can't overflow 64-bit signed; I've tried mailing the authors for clarification). Then multiply the original numbers by that matrix, and shift out the low 96 bits. Then we have a net reduction of 32 bits. Due to signedness, we must treat everything as two's complement.

To avoid actual shifts, we can group two such iterations, eliminating three 64-bit limbs at the low end, and growing by at most two limbs at the high end.

The new clever trick to deal with issues like (3) is to extend the algorithm state with an auxiliary shift count. The resulting inverse will be off by a power of two, which needs to be eliminated with some final processing (or maybe treated specially, in case numbers are in redc form).

Regards, /Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.
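The two's complement bookkeeping rests on one identity: interpreting 64-bit words x, y as signed values x - s_x*2^64 and y - s_y*2^64 (s_x, s_y the sign bits), the signed product agrees mod 2^128 with the unsigned product minus sign-dependent corrections confined to the high half. Here is a hedged single-limb sketch of that identity (my own illustration; the multi-limb code would make the analogous corrections around unsigned primitives like mpn_mul_1/mpn_addmul_1):

```c
#include <assert.h>
#include <stdint.h>

/* Signed 64x64 -> 128 multiply built from the unsigned product plus
   high-half corrections, using the identity (mod 2^128):
     (x - sx*2^64)(y - sy*2^64) == x*y - (sx*y + sy*x)*2^64
   This is what lets unsigned limb arithmetic carry out the signed
   matrix-vector products. */
static unsigned __int128
signed_mul (uint64_t x, uint64_t y)
{
  unsigned __int128 p = (unsigned __int128) x * y;  /* unsigned product */
  uint64_t sx = x >> 63, sy = y >> 63;              /* sign bits */
  p -= (unsigned __int128) (-sx & y) << 64;         /* subtract y<<64 if x "negative" */
  p -= (unsigned __int128) (-sy & x) << 64;         /* subtract x<<64 if y "negative" */
  return p;
}
```

The sx*sy*2^128 cross term of the expansion vanishes mod 2^128, which is why only two corrections are needed.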
Re: Fast constant-time gcd computation and modular inversion
Any algorithm with these properties would be a huge improvement compared to what we have today:

0. No side-channel leakage of anything but bit counts of input operands.
   (I suppose there are usages where the mod argument is not sensitive,
   such as for the elliptic curve usages). This is already true for the
   present, slow code.

1. Computation of a 2 x 2 matrix in O(1) time. Not necessarily fast.

2. Guaranteed reduction of intermediate operand bit count by a full limb
   each.

3. If the reduction in (2) happens to be greater than one limb, the
   subsequent iteration of (1) must still work. I.e., it must allow the
   bits it depends on to be zero and should not need to locate, say, the
   least significant 1 bit of the full operands.

-- 
Torbjörn
Please encrypt, key id 0xC8601622
Re: Fast constant-time gcd computation and modular inversion
Albin Ahlbäck writes:

> Have you looked at https://eprint.iacr.org/2020/972.pdf, where the
> author seems to suggest an even faster algorithm? Or at least it was
> faster under heavy optimization under the assumption of what inputs
> the algorithm received.

No, I wasn't aware of that. I've now had a quick look (algorithm 2, page 6, seems to be the main part). It's pretty close to standard extended binary, and like gmp's mpn_sec_invert, a single inner-loop iteration is one conditional subtraction of the smaller number from the larger and a right shift by a single bit.

The trick (which is new to me) is to reduce k-1 bits by taking the least significant k-1 bits *and* the most significant approx k+1 bits of both numbers, and then the inner loop operates only on these smaller numbers, ignoring the middle bits. The iteration steps are collected into a transformation matrix, and then there's an outer loop applying that transformation matrix to the complete numbers and the cofactors.

I haven't read the correctness argument, but there seem to be some subtle issues: At line 3, n = max(len(a), len(b), 2k) makes n data dependent, which in turn makes side-channel silent extraction of the top bits, floor (a / 2^{n-k-1}), a bit tricky, since the way these bits straddle limb boundaries becomes data dependent. And the need for the conditional negations (lines 18 -- 21) seems related to rounding errors from the ignored middle bits.

Regards, /Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.
Re: Fast constant-time gcd computation and modular inversion
On 24/05/22 11:59, Niels Möller wrote:

> Hi, I've had a first look at the paper by djb and Bo-Yin Yang,
> https://eprint.iacr.org/2019/266.pdf. Mainly focusing on the integer
> case.

Have you looked at https://eprint.iacr.org/2020/972.pdf, where the author seems to suggest an even faster algorithm? Or at least it was faster under heavy optimization, under assumptions on what inputs the algorithm received.

Best, Albin
Re: Fast constant-time gcd computation and modular inversion
ni...@lysator.liu.se (Niels Möller) writes:

> The new algorithm in the above paper uses a small auxiliary state
> variable d (or \delta in the paper) to decide the update in the case
> both a, b are odd. If I write the iteration step using the same notation
> as above, it's
>
>   if a even:
>     a <-- a / 2, d <-- 1 + d
>
>   else if d <= 0:
>     a <-- (a + b) / 2, d <-- 1 + d
>
>   else: // d > 0
>     [a, b] <-- [(b - a)/2, a], d <-- 1 - d
>
> So conditions depend only on the sign of the auxiliary d, and the least
> significant bit of a.
[...]
> In which way the decision logic based on sign of d guarantees progress
> is not clear to me (I haven't tried to read all the proofs), but I guess
> d should be understood as an approximation of bitsize(b) - bitsize(a)

The role of d is explained by the proof of G.1 (page 55 in the paper). It can be understood as just a shift count.

To reduce the risk that I mess up, let's use the paper's notation. Then the iteration operates on (d, f, g), f always odd:

  if g even:
    g <-- g / 2, d <-- 1 + d

  else if d <= 0:
    g <-- (g + f) / 2, d <-- 1 + d

  else: // d > 0
    [f, g] <-- [g, (g - f)/2], d <-- 1 - d

In the paper's notation, say f is odd, g is even, g = 2^e g' with g' odd, e >= 1, and we start the algorithm with

  (d_0, f_0, g_0) = (1, f, g/2)

In the first e-1 steps, we take out the powers of 2 from g, and get

  (d_{e-1}, f_{e-1}, g_{e-1}) = (e, f, g')

Then we get to the both-odd state with d > 0, so we get

  (d_e, f_e, g_e) = (1-e, g', (g' - f)/2)

Now, d_e <= 0, and it remains so for the next e-1 steps. In these steps, we keep reducing the (g' - f) term, adding g' whenever we have an odd value. After a total of 2e steps, we're back at d = 1, and

  (d_{2e}, f_{2e}, g_{2e}) = (1, g', (q g' - f)/2^{e+1})

where q is an odd number in the range {1, 3, 5, ..., 2^{e+1} - 1}. So the result after these 2e steps is that we're back at the initial value d = 1.
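As a sanity check on these iteration rules, here is my own toy model on word-sized integers, with ordinary branches (the real code replaces them by the masking tricks): iterating the (d, f, g) step does drive g to 0, with |f| ending at the gcd.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the (d, f, g) iteration above, plain branches, nothing
   constant-time.  f must be odd.  Every branch leaves g even before the
   halving, so all divisions by 2 are exact.  Iterates until g == 0,
   at which point |f| is the gcd of the inputs. */
static int64_t
divstep_gcd (int64_t f, int64_t g)
{
  int64_t d = 1;
  assert (f & 1);
  while (g != 0)
    {
      if ((g & 1) == 0)             /* g even: shift out one zero */
        { g /= 2; d++; }
      else if (d <= 0)              /* both odd, d <= 0: add and halve */
        { g = (g + f) / 2; d++; }
      else                          /* both odd, d > 0: swap and halve */
        { int64_t t = f; f = g; g = (g - t) / 2; d = 1 - d; }
    }
  return f < 0 ? -f : f;
}
```

For example, divstep_gcd(15, 10) returns 5, via the states (d, f, g) = (1, 15, 10) -> (2, 15, 5) -> (-1, 5, -5) -> (0, 5, 0).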
And the update of f and g is very similar to the 2-adic division step in the Stehlé-Zimmermann algorithm (the book-keeping of the powers of two is slightly different, there's a sign change, and the quotient interval isn't symmetric around zero). By decomposing the 2-adic division into these 2e steps, the steps can be implemented in a constant-time fashion. And one gets a lot of flexibility in how to split off the least significant part to work on.

Bounds on the worst case (which determines the iteration count when using it for constant-time inverse) are based on computer-aided proofs. The paper also claims that this choice of quotient interval, rather than the symmetric interval {1 - 2^e, 3 - 2^e, ..., -1, 1, ..., 2^e - 1} of Stehlé-Zimmermann, gives a better worst-case bound, and that the worst-case analysis in that paper isn't quite right.

Regards, /Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.