Re: Fast constant-time gcd computation and modular inversion

2022-09-29 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes:

> I've now implemented inverse too.
 
And now I've tried a different variant, using redc for the cofactor
updates. Cofactors are now canonically reduced (which seems unexpectedly
cheap). In the case that m is not normalized, so that 2m fits in n
limbs, one could use a more relaxed range, 0 <= u, v < 2m, and get rid
of a conditional adjustment when doing redc in the loop.

Using redc is a rather good fit, because in each iteration we shift f,
g *almost* a limb (62 bits on a 64-bit machine), except for the last
iteration where we do fewer bits, while cofactors are multiplied by
2^{-64} mod m in each iteration. So there's a discrepancy of a few bits
per iteration, which one can handle using very few extra redc calls
outside of the loop (it's O(n^2) work, but with a pretty small
constant).
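
As a rough illustration of the redc step (a sketch only, not the actual
patch; the helper name and argument convention are mine), one Montgomery
step u <-- u * 2^{-64} (mod m) with canonical output, assuming m is
normalized and minv = -m^{-1} mod 2^64 (the negated binvert_limb value),
could look like:

static void
redc_limb (mp_ptr up, mp_srcptr mp, mp_size_t n, mp_limb_t minv)
{
  mp_limb_t q, hi, borrow;

  q = up[0] * minv;                  /* makes the low limb of u + q*m zero */
  hi = mpn_addmul_1 (up, mp, n, q);  /* u += q*m, returns the carry limb */
  mpn_copyi (up, up + 1, n - 1);     /* exact division by 2^GMP_LIMB_BITS */
  up[n - 1] = hi;

  /* The result is below 2m when m is normalized; make it canonical
     without a data-dependent branch. */
  borrow = mpn_sub_n (up, up, mp, n);
  mpn_cnd_add_n (borrow, up, up, mp, n);
}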

Speed is rather close to the previous version, but the only needed
precomputation is binvert_limb for the redc. And the calls to
mpn_sec_mul and mpn_sec_div_r are gone.

I also added benchmarking of the existing mpn_sec_invert, for
comparison. This is how it looks now:

invert size 1, ref 0.327 (us), old 2.771 (us), djb 1.087 (us)
invert size 2, ref 0.757 (us), old 5.366 (us), djb 1.571 (us)
invert size 3, ref 1.082 (us), old 8.110 (us), djb 2.336 (us)
[...]
invert size 17, ref 5.106 (us), old 163.962 (us), djb 18.082 (us)
invert size 18, ref 5.189 (us), old 170.565 (us), djb 18.608 (us)
invert size 19, ref 5.715 (us), old 185.143 (us), djb 20.794 (us)

One possible optimization would be to keep cofactors in signed (two's
complement) representation for the first few iterations, when they are
still small and don't need any reduction. And cofactor update for the
very first iteration could be much more efficient.

Regards,
/Niels

/* Side channel silent gcd (work in progress)

Copyright 2022 Niels Möller

This is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
Software Foundation; either version 3 of the License, or (at your
option) any later version.

or

  * the GNU General Public License as published by the Free Software
Foundation; either version 2 of the License, or (at your option) any
later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.  */

#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <gmp.h>

#if GMP_NUMB_BITS < GMP_LIMB_BITS
#error Nails not supported.
#endif

#define BITCNT_BITS (sizeof(mp_bitcnt_t) * CHAR_BIT)

#define STEPS (GMP_LIMB_BITS - 2)

#define MAX(a,b) ((a) >= (b) ? (a) : (b))

struct matrix {
  /* Matrix elements interpreted as signed two's complement. Absolute
 value of elements is at most 2^STEPS. */
  mp_limb_t a[2][2];
};

/* Conditionally set (a, b) <-- (b, -a) */
#define CND_NEG_SWAP_LIMB(cnd, a, b) do {\
mp_limb_t __cnd_sum = (a) + (b);	 \
mp_limb_t __cnd_diff = (a) - (b);	 \
(a) -= __cnd_diff & -(cnd);		 \
(b) -= __cnd_sum & -(cnd);		 \
  } while (0)

/* Perform STEPS elementary reduction steps on (delta, f, g). Only the
   least significant STEPS bits of f, g matter. Note that mp_bitcnt_t
   is an unsigned type, but is used as two's complement. */
static mp_bitcnt_t
steps(struct matrix *M, unsigned k, mp_bitcnt_t delta, mp_limb_t f, mp_limb_t g)
{
  mp_limb_t a00, a01, a10, a11;
  assert (f & 1);

  /* Identity matrix. */
  a00 = a11 = 1;
  a01 = a10 = 0;

  /* Preserve invariant (f ; g) = 2^{-i} M (f_orig, g_orig) */
  for (; k-- > 0; delta++)
{
  mp_limb_t odd = g & 1;
  mp_limb_t swap = odd & ~(delta >> (BITCNT_BITS-1));

  /* Swap; (f'; g') = (g; -f) = (0,1;-1,0) (g;f) */
  CND_NEG_SWAP_LIMB(swap, f, g);
  CND_NEG_SWAP_LIMB(swap, a00, a10);
  CND_NEG_SWAP_LIMB(swap, a01, a11);

  /* Conditional negation. */
  delta = (delta ^ - (mp_bitcnt_t) swap) + swap;

  /* Cancel low bit and shift. */
  g += f & -odd;
  a10 += a00 & -odd;
  a11 += a01 & -odd;

  g >>= 1;
  a00 <<= 1;
  a01 <<= 1;
}
  M->a[0][0] = a00; M->a[0][1] = a01;
  M->a[1][0] = a10; M->a[1][1] = a11;
  return delta;
}

/* Set R = (u * F + v * G), treating all numbers as two's complement.
   No overlap allowed. */
static mp_limb_t
add_add_mul (mp_ptr rp, const mp_limb_t *fp, const mp_limb_t *gp, mp_size_t n,
	 mp_limb_t u, mp_limb_t v) {
  mp_limb_t f_sign = fp[n-1] >> (GMP_LIMB_BITS - 1);
  mp_limb_t g_sign = gp[n-1] >> (GMP_LIMB_BITS - 1);
  mp_limb_t u_sign = u >> (GMP_LIMB_BITS - 1);
  mp_limb_t v_sign = v >> (GMP_LIMB_BITS - 1);
  mp_limb_t hi = mpn_mul_1 

Re: Fast constant-time gcd computation and modular inversion

2022-09-04 Thread Torbjörn Granlund
Marco Bodrato  writes:

  We should start writing mpn_sec_binvert :-)

I think mpn_binvert is almost sec_ naturally.

The exception is when sbpi1_bdiv_q.c or dbpi1_bdiv_q.c are invoked.  The
former has some < on data (for carry computations) and the latter has an
mpn_incr_u which is very leaky.

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: Fast constant-time gcd computation and modular inversion

2022-09-03 Thread Marco Bodrato

On 2022-09-01 17:04, Torbjörn Granlund wrote:

/* FIXME: Using mpz_invert is cheating. Instead, first compute m' =
   m^-1 (mod 2^k) via Newton/Hensel. We can then get the inverse via

   2^{-k} (mod m) = ((2^k - m') * m + 1)/2^k. */
mpz_invert (t, t, m);
mpn_copyi (info->ip, mpz_limbs_read (t), mpz_size (t));

You might want to use mpn_binvert here.


We should start writing mpn_sec_binvert :-)
Or use the current mpn_sec_invert for the precomputation.

Ĝis,
m


Re: Fast constant-time gcd computation and modular inversion

2022-09-01 Thread Torbjörn Granlund
/* FIXME: Using mpz_invert is cheating. Instead, first compute m' =
   m^-1 (mod 2^k) via Newton/Hensel. We can then get the inverse via

   2^{-k} (mod m) = ((2^k - m') * m + 1)/2^k. */
mpz_invert (t, t, m);
mpn_copyi (info->ip, mpz_limbs_read (t), mpz_size (t));

You might want to use mpn_binvert here.
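
For reference, a minimal mpz-level sketch of that identity might look like
this (mpz_invert standing in for the suggested mpn_binvert; m assumed odd,
and the helper name is just illustrative):

/* r <-- 2^{-k} (mod m), for odd m, via m' = m^{-1} (mod 2^k).
   r must be initialized by the caller; the result lies in (0, m). */
static void
two_adic_to_modular_inverse (mpz_t r, const mpz_t m, mp_bitcnt_t k)
{
  mpz_t t, mi;
  mpz_inits (t, mi, NULL);
  mpz_setbit (t, k);           /* t = 2^k */
  mpz_invert (mi, m, t);       /* m' = m^{-1} mod 2^k */
  mpz_sub (r, t, mi);          /* 2^k - m' */
  mpz_mul (r, r, m);
  mpz_add_ui (r, r, 1);        /* (2^k - m') * m + 1, divisible by 2^k */
  mpz_tdiv_q_2exp (r, r, k);   /* exact shift; now 2^k * r == 1 (mod m) */
  mpz_clears (t, mi, NULL);
}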

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: Fast constant-time gcd computation and modular inversion

2022-08-31 Thread Niels Möller
Torbjörn Granlund  writes:

> Why do you use sec_invert when inverting mod the group order when that
> is of prime order?  (Yes, this question will become moot I suppose with
> this new algorithm.)

No good reason, it's just that I implemented inverse-by-powering (with a
hand-tuned addition chain) as a side effect of implementing square root,
since in some cases they can share much of the addition chain, and that
work touched field prime arithmetic only.

Sorry we're getting a bit off topic, we should take nettle discussion
elsewhere.

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.


Re: Fast constant-time gcd computation and modular inversion

2022-08-31 Thread Torbjörn Granlund
  Much more unclear to me how close it might be to the typical or average
  number of iterations needed.

That's perhaps not very interesting, as early exit is not an option
here.  (Unless this algorithm would beat plain, "leaky" inverse.)

  Currently uses exponentiation for the field inverse, and nettle's
  version of sec_invert for inverses mod the group order, as needed by
  ecdsa. Except for the somewhat obscure gost curves (Russian standards),
  where sec_invert is used for all inverses, and A/B for gost-gc512a is
  almost 30%.

Why do you use sec_invert when inverting mod the group order when that
is of prime order?  (Yes, this question will become moot I suppose with
this new algorithm.)

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: Fast constant-time gcd computation and modular inversion

2022-08-31 Thread Niels Möller
Torbjörn Granlund  writes:

> That count is a proven "upper bound" of the number of iterations.  Does
> the paper discuss how tight it is, i.e., is it close to a "lower bound"
> of the required number of iterations.

As an upper bound, I think it's rather tight at least for smallish
sizes, since the proof includes an exhaustive search up to some small limit
on bit size, and the results from that guide the approach for the
general proof. (It's rather complex; I don't get all the details.)

Much more unclear to me how close it might be to the typical or average
number of iterations needed.

> Comparing to the existing side-channel-silent inversion code would 
> impress more, I think.

That would be a very relevant benchmark, yes. Omitted in this version
just because the interface is a bit different from what's used in the prototype.

> What percentage of Ec scalar mul is used for the final inversion?  Not
> much, I suppose. 

After a quick look at my benchmark numbers, if one compares the time for 

A. inversion modulo the field prime, and

B. scalar multiplication using the fixed generator point, and with
   output in projective or jacobian coordinates (i.e., inverse free),

then ratio A/B is in the range 10% (smaller curves) to 16% for ed448.

> Does Nettle use exponentiation there today, or
> mpn_sec_invert?

Currently uses exponentiation for the field inverse, and nettle's
version of sec_invert for inverses mod the group order, as needed by
ecdsa. Except for the somewhat obscure gost curves (Russian standards),
where sec_invert is used for all inverses, and A/B for gost-gc512a is
almost 30%.

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.


Re: Fast constant-time gcd computation and modular inversion

2022-08-31 Thread Torbjörn Granlund
  > count = (49 * bits + 57) / 17;
  >
  > Odd.

  For sure. This isn't based on local progress of the algorithm (there
  ain't no guaranteed progress for a short sequence of reduction steps),
  but based on rather complex analysis of the number of steps needed for
  the complete 2-adic quotient sequence.

That count is a proven "upper bound" of the number of iterations.  Does
the paper discuss how tight it is, i.e., is it close to a "lower bound"
of the required number of iterations.

  > I think your measurements are a bit optimistic, though.  My measurements
  > from slightly modified timing code suggest 4.5x slowdown, and more for
  > really small operands.

  Maybe, I tried to keep the timing really simple and portable.

I simply up'ed the number of iterations (actually made them depend on
the operand size).

  I've now implemented inverse too. See updated code below. When I run it,
  I get

  invert size 1, ref 0.293 (us), djb 1.145 (us)
  invert size 2, ref 0.721 (us), djb 1.659 (us)
  invert size 3, ref 0.994 (us), djb 2.375 (us)
  [...]
  invert size 17, ref 5.157 (us), djb 18.538 (us)
  invert size 18, ref 5.207 (us), djb 19.064 (us)
  invert size 19, ref 5.702 (us), djb 21.286 (us)

  so a slowdown of 3--4 compared to mpz_invert. This timing excludes the
  precomputation of a few needed per-modulo constants.

Very very nice!

People not familiar with what we're trying to do here might be
unimpressed with these results.  "Look, I've written new code which is 4
times slower than what we have, HURRAY!"

Comparing to the existing side-channel-silent inversion code would 
impress more, I think.

  I think the inverse flavor is still rather neat, main loop is

  for (delta = 1;count > STEPS;count -= STEPS)
    {
      delta = steps (&M, STEPS, delta, fp[0], gp[0]);
      matrix_vector_mul (&M, STEPS, n+1, fp, gp, tp);
      matrix_vector_mul_mod (&M, Bp, n+2, up, vp, tp);
    }

I can make it even neater:

for (delta = 1;count > STEPS;count -= STEPS)
  {
    do_it_all (fp, gp, up, vp, n);
  }

:-)

  I'm thinking I'll try to implement and benchmark this for Nettle's ecc
  functions first, before attempting to update GMP function. The reason is
  that (i) I really want to do needed precomputations for all moduli of
  interest for Nettle at compile time, which would complicate the GMP
  interface if it is supported at all, and (ii) in some cases I want
  inversion to operate on numbers in montgomery representation, and then
  I'd like to fold any needed additional power of two into the precomputed
  constant.

What percentage of Ec scalar mul is used for the final inversion?  Not
much, I suppose.  Does Nettle use exponentiation there today, or
mpn_sec_invert?


-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: Fast constant-time gcd computation and modular inversion

2022-08-30 Thread Niels Möller
Torbjörn Granlund  writes:

> ni...@lysator.liu.se (Niels Möller) writes:
>
> count = (49 * bits + 57) / 17;
>
> Odd.

For sure. This isn't based on local progress of the algorithm (there
ain't no guaranteed progress for a short sequence of reduction steps),
but based on rather complex analysis of the number of steps needed for
the complete 2-adic quotient sequence.

> I think your measurements are a bit optimistic, though.  My measurements
> from slightly modified timing code suggest 4.5x slowdown, and more for
> really small operands.

Maybe, I tried to keep the timing really simple and portable.

> This will still completely outclass the current sec code.

I've now implemented inverse too. See updated code below. When I run it,
I get

invert size 1, ref 0.293 (us), djb 1.145 (us)
invert size 2, ref 0.721 (us), djb 1.659 (us)
invert size 3, ref 0.994 (us), djb 2.375 (us)
[...]
invert size 17, ref 5.157 (us), djb 18.538 (us)
invert size 18, ref 5.207 (us), djb 19.064 (us)
invert size 19, ref 5.702 (us), djb 21.286 (us)

so a slowdown of 3--4 compared to mpz_invert. This timing excludes the
precomputation of a few needed per-modulo constants.

I haven't dug deeper into the performance, but I would expect that
the "steps" function is significantly faster than hgcd2, since it's
straight and simple code, with no unpredictable branches. But since
average progress for each computed transformation is less, we'll need a
larger number of iterations in the outer loop (and for side-channel
silence, we always go for the worst case, no data-dependent condition to
exit early as soon as g reaches 0). That implies that the constant for
the O(n^2) part is larger, even if the O(n) part might be the same or smaller.

I think the inverse flavor is still rather neat, main loop is

  for (delta = 1;count > STEPS;count -= STEPS)
    {
      delta = steps (&M, STEPS, delta, fp[0], gp[0]);
      matrix_vector_mul (&M, STEPS, n+1, fp, gp, tp);
      matrix_vector_mul_mod (&M, Bp, n+2, up, vp, tp);
    }

I.e., process low limbs to get a (signed) transform matrix. Apply matrix
to numbers and cofactors. That's all. The case of large quotients, which
has been messy in all previous variants, is handled incrementally by the
"delta" counter, which will grow larger whenever we have to process a
larger number of trailing zeros, and in that case, we'll get more
progress sometime later.

This version represents the cofactors too as two's complement. They are
reduced mod m for each matrix multiply, so they don't grow large, and
I'm thinking that it might be simpler and/or more efficient to instead
keep them in unsigned representation.
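
For the unsigned variant, the conversion itself would be cheap; a sketch
(illustrative helper, not the posted code), assuming the reduced cofactor
fits in n limbs as a signed value in [-m, m):

static void
cofactor_make_unsigned (mp_ptr up, mp_srcptr mp, mp_size_t n)
{
  mp_limb_t u_sign = up[n - 1] >> (GMP_LIMB_BITS - 1);  /* 1 iff u < 0 */
  mpn_cnd_add_n (u_sign, up, up, mp, n);                 /* u += m when negative */
}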

I'm thinking I'll try to implement and benchmark this for Nettle's ecc
functions first, before attempting to update GMP function. The reason is
that (i) I really want to do needed precomputations for all moduli of
interest for Nettle at compile time, which would complicate the GMP
interface if it is supported at all, and (ii) in some cases I want
inversion to operate on numbers in montgomery representation, and then
I'd like to fold any needed additional power of two into the precomputed
constant.

Regards,
/Niels

-8<---

/* Side channel silent gcd (work in progress)

Copyright 2022 Niels Möller

This is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
Software Foundation; either version 3 of the License, or (at your
option) any later version.

or

  * the GNU General Public License as published by the Free Software
Foundation; either version 2 of the License, or (at your option) any
later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.  */

#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <gmp.h>

#if GMP_NUMB_BITS < GMP_LIMB_BITS
#error Nails not supported.
#endif

#define BITCNT_BITS (sizeof(mp_bitcnt_t) * CHAR_BIT)

#define STEPS (GMP_LIMB_BITS - 2)

#define MAX(a,b) ((a) >= (b) ? (a) : (b))

struct matrix {
  /* Matrix elements interpreted as signed two's complement. Absolute
 value of elements is at most 2^STEPS. */
  mp_limb_t a[2][2];
};

/* Conditionally set (a, b) <-- (b, -a) */
#define CND_NEG_SWAP_LIMB(cnd, a, b) do {\
mp_limb_t __cnd_sum = (a) + (b); \
mp_limb_t __cnd_diff = (a) - (b);\
(a) -= __cnd_diff & -(cnd);  \
(b) -= __cnd_sum & -(cnd);   \
  } while (0)

/* Perform STEPS elementary reduction steps on (delta, f, g). In the
   least significant STEPS bits of f, g matter. Note that mp_bitcnt_t
   is an unsigned 

Re: Fast constant-time gcd computation and modular inversion

2022-08-24 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

count = (49 * bits + 57) / 17;

Odd.

(For production code, one will need to cap the intermediates there,
at least for 32-bit machines.  Perhaps something like this:

count = (51 * bits - 2 * bits + 57) / 17 =
  = 3 * bits  - (2 * bits - 57) / 17.)
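
In C, that capped bound might be written as follows (a sketch with an
illustrative helper name; the rewritten form is always >= the exact
(49 * bits + 57) / 17 and exceeds it by at most one, i.e., at most one
harmless extra iteration):

static mp_bitcnt_t
iteration_count (mp_bitcnt_t bits)
{
  if (bits < 29)                   /* keep 2 * bits - 57 non-negative */
    return (49 * bits + 57) / 17;  /* no overflow possible for such small bits */
  return 3 * bits - (2 * bits - 57) / 17;
}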

  About a factor 3 slower than plain mpz_gcd on my machine. It will be
  interesting to see if slowdown for modular inversion is about the same.

Cool!

I think your measurements are a bit optimistic, though.  My measurements
from slightly modified timing code suggest 4.5x slowdown, and more for
really small operands.

This will still completely outclass the current sec code.


-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: Fast constant-time gcd computation and modular inversion

2022-08-24 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes:

> If we require the largest matrix element, 2^k, to fit in a 64-bit
> signed, we can have at most k = 62.

I've implemented a first version doing GMP_LIMB_BITS-2 bits at a time.
For a start, gcd only, not cofactors/inverse. Code appended below.

It's rather neat. To be compared to the lehmer loop in
mpn/generic/gcd.c, which involves an iteration calling mpn_hgcd2 and
mpn_gcd_subdiv_step.

The analogue of hgcd2 is the "steps" function below, a straight loop
using masking tricks for the conditions, no unpredictable branches.

The most subtle part is the line

  count = (49 * bits + 57) / 17;

Proving that this is enough for the algorithm to terminate with the gcd
is what the bulk of the proofs in the paper is about.

About a factor 3 slower than plain mpz_gcd on my machine. It will be
interesting to see if slowdown for modular inversion is about the same.

Regards,
/Niels

--8<
/* Side channel silent gcd (work in progress)

Copyright 2022 Niels Möller

This is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
Software Foundation; either version 3 of the License, or (at your
option) any later version.

or

  * the GNU General Public License as published by the Free Software
Foundation; either version 2 of the License, or (at your option) any
later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.  */

#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <gmp.h>

#if GMP_NUMB_BITS < GMP_LIMB_BITS
#error Nails not supported.
#endif

#define BITCNT_BITS (sizeof(mp_bitcnt_t) * CHAR_BIT)

#define STEPS (GMP_LIMB_BITS - 2)

#define MAX(a,b) ((a) >= (b) ? (a) : (b))

struct matrix {
  /* Matrix elements interpreted as signed two's complement. Absolute
 value of elements is at most 2^STEPS. */
  mp_limb_t a[2][2];
};

/* Conditionally set (a, b) <-- (b, -a) */
#define CND_NEG_SWAP_LIMB(cnd, a, b) do {\
mp_limb_t __cnd_sum = (a) + (b); \
mp_limb_t __cnd_diff = (a) - (b);\
(a) -= __cnd_diff & -(cnd);  \
(b) -= __cnd_sum & -(cnd);   \
  } while (0)

/* Perform STEPS elementary reduction steps on (delta, f, g). Only the
   least significant STEPS bits of f, g matter. Note that mp_bitcnt_t
   is an unsigned type, but is used as two's complement. */
static mp_bitcnt_t
steps(struct matrix *M, mp_bitcnt_t delta, mp_limb_t f, mp_limb_t g)
{
  mp_limb_t a00, a01, a10, a11;
  unsigned i;
  assert (f & 1);

  /* Identity matrix. */
  a00 = a11 = 1;
  a01 = a10 = 0;

  /* Preserve invariant (f ; g) = 2^{-i} M (f_orig, g_orig) */
  for (i = 0; i < STEPS; i++, delta++)
{
  mp_limb_t odd = g & 1;
  mp_limb_t swap = odd & ~(delta >> (BITCNT_BITS-1));

  /* Swap; (f'; g') = (g; -f) = (0,1;-1,0) (g;f) */
  CND_NEG_SWAP_LIMB(swap, f, g);
  CND_NEG_SWAP_LIMB(swap, a00, a10);
  CND_NEG_SWAP_LIMB(swap, a01, a11);

  /* Conditional negation. */
  delta = (delta ^ - (mp_bitcnt_t) swap) + swap;

  /* Cancel low bit and shift. */
  g += f & -odd;
  a10 += a00 & -odd;
  a11 += a01 & -odd;

  g >>= 1;
  a00 <<= 1;
  a01 <<= 1;
}
  M->a[0][0] = a00; M->a[0][1] = a01;
  M->a[1][0] = a10; M->a[1][1] = a11;
  return delta;
}

/* Set R = (u * F + v * G) >> STEPS, treating all numbers as two's
   complement. No overlap allowed. */
static void
add_add_mul_shift (mp_ptr rp, const mp_limb_t *gp, const mp_limb_t *fp,
		   mp_size_t n, mp_limb_t u, mp_limb_t v) {
  mp_limb_t f_sign = fp[n-1] >> (GMP_LIMB_BITS - 1);
  mp_limb_t g_sign = gp[n-1] >> (GMP_LIMB_BITS - 1);
  mp_limb_t u_sign = u >> (GMP_LIMB_BITS - 1);
  mp_limb_t v_sign = v >> (GMP_LIMB_BITS - 1);
  /* mpn_mul_1 and mpn_addmul_1 are unsigned; the terms subtracted from hi
     (and from the limbs above the lowest) compensate for f, g, u or v
     being negative in two's complement. */
  mp_limb_t hi = mpn_mul_1 (rp, fp, n, u) - ((-f_sign) & u);
  mp_limb_t lo;
  hi -= ((-u_sign) & fp[n-1]) + mpn_cnd_sub_n (u_sign, rp + 1, rp + 1, fp, n-1);

  hi += mpn_addmul_1 (rp, gp, n, v) - ((-g_sign) & v);
  hi -= ((-v_sign) & gp[n-1]) + mpn_cnd_sub_n (v_sign, rp + 1, rp + 1, gp, n-1);

  lo = mpn_rshift (rp, rp, n, STEPS);
  assert (lo == 0);
  rp[n-1] += (hi << (GMP_LIMB_BITS - STEPS));
}

/* Update (f'; g') = M (f; g), where all numbers are two's complement. */
static void
matrix_vector_mul (const struct matrix *M, mp_size_t n, mp_limb_t *fp,
		   mp_limb_t *gp, mp_limb_t *tp)
{
  add_add_mul_shift (tp, gp, fp, n, M->a[0][0], M->a[0][1]);
  add_add_mul_shift (tp + n, gp, fp, n, M->a[1][0], M->a[1][1]);
  mpn_copyi (fp, tp, n);
  mpn_copyi (gp, tp + n, n);
}

static void 

Re: Fast constant-time gcd computation and modular inversion

2022-08-23 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes:

> Then 96 bits seems to be the maximum number of low-end bits that can be
> eliminated, under the constraint that elements of the corresponding
> transformation matrix should fit in 64-bit (signed) limbs.

I had another look into this (that is, the paper
https://eprint.iacr.org/2019/266.pdf) today, thinking that some careful
analysis could prove a bound like this. But I now think it's not
possible, and taking the worst case into account means we can fill a
64-bit matrix with significantly less than 64 bits eliminated.

To review, we examine the least significant k bits of a,b (a odd),
construct a matrix (with integer elements, determinant ±2^k) that
eliminates those least significant low bits. So we can set

  (a'; b') = 2^{-k} M (a; b)

We then hope that elements of M are smaller than k bits, so we actually
make progress. The paper proves progress, by splitting into groups
representing 2-adic divisions and deriving bounds for the norm in a
rather complicated way.

But we don't have any such bound locally. E.g., consider the simple case
that it turns out that the low k bits of b are all zeros (I'm not yet
sure this is the worst case, but I hope it can't get much worse).

Then the best we can do is a' = a, b' = b/2^k, i.e.,

  (a'; b') = (a, b/2^k) = 2^{-k} (2^k, 0 ; 0, 1) (a; b)

If we require the largest matrix element, 2^k, to fit in a 64-bit
signed, we can have at most k = 62.

The paper proves that we make sufficient progress on average, as the
reductions continue, but it seems that doesn't translate to progress for
shorter segments of reductions.

And maybe not so surprising, if we compare to the current Lehmer
algorithm: Split off the top two limbs, compute a matrix with single limb
elements. Apply matrix, making roughly one limb of progress. Except when
we encounter a really large quotient; then it can't be represented as a
matrix with single-limb elements, and we instead need a large division,
i.e., a larger step with larger progress.

The new algorithm can't magically make that large quotient case vanish,
the trick is that it can handle it incrementally, without special cases
to cause side-channel leakage. But then those incremental steps will not
make any clear progress on their own, only when combined to correspond
to a large quotient.
 
Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.


Re: Fast constant-time gcd computation and modular inversion

2022-06-06 Thread Niels Möller
Torbjörn Granlund  writes:

> ni...@lysator.liu.se (Niels Möller) writes:
>
>   Extract least significant 96 bits of each number.
>
> Is that 3 32-bit limbs or 1.5 64-bit limbs?

I was thinking about 64-bit archs.

Then 96 bits seems to be the maximum number of low-end bits that can be
eliminated, under the constraint that elements of the corresponding
transformation matrix should fit in 64-bit (signed) limbs.

And there's no harm in extracting two full limbs (128 bits), it's just
that the transformation matrix doesn't depend in any way on those extra
bits.

For a 32-bit transform matrix, one could do less (roughly half, not sure
if it's precisely 48 bits).

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.


Re: Fast constant-time gcd computation and modular inversion

2022-06-06 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  Extract least significant 96 bits of each number.

Is that 3 32-bit limbs or 1.5 64-bit limbs?

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: Fast constant-time gcd computation and modular inversion

2022-06-06 Thread Niels Möller
Torbjörn Granlund  writes:

> Any algorithm with these properties would be a huge improvement compared
> to what we have today:
>
> 0. No side-channel leakage of anything but bit counts of input operands.
>(I suppose there are usages where the mod argument is not sensitive,
>such as for the elliptic curve usages).  This is already true for the
>present, slow code.
>
> 1. Computation of a 2 x 2 matrix in O(1) time.  Not necessarily fast.
>
> 2. Guaranteed reduction of intermediate operand bit count by a full limb
>each.
>
> 3. If the reduction in (2) happens to be greater than one limb, the
>subsequent iteration of (1) must still work.  I.e., it must allow the
>bits it depends on to be zero and should not need to locate, say, the
>least significant 1 bit of the full operands.

I think the new algorithm by djb and Bo-Yin Yang looks very promising. I
think it could work something like this:

Extract least significant 96 bits of each number.

Compute (in O(1)) a matrix that eliminates those bits. Matrix elements are
64-bit signed (it's not entirely clear from the proof in the paper that
they can't overflow 64 bit signed, I've tried mailing the authors for
clarification).

Then multiply original numbers by that matrix, and shift right 96 bits.
Then we have a net reduction of 32 bits. Due to signedness, we must treat
everything as two's complement.

To avoid actual shifts, we can group two such iterations, eliminating
three 64-bit limbs at the low end, and growing by at most two limbs at
the high end.

The new clever trick to deal with issues like (3) is to extend the
algorithm state with an auxiliary shift count.

The resulting inverse will be off by a power of two, which needs to be
eliminated with some final processing (or maybe treated specially, in
case numbers are in redc form).

Regards,
/Niels
 
-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.


Re: Fast constant-time gcd computation and modular inversion

2022-06-03 Thread Torbjörn Granlund
Any algorithm with these properties would be a huge improvement compared
to what we have today:

0. No side-channel leakage of anything but bit counts of input operands.
   (I suppose there are usages where the mod argument is not sensitive,
   such as for the elliptic curve usages).  This is already true for the
   present, slow code.

1. Computation of a 2 x 2 matrix in O(1) time.  Not necessarily fast.

2. Guaranteed reduction of intermediate operand bit count by a full limb
   each.

3. If the reduction in (2) happens to be greater than one limb, the
   subsequent iteration of (1) must still work.  I.e., it must allow the
   bits it depends on to be zero and should not need to locate, say, the
   least significant 1 bit of the full operands.

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: Fast constant-time gcd computation and modular inversion

2022-05-29 Thread Niels Möller
Albin Ahlbäck  writes:

> Have you looked at https://eprint.iacr.org/2020/972.pdf, where the
> author seems to suggest an even faster algorithm? Or at least it was
> faster under heavy optimization under the assumption of what inputs
> the algorithm received.

No, I wasn't aware of that. I've now had a quick look (algorithm 2, page
6, seems to be the main part).

It's pretty close to standard extended binary, and like gmp's
mpn_sec_invert, a single innerloop iteration is one conditional
subtraction of the smaller number from the larger and right shift of a
single bit.

The trick (which is new to me) is to reduce k-1 bits by taking the least
significant k-1 bits *and* most significant approx k+1 bits of both
numbers, and then the innerloop operates only on these smaller numbers,
ignoring the middle bits. The iteration steps are collected into a
transformation matrix.

And then have an outerloop applying that transformation matrix to the
complete numbers and the cofactors.

I haven't read the correctness argument, but there seem to be some
subtle issues: At line 3,

  n = max(len(a), len(b), 2k)

makes n data dependent, which in turn makes side-channel silent
extraction of top bits, floor (a / 2^{n-k-1}), a bit tricky, since the
way these bits straddle word boundaries becomes data dependent.

And the need for the conditional negations (lines 18 -- 21) seems
related to rounding errors from ignored middle bits.

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.


Re: Fast constant-time gcd computation and modular inversion

2022-05-28 Thread Albin Ahlbäck

On 24/05/22 11:59, Niels Möller wrote:
> Hi, I've had a first look at the paper by djb and Bo-Yin Yang,
> https://eprint.iacr.org/2019/266.pdf. Mainly focusing on the integer
> case.

Have you looked at https://eprint.iacr.org/2020/972.pdf, where the
author seems to suggest an even faster algorithm? Or at least it was
faster under heavy optimization under the assumption of what inputs the
algorithm received.


Best,
Albin


Re: Fast constant-time gcd computation and modular inversion

2022-05-24 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes:

> The new algorithm in the above paper uses a small auxiliary state
> variable d (or \delta in the paper) to decide the update in the case
> both a, b are odd. If I write the iteration step using the same notation
> as above, it's
>
>   if a even:
> a <-- a / 2, d <-- 1 + d
>
>   else if d <= 0   
> a <-- (a + b) / 2, d <-- 1 + d
>
>   else // d > 0
> [a, b] <-- [(b - a)/2, a], d <-- 1 - d 
>
> So conditions depend only on the sign of the auxillary d, and the least
> significant bit of a.

[...]

> In which way the decision logic based on sign of d guarantees progress
> is not clear to me (I haven't tried to read all the proofs), but I guess
> d should be understood as an approximation of bitsize(b) - bitsize(a)

The role of d is explained by the proof of G.1 (page 55 in the paper).
It can be understood as just a shift count.

To reduce the risk that I mess up, let's use the paper's notation. Then
the iteration operates on (d, f, g), f always odd.

  if g even:
g <-- g / 2, d <-- 1 + d

  else if d <= 0:
g <-- (g + f) / 2, d <-- 1 + d

  else // d > 0
[f, g] <-- [g, (g - f)/2], d <-- 1 - d
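
As a branchy single-word reference, just to make the case analysis concrete
(a sketch; the constant-time variants replace these branches with masks on
unsigned limbs):

#include <stdint.h>

/* One step of the recurrence above; f must be odd.  All divisions are
   exact, since the dividend is always even. */
static void
divstep (int64_t *d, int64_t *f, int64_t *g)
{
  if ((*g & 1) == 0)
    { *g /= 2; *d += 1; }
  else if (*d <= 0)
    { *g = (*g + *f) / 2; *d += 1; }
  else
    { int64_t t = *f; *f = *g; *g = (*g - t) / 2; *d = 1 - *d; }
}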

In the paper's notation, say f is odd, g is even, g = 2^e g' with g' odd,
e >= 1, and we start the algorithm with

  (d_0, f_0, g_0) = (1, f, g/2)

In the first e-1 steps, we take out the powers of 2 from g, and get

  (d_{e-1}, f_{e-1}, g_{e-1}) = (e, f, g')

Then we get to the both odd state and d > 0, so we get

  (d_e, f_e, g_e) = (1-e, g', (g' - f) / 2)

Now, d_e <= 0, and it remains so for the next e-1 steps. In these steps, we
keep reducing the (g' - f) term, adding g' whenever we have an odd value.
After a total of 2e steps, we're back at d = 1, and

  (d_{2e}, f_{2e}, g_{2e}) = (1, g', (q g' - f)/2^e)

where q is an odd number in the range {1, 3, 5, ..., 2^{k+1} - 1}.

So result after these 2e steps is that we're back at the initial value d
= 1. And the update of f and g is very similar to the 2-adic division
step in the Stehlé-Zimmermann algorithm (book-keeping of the powers of
two slightly different, there's a sign change, and quotient interval
isn't symmetric around zero).

By decomposing the 2-adic division into these 2e steps, the steps can be
implemented in a constant time fashion. And one gets a lot of flexibility
on how to split off the least significant part to work on.

Bounds on the worst case (which determines the iteration count when using it
for constant-time inverse) are based on computer-aided proofs. The paper
also claims that this choice of quotient interval rather than the
symmetric interval {1 - 2^e, 3 - 2^e, ..., -1, 1, ..., 2^e - 1} of
Stehlé-Zimmermann gives a better worst-case bound, and that the worst-case
analysis in that paper isn't quite right.

Regards,
/Niels

-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.