Hi all,

@ Matt: Your patch improved it by just a second or two on my rig, so that isn't the bottleneck in my case.

@ Peter: I have trouble compiling your code:
In file included from fp_mul_comba.c:365:
fp_mul_comba_small_set.i: In function `fp_mul_comba_small':
fp_mul_comba_small_set.i:1225: internal error--unrecognizable insn:
(insn 91872 37787 37788 (set (mem:SI (plus:SI (plus:SI (reg:SI 11 fp)
                    (const_int -4096 [0xfffff000]))
                (const_int -740 [0xfffffd1c])) 0)
        (reg:SI 7 r7)) -1 (nil)
    (nil))
make[1]: *** [fp_mul_comba.o] Error 1
It's likely due to my configuration, but could also be a compatibility issue in your code. Just thought you might want to know.

Unfortunately I'm pressed for time and must go with openssl for the time being. It's a monster size-wise, but at 10s it's usable (just). If space becomes an even more of a factor, be assured however that I will revisit dropbear.

Thank you all for your input and efforts to help.

Kind regards/Magnus

On 2011-03-17 16:23, Matt Johnston wrote:
On Wed, Mar 16, 2011 at 07:16:34PM -0500, Rob Landley wrote:
On 03/16/2011 02:25 AM, Peter Turczak wrote:
Hi Magnus, hi Rob,

a while ago I made the same observations you did. On an m68k-nommu
with 166 MHz the RSA exchange took quite forever. After some
profiling I found out the comba multiply routine in libtommath was
eating most of the time. It seems gcc produces quite inefficient code
there. Libtommath resizes its large integers while calculating
leading to more work for user memory management.
User mememory management?  It's got a malloc/free in an inner loop?  BARF!

(Yeah, that'll blow your L1 cache wide open and slow stuff down by at
least an order of magnitude.  Allocation functions are some of the most
cache unfriendly things you can do, pretty much by definition.  Unused
memory is not cache hot, pretty much by definition.  That's sort of the
point.  Copying the data sucks too, but it's doing the copying on all
platforms I'd guess...)
I guess it's possible. Logging in to a server with 1024 bit
RSA and RSA authorized_key I get ~229 reallocs in mp_grow(),
not a massive number if spread over 45 seconds. The patch
below drops it to ~30 reallcs.

Magnus: It might be worth seeing if it changes your
timing. I haven't looked whether it increases memory usage.

Matt

--- libtommath/bn_mp_exptmod_fast.c     5a692f134deeab0992612206c16f8bf970b5088c
+++ libtommath/bn_mp_exptmod_fast.c     5391873ccf8a11171774425c69f584195b4fdba4
@@ -67,13 +67,13 @@ int mp_exptmod_fast (mp_int * G, mp_int

    /* init M array */
    /* init first cell */
-  if ((err = mp_init(&M[1])) != MP_OKAY) {
+  if ((err = mp_init_size(&M[1], P->used)) != MP_OKAY) {
       return err;
    }

    /* now init the second half of the array */
    for (x = 1<<(winsize-1); x<  (1<<  winsize); x++) {
-    if ((err = mp_init(&M[x])) != MP_OKAY) {
+    if ((err = mp_init_size(&M[x], P->alloc+1)) != MP_OKAY) {
        for (y = 1<<(winsize-1); y<  x; y++) {
          mp_clear (&M[y]);
        }
@@ -96,7 +96,7 @@ int mp_exptmod_fast (mp_int * G, mp_int

       /* automatically pick the comba one if available (saves quite a few 
calls/ifs) */
  #ifdef BN_FAST_MP_MONTGOMERY_REDUCE_C
-     if (((P->used * 2 + 1)<  MP_WARRAY)&&
+     if (((P->alloc * 2 + 1)<  MP_WARRAY)&&
            P->used<  (1<<  ((CHAR_BIT * sizeof (mp_word)) - (2 * DIGIT_BIT)))) 
{
          redux = fast_mp_montgomery_reduce;
       } else
@@ -133,7 +133,7 @@ int mp_exptmod_fast (mp_int * G, mp_int
    }

    /* setup result */
-  if ((err = mp_init (&res)) != MP_OKAY) {
+  if ((err = mp_init_size (&res, P->used)) != MP_OKAY) {
      goto LBL_M;
    }

============================================================
--- libtommath/bn_mp_init_copy.c        fd7c20c0ee3473615de23c59074cf5c6757a20ca
+++ libtommath/bn_mp_init_copy.c        841949a75e387e818f2f4d9adedff0ba9c9374c0
@@ -20,7 +20,7 @@ int mp_init_copy (mp_int * a, mp_int * b
  {
    int     res;

-  if ((res = mp_init (a)) != MP_OKAY) {
+  if ((res = mp_init_size (a, b->used)) != MP_OKAY) {
      return res;
    }
    return mp_copy (b, a);
============================================================
--- libtommath/bn_mp_mod.c      3bed12926c4d019853f2b4dac814a7505580380e
+++ libtommath/bn_mp_mod.c      9265cd0294d2c86f1c3c73eaa5bf19c30403e13b
@@ -22,7 +22,7 @@ mp_mod (mp_int * a, mp_int * b, mp_int *
    mp_int  t;
    int     res;

-  if ((res = mp_init (&t)) != MP_OKAY) {
+  if ((res = mp_init_size (&t, b->used)) != MP_OKAY) {
      return res;
    }

============================================================
--- libtommath/bn_mp_mulmod.c   935d0f5903589ddf62f42fc691cb2f83aa2832c4
+++ libtommath/bn_mp_mulmod.c   ef9063432e3a0c62b7118dfc3d01d04cd4dc8bb9
@@ -21,7 +21,7 @@ int mp_mulmod (mp_int * a, mp_int * b, m
    int     res;
    mp_int  t;

-  if ((res = mp_init (&t)) != MP_OKAY) {
+  if ((res = mp_init_size (&t, c->used)) != MP_OKAY) {
      return res;
    }


Reply via email to