Hi Peter,

Thanks for your reply, that's very informative.

I tried openssl and got down to around the 10s you're talking about. It
doesn't leave room for much else, though. I haven't found any real specs for
the CPU - the manufacturer mentions the type and size of cache in their other
CPUs but not in this one, so I fear the worst. I'll try your converted code
and see where that brings me.

Kind regards
/Magnus

Quoting Peter Turczak <[email protected]>:

> Hi Magnus,
> hi Rob,
>
> A while ago I made the same observations you did. On an m68k-nommu at 166
> MHz the RSA exchange took nearly forever. After some profiling I found that
> the comba multiply routine in libtommath was eating most of the time; it
> seems gcc produces quite inefficient code there. Libtommath also resizes
> its large integers while calculating, which means more work for user-space
> memory management. I therefore converted dropbear to use libtomsfastmath,
> which helped a lot at the expense of a larger memory footprint. After
> porting some parts to assembler (which libtomsfastmath has special hooks
> for) I cut the time down to 10 sec, which is IMHO much better.
>
> The version I did was more a proof of concept and is not shiny and
> packaged, but it will compile - maybe you could have a look at it:
> http://peter.turczak.de/dropbear-tfm.tgz
>
> Rob is right in a way, but openssl uses assembler all along. Furthermore,
> a missing L1 cache will contribute to slowing the key exchange to a crawl.
>
> Best regards,
>
> Peter
>
> On Mar 15, 2011, at 10:25 PM, Rob Landley wrote:
>
> > On 03/15/2011 08:02 AM, Magnus Nilsson wrote:
> >> Sorry, I was unclear - it's only 100% busy during those 45s.
> >>
> >> This is what it looks like if I first start the load monitor (-r
> >> outputs 1 sample/second), then start to log in from a remote ssh
> >> client:
> >>
> >> # cpu -r
> >> CPU: busy 0% (system=0% user=0% nice=0% idle=100%)
> >> CPU: busy 24% (system=4% user=19% nice=0% idle=75%)
> >> CPU: busy 100% (system=1% user=98% nice=0% idle=0%)
> >> CPU: busy 100% (system=0% user=100% nice=0% idle=0%)
> >> <39 repeats of the above busy 100%>
> >> CPU: busy 100% (system=2% user=97% nice=0% idle=0%)
> >> CPU: busy 100% (system=8% user=91% nice=0% idle=0%)
> >> CPU: busy 100% (system=22% user=77% nice=0% idle=0%)
> >> CPU: busy 100% (system=0% user=100% nice=0% idle=0%)
> >> CPU: busy 100% (system=0% user=100% nice=0% idle=0%)
> >> CPU: busy 67% (system=8% user=58% nice=0% idle=32%)
> >> CPU: busy 0% (system=0% user=0% nice=0% idle=100%)
> >>
> >> Thanks for the tip on the prebuilt busybox, Rob, but I would need it in
> >> flat format. I don't think arm-elf-elf2flt can do that without reloc
> >> info, or can it? And from the above I don't think it would add much
> >> info.
> >>
> >> My question is:
> >> Is 45s reasonable on a 192MHz cpu,
> >
> > No. I had a 200MHz Celeron that did 3.2 ssh logins per second ten years
> > ago. (I did a VPN built on top of ssh, dynamic port forwarding, and
> > netcat, and had to benchmark it.) Going from i686 to arm could cost you
> > some performance (ever since the Pentium it's had multiple execution
> > units, speculative execution, instruction reordering and such), but
> > there's no _way_ that's more than an order of magnitude in performance.
> > I could see 4 seconds, but 45 seconds is pathological. Something is
> > wrong.
> >
> > My next step would be "stick printfs in the source code and see where
> > the big delay is".
> >
> > Hmmm... Do they still _make_ CPUs with no L1 cache?
> >
> > Rob
> >

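For context on why Peter's libtomsfastmath conversion helps: comba multiplication computes the product column by column, so with fixed-size operands there is no mid-calculation resizing and the carry lives in one wide accumulator instead of being propagated per partial product. A hedged sketch over 32-bit digits - an illustrative reimplementation with an assumed operand size, not libtommath's or libtomsfastmath's actual code:

```c
#include <stdint.h>

#define NDIGITS 8                 /* fixed operand size: 8 x 32-bit = 256-bit */

typedef uint32_t digit_t;
typedef uint64_t word_t;          /* wide accumulator type */

/* r[0..2*NDIGITS-1] = a * b, computed column by column (comba style).
 * Each column k sums the products a[i] * b[k-i]; only the low digit is
 * emitted, and the rest is carried into the next column. */
static void comba_mul(digit_t r[2 * NDIGITS],
                      const digit_t a[NDIGITS],
                      const digit_t b[NDIGITS])
{
    word_t acc = 0;               /* low 64 bits of the running column sum */
    word_t ovf = 0;               /* count of 64-bit overflows out of acc  */

    for (int k = 0; k < 2 * NDIGITS - 1; k++) {
        int lo = k < NDIGITS ? 0 : k - NDIGITS + 1;
        int hi = k < NDIGITS ? k : NDIGITS - 1;

        for (int i = lo; i <= hi; i++) {
            word_t p = (word_t)a[i] * b[k - i];
            acc += p;
            if (acc < p)          /* unsigned wrap: record the 2^64 carry */
                ovf++;
        }
        r[k] = (digit_t)acc;      /* emit the low digit of this column */
        acc = (acc >> 32) | (ovf << 32);  /* carry the rest forward */
        ovf = 0;
    }
    r[2 * NDIGITS - 1] = (digit_t)acc;    /* final top digit */
}
```

Because the digit count is a compile-time constant, the inner loops have fixed bounds and no allocation, which is exactly the kind of code that is worth hand-tuning in assembler on a slow m68k or ARM core.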