> > PIII 700MHz
> > rsa 1024 bits   0.0100s   0.0005s    100.3   2021.9
> >
> > PIV 1.2GHz
> > rsa 1024 bits   0.0094s   0.0005s    106.9   2004.7
> >
> > ... the performance is more or less the same (or better for the
> > supposedly slower processor in some cases).

Haven't you figured out yet that P4 sticks and stinks badly? It's
absolutely the worst IA-32 implementation ever!

> The P4 has a longer pipeline than the P3,

So what? Pipeline length shows when you fail to predict branch
direction, and that's not the case here (at least mispredicted branch
penalties are a negligible portion here). Even instruction latency (it
takes P4 over 3 times longer to produce the result of an integer
multiplication) is hardly the limiting factor here, as it apparently has
enough "shadow" registers available for aliasing to sustain the larger
latency. It's multiplication throughput that holds P4 back. PIII is
capable of issuing one multiplication per clock cycle, while P4 can
schedule a multiplication only every 3rd cycle. Now let's assume both
out-of-order cores are perfect and succeed in fully pipelining the code.
We have one multiplication and 4 additions per loop iteration to spread
between the FPU (integer multiplications are performed by the FPU) and
two IALUs. This means that (at their very best) PIII is capable of
spitting out one word of the result in two cycles (4 adds for 2 IALUs),
while P4 needs three (because it starves for the FPU, recall every 3rd
cycle). This in turn means that if both were operating at the same
frequency, PIII would beat P4 by a good 50%. Keep in mind that it's the
best-case scenario; the real-life figure doesn't have to be exactly
that, but the sign and order of magnitude are right here...
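
The arithmetic above can be sketched as a tiny best-case model. This is
only an illustration: cycles_per_word and its parameter names are made
up for this sketch, the unit counts and issue intervals are the ones
stated above, and the "measured" ratio assumes the third column of the
quoted benchmark is private-key operations per second.

```python
def cycles_per_word(mul_issue_interval, muls=1, adds=4, ialus=2):
    """Best-case cycles per result word for a perfectly pipelined loop.

    The loop is bound by whichever resource is busiest: the multiplier
    (one multiplication every `mul_issue_interval` cycles) or the
    integer ALUs (`adds` additions shared among `ialus` units).
    """
    mul_cycles = muls * mul_issue_interval
    add_cycles = adds / ialus
    return max(mul_cycles, add_cycles)


piii = cycles_per_word(mul_issue_interval=1)  # bound by the 4 adds on 2 IALUs
p4 = cycles_per_word(mul_issue_interval=3)    # bound by the multiplier

# Per-clock advantage of PIII over P4 predicted by the model.
model_ratio = p4 / piii

# Per-clock ratio taken from the quoted figures: operations per second
# divided by clock rate in MHz (100.3 at 700MHz vs 106.9 at 1200MHz).
measured_ratio = (100.3 / 700) / (106.9 / 1200)

print(piii, p4, model_ratio, round(measured_ratio, 2))
```

The model gives 2 and 3 cycles per word, i.e. a per-clock ratio of 1.5,
while the quoted figures work out to roughly 1.61 per clock, so the
sign and order of magnitude agree.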

> AFAIK, the asm code in OpenSSL doesn't take advantage of SSE2,

No, it doesn't, and it won't unless some of you (OpenSSL users) come
forward and implement it. I however find it hard to believe it will go
*significantly* faster (if it goes faster at all). The catch is that you
can't schedule an SSE2 instruction every cycle, only every 2nd one. You
might come to believe that it effectively brings you back to the PIII
case (schedule one 32x32->64 multiplication per cycle), but the problem
is that you would have to perform a number of word permutations on XMM
registers and would very likely run out of SSE2 execution units (those
performing SSE2 instructions) before you get in line with PIII...
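
A rough sketch of why the permutations hurt, under loudly stated
assumptions: one SSE2 multiply instruction yields two 32x32->64
products, every shuffle competes for the same "every 2nd cycle" SSE2
issue slot, and the number of shuffles needed per multiply is a free
parameter, not a measured figure. cycles_per_mul is a hypothetical
helper, not anything from OpenSSL.

```python
def cycles_per_mul(shuffles_per_mul_instr, sse2_issue_interval=2,
                   muls_per_instr=2):
    """Best-case cycles per 32x32->64 multiplication on the SSE2 path.

    Assumption: each shuffle occupies a full SSE2 issue slot, just like
    the multiply instruction it accompanies.
    """
    sse2_instrs = 1 + shuffles_per_mul_instr
    return sse2_instrs * sse2_issue_interval / muls_per_instr


# No shuffles at all: one multiplication per cycle, the PIII case.
no_shuffles = cycles_per_mul(0)

# One or two shuffles per multiply instruction eat the advantage.
one_shuffle = cycles_per_mul(1)
two_shuffles = cycles_per_mul(2)

print(no_shuffles, one_shuffle, two_shuffles)
```

Under these assumptions two shuffles per multiply instruction already
put you back at one multiplication every 3 cycles, which is where the
non-SSE2 code is today.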

> nor is its scheduling optimized for the P4.

It's an out-of-order execution core, and code doesn't have to be
"scheduled just for P4"; blended code works just fine (in this case). At
least it would be foolish to expect that it will bring it in line with
PIII.

Andy.

______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       [EMAIL PROTECTED]
Automated List Manager                           [EMAIL PROTECTED]
