> > PIII 700MHz
> > rsa 1024 bits   0.0100s   0.0005s    100.3   2021.9
> >
> > PIV 1.2GHz
> > rsa 1024 bits   0.0094s   0.0005s    106.9   2004.7
> >
> > ... the performance are more
> > or less the same (or
> > better for the supposed slower processor in some cases).

Haven't you figured out yet that the P4 sticks and stinks badly? It's
absolutely the worst IA-32 implementation ever!

> The P4 has a longer pipeline than the P3,

So what? Pipeline length shows only when you fail to predict branch
direction, and that's not the case here (at least mispredicted branch
penalties are a negligible portion here). Even instruction latency (it
takes the P4 over 3 times longer to produce the result of an integer
multiplication) is hardly the limiting factor here, as it apparently
has enough "shadow" registers available for aliasing to sustain the
larger latency. It's multiplication throughput that holds the P4 back.
The PIII is capable of issuing one multiplication per clock cycle,
while the P4 can schedule a multiplication only every 3rd cycle.

Now let's assume both out-of-order cores are perfect and succeed in
fully pipelining the code. We have one multiplication and 4 additions
per loop span to spread between the FPU (integer multiplications are
performed by the FPU) and two IALUs. This means that (at their very
best) the PIII is capable of spitting out one word of the result every
two cycles (4 adds for 2 IALUs), while the P4 needs three (because it
starves for the FPU, recall every 3rd cycle). This in turn means that
if both were operating at the same frequency, the PIII would beat the
P4 by a good 50%. Keep in mind that this is the best-case scenario; the
real-life figure doesn't have to be exactly that, but the sign and
order of magnitude are right here...

> AFAIK, the asm code in OpenSSL doesn't take advantage of SSE2,

No, it doesn't, and it won't unless some of you (OpenSSL users) come
forward and implement it. I however find it hard to believe it would go
*significantly* faster (if it goes faster at all). The catch is that
you can't schedule an SSE2 instruction every cycle, but only every 2nd
one.
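
The best-case estimate for the PIII and P4 given earlier (cycles per
word of the result) can be written down as a small model. This is only
a sketch under the same idealized assumptions as in the text: a
perfectly pipelined loop with one multiplication and 4 additions, two
IALUs, and a multiplier that accepts a new multiplication every 1 cycle
(PIII) or every 3 cycles (P4). The function and parameter names are
made up for illustration, and the per-clock comparison simply reuses
the sign/s figures quoted at the top of this message.

```python
from fractions import Fraction


def cycles_per_word(muls, adds, mul_issue_interval, n_ialus=2):
    """Best-case cycles per loop span for an idealized out-of-order core.

    Hypothetical model: the loop is limited either by the multiplier
    (one multiplication every `mul_issue_interval` cycles) or by the
    integer ALUs (`adds` additions shared between `n_ialus` units),
    whichever takes longer.
    """
    mul_bound = Fraction(muls * mul_issue_interval)
    add_bound = Fraction(adds, n_ialus)
    return max(mul_bound, add_bound)


# One multiplication and 4 additions per loop span, as in the text.
piii = cycles_per_word(muls=1, adds=4, mul_issue_interval=1)
p4 = cycles_per_word(muls=1, adds=4, mul_issue_interval=3)
predicted_ratio = p4 / piii  # PIII advantage at equal clock frequency

# Per-clock comparison of the quoted RSA-1024 results (sign/s per MHz).
piii_per_mhz = 100.3 / 700
p4_per_mhz = 106.9 / 1200
measured_ratio = piii_per_mhz / p4_per_mhz

print("PIII cycles/word:", piii)
print("P4 cycles/word:  ", p4)
print("predicted PIII advantage per clock:", float(predicted_ratio))
print("measured PIII advantage per clock: ", round(measured_ratio, 2))
```

The model gives 2 cycles per word for the PIII and 3 for the P4, i.e. a
factor of 1.5 at equal clock, while the quoted sign/s numbers work out
to a factor of about 1.61 per clock. That is not exactly the predicted
figure, but it has the same sign and order of magnitude.
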
You might come to believe that this effectively brings you back to the
PIII case (schedule one 32x32->64 multiplication per cycle), but the
problem is that you would have to perform a number of word permutations
on XMM registers and would very likely run out of SSE2 execution units
(those performing SSE2 instructions) before you get in line with the
PIII...

> nor is its scheduling optimized for the P4.

It's an out-of-order execution core, and code doesn't have to be
"scheduled just for P4"; blended code works just fine (in this case).
At least it would be foolish to expect that rescheduling would bring
the P4 in line with the PIII.

Andy.
______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       [EMAIL PROTECTED]
Automated List Manager                           [EMAIL PROTECTED]