Or how about moving movzb (%rdi,%r10),%r8d upwards as movzb (%rdi,%r10),%r14d and making the inter-register move between %r8 and %r14 conditional?
I will try it.
I have tried it; no performance gain.
Does that mean it's the same, or that it's slower? Was it a cmov, or a jump over the mov instruction? BTW, what are the latency/throughput figures for Intel cmov anyway? I can't find the information anywhere...
Another question: why are the rotations 32-bit? Did you try 64-bit rotations and find them slow? If so, by how much?
You may wonder why all these questions. I want to understand the code well enough to make it regular enough to express the unrolled assembler loop in Perl loop terms. That would make it easier for us to maintain, and I'm even ready to sacrifice a few percent of performance for more regular-looking code.
BTW, 272MBps at 3.6GHz? I get 262MBps out of the [as just mentioned, virtually identical] 32-bit code on a 2.4GHz P4... A.
In fact, your implementation on EM64T isn't that slow if we change the inc and dec instructions to add and sub. :)
With that change the throughput jumps from 272MBps to 396MBps.
For *now* I'm committing only this change to CVS and will have a closer look at the unrolled loop later on [some time next week]. BTW, there is another idea I'd like to try, so I'm likely to send you some code for benchmarking on EM64T hardware. A.
______________________________________________________________________
OpenSSL Project http://www.openssl.org
Development Mailing List [email protected]
Automated List Manager [EMAIL PROTECTED]
