dean gaudet wrote:
|
| the way P4 handles partial register writes is that it assumes that a write
| to a partial register imposes a dependency on subsequent partial or full
| reads or writes [...]
Ok.
FWIW, Opteron and Athlon 64 also behave like this.
| this problem doesn't look like it would affect your code... but since both
| prescott and k8 have movzb* in all their ALUs you probably won't lose much
| trying this.
The fact is that I have no partial-register writes that could be replaced
by full-register writes.
| the above behaviours are true of the "dothan" pentium-M (family 6 model
| 13) as well -- which means it's likely to be true of all intel em64t
| processors going forward... i.e. they got rid of the partial register
| write stall.
"the above behaviours (dependency on subsequent partial or full reads or
writes) [...] are true [for] all intel em64t processors"
contradicts
"they got rid of the partial register write stall"
Do I understand you correctly?
| i would be suspicious about your use of rotate to accumulate 8 bytes... on
| EM64T it might be better to do that in memory... or perhaps try using shl
| and then following the inner loop with a bswap before storing. (shl can
| be pipelined properly if it has a staggered 64-bit alu.)
| oh also, try "sub $1" instead of dec :)
On AMD64, 'dec' is as fast as 'sub $1', but replacing 'ror' with 'shl+bswap'
is indeed about 1% faster (thanks, I have updated rc4-amd64).
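
To illustrate why the two forms are interchangeable, here is a minimal C
sketch. It is not the rc4-amd64 code itself. It assumes that the inner loop
produces one byte per iteration, writes it into the low byte of a 64-bit
accumulator, and stores the accumulator as a little-endian word after eight
iterations. The names pack_ror, pack_shl_bswap, ror8 and bswap64 are
hypothetical helpers introduced only for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Rotate a 64-bit value right by 8 bits (the effect of 'ror $8, %reg'). */
static uint64_t ror8(uint64_t x)
{
    return (x >> 8) | (x << 56);
}

/* Portable 64-bit byte swap (the effect of 'bswap %reg'). */
static uint64_t bswap64(uint64_t x)
{
    x = ((x & 0x00FF00FF00FF00FFULL) << 8)  | ((x >> 8)  & 0x00FF00FF00FF00FFULL);
    x = ((x & 0x0000FFFF0000FFFFULL) << 16) | ((x >> 16) & 0x0000FFFF0000FFFFULL);
    return (x << 32) | (x >> 32);
}

/* Variant 1 (assumed original scheme): write the low byte, then rotate.
 * After eight rounds b[0] ends up in bits 0..7 and b[7] in bits 56..63. */
uint64_t pack_ror(const uint8_t b[8])
{
    uint64_t acc = 0;
    for (size_t i = 0; i < 8; i++) {
        acc = (acc & ~(uint64_t)0xFF) | b[i]; /* partial (low-byte) write */
        acc = ror8(acc);
    }
    return acc;
}

/* Variant 2: shift left and insert the byte, then one bswap after the loop.
 * Before the swap b[0] is in bits 56..63; the swap restores the order
 * produced by variant 1. */
uint64_t pack_shl_bswap(const uint8_t b[8])
{
    uint64_t acc = 0;
    for (size_t i = 0; i < 8; i++)
        acc = (acc << 8) | b[i];
    return bswap64(acc);
}
```

Both functions return the same word for any input, so the only difference
is which instructions sit inside the loop: eight rotates versus eight
shifts plus a single byte swap outside it. Whether that is faster on a
given CPU has to be measured.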
--
Marc Bevand http://epita.fr/~bevand_m
Computer Science School EPITA - System, Network and Security Dept.
______________________________________________________________________
OpenSSL Project http://www.openssl.org
Development Mailing List