dean gaudet wrote:
| 
| the way P4 handles partial register writes is that it assumes that a write 
| to a partial register imposes a dependency on subsequent partial or full 
| reads or writes [...]

Ok.

FWIW, Opteron and Athlon 64 also behave like this.

| this problem doesn't look like it would affect your code... but since both 
| prescott and k8 have movzb* in all their ALUs you probably won't lose much 
| trying this.

The fact is that I have no partial-register writes that could be replaced
by full-register writes.

| the above behaviours are true of the "dothan" pentium-M (family 6 model 
| 13) as well -- which means it's likely to be true of all intel em64t 
| processors going forward... i.e. they got rid of the partial register 
| write stall.

  "the above behaviours (dependency on subsequent partial or full reads or
   writes) [...] are true [for] all intel em64t processors"

     contradicts

  "they got rid of the partial register write stall"

Do I understand you correctly?

| i would be suspicious about your use of rotate to accumulate 8 bytes... on 
| EM64T it might be better to do that in memory... or perhaps try using shl 
| and then following the inner loop with a bswap before storing.  (shl can 
| be pipelined properly if it has a staggered 64-bit alu.)
| oh also, try "sub $1" instead of dec :)

On AMD64, 'dec' is as fast as 'sub $1', but replacing 'ror' with 'shl+bswap'
is indeed about 1% faster (thanks, I have updated rc4-amd64).
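
For readers following along, here is a rough C sketch of why the two
accumulation schemes are interchangeable. It is not the rc4-amd64 code
itself: the helper names (ror8, bswap64, pack_ror, pack_shl_bswap) are
made up for illustration, and it only packs 8 bytes into a 64-bit word,
leaving out the RC4 state handling entirely.

```c
#include <stdint.h>

/* Hypothetical helper: rotate right by 8 bits (what 'ror $8' does). */
static uint64_t ror8(uint64_t x) { return (x >> 8) | (x << 56); }

/* Hypothetical helper: portable 64-bit byte swap (what 'bswap' does). */
static uint64_t bswap64(uint64_t x) {
    x = ((x & 0x00ff00ff00ff00ffULL) << 8)  | ((x >> 8)  & 0x00ff00ff00ff00ffULL);
    x = ((x & 0x0000ffff0000ffffULL) << 16) | ((x >> 16) & 0x0000ffff0000ffffULL);
    return (x << 32) | (x >> 32);
}

/* Variant 1: drop each byte into the low 8 bits, then rotate right.
 * After 8 steps b[0] sits in the least significant byte. */
static uint64_t pack_ror(const uint8_t b[8]) {
    uint64_t acc = 0;
    for (int i = 0; i < 8; i++) {
        acc |= b[i];
        acc = ror8(acc);
    }
    return acc;
}

/* Variant 2: shift left and append each byte, which leaves b[0] in the
 * most significant byte; one byte swap after the loop fixes the order. */
static uint64_t pack_shl_bswap(const uint8_t b[8]) {
    uint64_t acc = 0;
    for (int i = 0; i < 8; i++) {
        acc = (acc << 8) | b[i];
    }
    return bswap64(acc);
}
```

Both functions return the same word for any 8 input bytes, so the swap
costs one extra instruction per 8 bytes rather than one per byte.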

-- 
Marc Bevand                              http://epita.fr/~bevand_m
Computer Science School EPITA - System, Network and Security Dept.
______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       [EMAIL PROTECTED]
Automated List Manager                           [EMAIL PROTECTED]