On Tue, 2005-04-05 at 18:17, Andy Polyakov wrote:
> >     Current OpenSSL (0.9.8-dev) rc4speed throughput on a Nocona (EM64T, 
> > 64-bit) 3.6GHz is 272Mb/s, while this version of RC4 code can achieve 
> > 536Mb/s in rc4speed.
> > 
> >     Would you please review it?
> 
> Cool conditional moves in unrolled loop. Have you considered/tried cmov 
> instead of a jump over a move instruction? No, there is no conditional move 
> with zero extension, but the upper part is kept zeroed, so that a byte 
> cmov should do... Well, I bet those jumps are seldom taken, so that 
> branch prediction logic can do a better job than cmov, but I have to ask :-)
> 
  Well, I tried using cmov here; it just slowed the throughput down a lot...
> Or how about moving movzb (%rdi,%r10),%r8d upwards as movzb 
> (%rdi,%r10),%r14b and making the inter-register move between r8 and r14 
> conditional?
> 
  I will try it.
> The reason I didn't attempt to unroll the RC4_CHAR loop was that I never 
> had access to EM64T hardware and simply mechanically ported P4 loop from 
> 32-bit implementation [where unrolling affected performance negatively] 
> and tested it for correctness on Opteron.
> 
> BTW, 272MBps at 3.6GHz? I get 262MBps out of [as just mentioned 
> virtually identical] 32-bit code at 2.4GHz P4... A.
> 
   
  In fact, your implementation on EM64T isn't that slow if 
  we change the inc and dec to add and sub. :) 

  With that change the throughput rises from 272Mb/s to 396Mb/s. 

  I have not investigated the 32-bit P4 path yet, 
  but you should see a performance gain on P4 with this change. 
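
  A sketch of the kind of substitution meant by "inc and dec to add 
  and sub". The registers and the label below are hypothetical, chosen 
  only for illustration, and are not copied from the actual RC4 module:

```asm
	# Hypothetical loop tail, before the change.
	# inc/dec update ZF, SF, OF, AF and PF but leave CF untouched.
	inc	%r10b			# advance the byte-wide index
	dec	%r11			# count down the remaining bytes
	jnz	.Lloop			# loop while the counter is non-zero

	# Same loop tail after the change.
	# add/sub write all of the arithmetic flags, CF included.
	add	$1,%r10b		# advance the byte-wide index
	sub	$1,%r11			# count down the remaining bytes
	jnz	.Lloop			# loop while the counter is non-zero
```

  Both forms leave the same values in the registers and set ZF the same 
  way, so the jnz behaves identically. The likely reason for the speed 
  difference is that inc and dec preserve CF, which on P4-class cores 
  can make them depend on the previous flags value, while add and sub 
  overwrite every arithmetic flag and carry no such dependency.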
  
  Zou Nan hai

______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       [email protected]
Automated List Manager                           [EMAIL PROTECTED]
