Re: MONTMUL performance: t4 engine vs inlined t4

Andy Polyakov Fri, 31 May 2013 01:47:13 -0700

>> Another question is about suitability of floating-point fcmps and fmovd
>> instructions. These are used to pick a vector from powers table in
>> cache-timing neutral manner. I have to admit I haven't done due research
>> on whether or not they are the optimal choice in this context, and/or
>> whether or not we are better off using fand and for instructions for this
>> purpose.
>> As instructions in question are floating-point they might be executed by
>> *shared* FPU and not by individual core [which might be disruptive for
>> pipeline?]...
> 
> fcmps is 11 cycle latency and executes in the external FPU.
> 
> Likewise for floating point conditional moves of floating point registers.
> 
> Floating point conditional moves of integer registers is the worst, it
> is split into two micro-ops and it breaks the instruction decode
> group.
> 
> Plain fmovd you should never use, it goes into the external FPU
> because it affects the condition codes in the %fsr.  Use fsrc2 instead
> which has 1 cycle latency and executes in the front end of the CPU.


Thanks! Even though the question was inadequately formulated (it was not
about just fmovd, but about *conditional* fmovd on a floating-point
condition, sorry), I get the picture. The conclusion seems to be that we
should bet on the logical operations, fand and for, which have 3-cycle
latency and [more importantly?] are handled by private core resources.
Thanks again.
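
For the archive, here is a rough sketch of what "pick a vector with
logical operations" means, written in portable C rather than SPARC
assembly. It is not the code from the T4 module: the names ct_eq_mask
and ct_select and the flat table layout are made up for illustration.
Every table entry is read regardless of the index, and the wanted one is
kept with AND/OR against an all-ones or all-zeros mask, which is the
role fand and for would play on the floating-point register file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: all-ones if a == b, all-zeros otherwise,
 * computed without a data-dependent branch. */
static uint64_t ct_eq_mask(uint64_t a, uint64_t b)
{
    uint64_t x = a ^ b;                 /* zero iff a == b */
    /* (x | -x) has its top bit set iff x != 0 */
    return ((x | (0 - x)) >> 63) - 1;
}

/* Hypothetical helper: copy entry `idx` of a table of `entries`
 * vectors, each `words` 64-bit words long, into `out`.
 * All entries are touched in the same order for every idx, so the
 * memory access pattern does not depend on the (secret) index. */
static void ct_select(uint64_t *out, const uint64_t *table,
                      size_t entries, size_t words, size_t idx)
{
    assert(out != NULL && table != NULL);   /* public preconditions only */

    for (size_t w = 0; w < words; w++)
        out[w] = 0;

    for (size_t i = 0; i < entries; i++) {
        uint64_t mask = ct_eq_mask((uint64_t)i, (uint64_t)idx);
        for (size_t w = 0; w < words; w++)
            out[w] |= table[i * words + w] & mask;   /* AND, then OR */
    }
}
```

Whether a compiler keeps this branch-free is not guaranteed, which is
one more reason to do it in hand-written assembly.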

______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       [email protected]
Automated List Manager                           [email protected]
