Re: MONTMUL performance: t4 engine vs inlined t4

David Miller Fri, 31 May 2013 01:35:42 -0700

From: Andy Polyakov <[email protected]>
Date: Fri, 31 May 2013 10:29:37 +0200


> Another question is about suitability of floating-point fcmps and fmovd
> instructions. These are used to pick a vector from powers table in
> cache-timing neutral manner. I have to admit I haven't done due research
> whether or not they are optimal choice in the context, and/or whether or
> not we are better off using fand and for instructions for this purpose.
> As instructions in question are floating-point they might be executed by
> *shared* FPU and not by individual core [which might be disruptive for
> pipeline?]...

fcmps has 11 cycle latency and executes in the external FPU.

Likewise for floating point conditional moves of floating point registers.

Floating point conditional moves of integer registers are the worst:
each one is split into two micro-ops and breaks the instruction decode
group.

Plain fmovd you should never use; it goes into the external FPU
because it affects the condition codes in the %fsr.  Use fsrc2 instead,
which has 1 cycle latency and executes in the front end of the cpu.
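
For reference, the fand/for alternative raised in the quoted text
amounts to mask-based selection: read every entry of the powers table
and keep only the wanted one with AND/OR, so the memory access pattern
does not depend on the index.  A minimal portable C sketch of that idea
follows.  It is not the OpenSSL code and not T4 assembly; the names
ct_eq_mask and ct_select and the flat table layout are assumptions made
for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: all-ones if a == b, else zero, with no branch. */
static uint64_t ct_eq_mask(uint64_t a, uint64_t b)
{
    uint64_t x = a ^ b;                 /* zero iff a == b */
    uint64_t nz = (x | (0 - x)) >> 63;  /* 1 iff x != 0 */
    return nz - 1;                      /* 0 -> all ones, 1 -> zero */
}

/*
 * Hypothetical helper: copy entry idx of a table holding `entries`
 * vectors of `words` 64-bit words each into out.  Every entry is read
 * regardless of idx; the unwanted ones are masked to zero.
 */
static void ct_select(uint64_t *out, const uint64_t *table,
                      size_t entries, size_t words, size_t idx)
{
    for (size_t w = 0; w < words; w++)
        out[w] = 0;

    for (size_t i = 0; i < entries; i++) {
        uint64_t mask = ct_eq_mask((uint64_t)i, (uint64_t)idx);
        for (size_t w = 0; w < words; w++)
            out[w] |= table[i * words + w] & mask;
    }
}
```

Whether a compiler keeps this branch-free on a given target has to be
checked in the generated code; the sketch only shows the data flow.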
______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       [email protected]
Automated List Manager                           [email protected]
