Hello,

First of all, I would like to thank you for a great piece of software.

Now I would like to ask for your help or advice with a problem I am trying
to solve.
In our software there were mysterious cases in which a 2D floating-point
origin was not restored properly. The problem showed up only at the highest
optimization levels and only on some Intel CPUs.

The code demonstrating the issue and the corresponding assembly look as
follows:

driver.adjustOrigin( p.x, p.y );
    mov         rax,qword ptr [rsi]
    movapd      xmm2,xmm12
    movapd      xmm1,xmm11
    mov         rcx,rsi
    call        qword ptr [rax+0B8h]
driver.render( *this );
    mov         r11,qword ptr [rsi]
    mov         rdx,rbx
    mov         rcx,rsi
    call        qword ptr [r11+1D8h]
driver.adjustOrigin( -p.x, -p.y );
    mov         r11,qword ptr [rsi]
    mov         rcx,rsi
    xorpd       xmm12,xmm6
    xorpd       xmm11,xmm6
    movapd      xmm2,xmm12
    movapd      xmm1,xmm11
    call        qword ptr [r11+0B8h]

The problem was that on some processors the XMM11 and XMM12 registers lost
their values (were zeroed) during the driver.render( *this ) method call.
Such behavior was weird, as the XMM6 to XMM15 registers are specified as
non-volatile in the Win64 ABI calling convention
(see Win64 ABI Register usage
<http://msdn.microsoft.com/en-us/library/9z1stfyw.aspx>
or Software consequences of extending XMM to YMM
<http://software.intel.com/en-us/forums/topic/301853>).

Another odd thing was that the problem appeared only on an Intel® Xeon®
Processor E5-1607 <http://ark.intel.com/products/64619/> (with the AVX
extension set), but not on an older Intel® Xeon® Processor E5530
<http://ark.intel.com/products/37103/> (no AVX extension).

After digging deeper I realized that the OpenSSL SHA1_Update() function
zeroed the registers.

More specifically, the code zeroing the registers was in the
HASH_BLOCK_DATA_ORDER macro:

sha1_block_data_order_avx()
    movaps      xmmword ptr [rsp+40h],xmm6
    movaps      xmmword ptr [rsp+50h],xmm7
    movaps      xmmword ptr [rsp+60h],xmm8
    movaps      xmmword ptr [rsp+70h],xmm9
    movaps      xmmword ptr [rsp+80h],xmm10
    mov         r8,rdi
    mov         r9,rsi
    mov         r10,rdx
    vzeroall

The VZEROALL instruction zeroes all of XMM0-XMM15 (the full YMM registers),
while only XMM6-XMM10 are saved to the stack (and restored later on).
On CPUs without AVX support, sha1_block_data_order_ssse3() gets
called instead and everything works with no problems.
I noticed that *Andy Polyakov* recently added the AVX2+BMI code path
<http://git.openssl.org/gitweb/?p=openssl.git;a=commit;h=cd8d7335afcdef97312e05a9bd29b17a00796f48>,
which uses the VZEROUPPER instruction instead of VZEROALL.
The question is whether the VZEROUPPER instruction should be used
in the AVX code path as well.
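
To make the register accounting above explicit, here is a minimal C++
sketch of my reading of it. It is not OpenSSL code; the names RegSet,
xmmRange and lostRegisters are made up for illustration, and the sets fed
into it are simply the ones described in this message.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>

// Bit i set means "XMM<i> belongs to this set" (hypothetical helper type).
using RegSet = std::bitset<16>;

// Build the set XMM<first>..XMM<last>, inclusive.
RegSet xmmRange(std::size_t first, std::size_t last) {
    assert(first <= last && last < 16);
    RegSet s;
    for (std::size_t i = first; i <= last; ++i) {
        s.set(i);
    }
    return s;
}

// Registers a callee destroys from the caller's point of view:
// clobbered by the callee, callee-saved according to the ABI,
// but not saved and restored by the callee's prologue/epilogue.
RegSet lostRegisters(const RegSet& clobbered,
                     const RegSet& nonVolatile,
                     const RegSet& saved) {
    return clobbered & nonVolatile & ~saved;
}
```

With clobbered = XMM0-XMM15 (VZEROALL), nonVolatile = XMM6-XMM15 (Win64)
and saved = XMM6-XMM10 (the prologue shown above), lostRegisters() yields
XMM11-XMM15, which includes the two registers our caller relied on.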

As stated in "Calling conventions for different C++ compilers and operating
systems <http://www.agner.org/optimize/calling_conventions.pdf>":

> Functions that use YMM registers should issue the instruction VZEROUPPER
> or VZEROALL before calling any ABI compliant function and before
> returning to any ABI compliant function. VZEROUPPER is used if the ABI
> specifies that some of the XMM registers must be preserved (64-bit
> Windows) or if an XMM register is used for parameter transfer or return
> value. VZEROALL can optionally be used instead of VZEROUPPER in other
> cases.
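
The difference between the two instructions, as I understand it, can be
modelled in a few lines of C++. This is only a software model of the
register file in 64-bit mode, not real AVX code; the types Ymm and RegFile
and the function names are assumptions chosen for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// One 256-bit YMM register modelled as four 64-bit lanes.
// Lanes 0-1 are the low 128 bits, i.e. the XMM register.
struct Ymm {
    std::array<std::uint64_t, 4> lane{};
};

// YMM0..YMM15 as available in 64-bit mode.
using RegFile = std::array<Ymm, 16>;

// Model of VZEROALL: every bit of every YMM register becomes zero,
// so the XMM halves are lost as well.
void vzeroall(RegFile& regs) {
    for (Ymm& y : regs) {
        y.lane.fill(0);
    }
}

// Model of VZEROUPPER: only bits 255:128 are cleared,
// so the XMM halves (lanes 0-1) keep their values.
void vzeroupper(RegFile& regs) {
    for (Ymm& y : regs) {
        y.lane[2] = 0;
        y.lane[3] = 0;
    }
}

// Low 64 bits of XMM<i>, with a bounds check.
std::uint64_t xmmLow(const RegFile& regs, std::size_t i) {
    assert(i < regs.size());
    return regs[i].lane[0];
}
```

In this model a value held in XMM11 survives vzeroupper() but not
vzeroall(), which matches what we observed around the render() call.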

My knowledge of assembly language and CPU architecture is very limited, so
please excuse any naive questions or wrong conclusions I may have
made.

Could you please comment on this, or possibly advise a workaround
(something like the "no-sse2" config option)?

Thank you very much.

Kind regards,
Petr Filipsky
