Hello, first of all I would like to thank you for a great piece of software.
Now I would like to ask you for help or advice with a problem I am trying to solve.

In our software there were mysterious cases where a 2D floating-point origin was not restored properly. The problem showed up only at the highest optimization levels and only on some kinds of Intel CPUs. The code demonstrating the issue and the corresponding assembly look as follows:

    driver.adjustOrigin( p.x, p.y );
        mov     rax,qword ptr [rsi]
        movapd  xmm2,xmm12
        movapd  xmm1,xmm11
        mov     rcx,rsi
        call    qword ptr [rax+0B8h]
    driver.render( *this );
        mov     r11,qword ptr [rsi]
        mov     rdx,rbx
        mov     rcx,rsi
        call    qword ptr [r11+1D8h]
    driver.adjustOrigin( -p.x, -p.y );
        mov     r11,qword ptr [rsi]
        mov     rcx,rsi
        xorpd   xmm12,xmm6
        xorpd   xmm11,xmm6
        movapd  xmm2,xmm12
        movapd  xmm1,xmm11
        call    qword ptr [r11+0B8h]

The problem was that on some processors the XMM11 and XMM12 registers lost their values (were zeroed) during the driver.render( *this ) method call. This was surprising, because the XMM6 to XMM15 registers are specified as non-volatile in the Win64 ABI calling convention (see Win64 ABI Register usage<http://msdn.microsoft.com/en-us/library/9z1stfyw.aspx> or Software consequences of extending XMM to YMM<http://software.intel.com/en-us/forums/topic/301853>).

Another odd thing was that the problem showed up only on the Intel® Xeon® Processor E5-1607 <http://ark.intel.com/products/64619/> (AVX extension set), but not on the older Intel® Xeon® Processor E5530<http://ark.intel.com/products/37103/> (no AVX extension).

After digging deeper I found that the SHA1_Update() OpenSSL function zeroed the registers.
More specifically, the code zeroing the registers is reached through the HASH_BLOCK_DATA_ORDER macro, in sha1_block_data_order_avx():

    sha1_block_data_order_avx()
        movaps  xmmword ptr [rsp+40h],xmm6
        movaps  xmmword ptr [rsp+50h],xmm7
        movaps  xmmword ptr [rsp+60h],xmm8
        movaps  xmmword ptr [rsp+70h],xmm9
        movaps  xmmword ptr [rsp+80h],xmm10
        mov     r8,rdi
        mov     r9,rsi
        mov     r10,rdx
        vzeroall

The VZEROALL instruction zeroes all of XMM0 to XMM15, while only XMM6 to XMM10 are saved to the stack (and restored later on). The non-volatile registers XMM11 to XMM15 are therefore returned to the caller zeroed. On CPUs without AVX support, sha1_block_data_order_ssse3() gets called instead and everything works without problems.

I noticed that *Andy Polyakov* recently added the AVX2+BMI code path<http://git.openssl.org/gitweb/?p=openssl.git;a=commit;h=cd8d7335afcdef97312e05a9bd29b17a00796f48>, which uses the VZEROUPPER instruction instead of VZEROALL. My question is whether VZEROUPPER should be used in the AVX code path as well. As stated in "Calling conventions for different C++ compilers and operating systems <http://www.agner.org/optimize/calling_conventions.pdf>":

> Functions that use YMM registers should issue the instruction VZEROUPPER
> or VZEROALL before calling any ABI compliant function and before
> returning to any ABI compliant function. VZEROUPPER is used if the ABI
> specifies that some of the XMM registers must be preserved (64-bit
> Windows) or if an XMM register is used for parameter transfer or return
> value. VZEROALL can optionally be used instead of VZEROUPPER in other
> cases.

My knowledge of assembly language and CPU architecture is very limited, so please excuse any naive questions or wrong conclusions I may have made. Could you please look into this and comment on it, or possibly advise a workaround (something like the "no-sse2" config option)?

Thank you very much.

Kind regards,
Petr Filipsky