https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89597

Agner Fog <agner at agner dot org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |agner at agner dot org

--- Comment #1 from Agner Fog <agner at agner dot org> ---
I can confirm this. When compiling for a Win64 target, gcc version 9.2.0 (and earlier) returns 128-bit intrinsic vectors in XMM0, while 256-bit and 512-bit intrinsic vectors are returned through a pointer. Clang, MS and Intel compilers return all these vectors in registers.

The Microsoft Windows documentation for x64 calling convention says:
"Non-scalar types including floats, doubles, and vector types such as __m128, __m128i, __m128d are returned in XMM0."
(https://docs.microsoft.com/en-us/cpp/build/x64-calling-convention?view=vs-2019#return-values)

Obviously, this document needs to be updated, but the only logical interpretation is that the wording "vector types such as __m128" includes larger intrinsic vectors, which must necessarily be returned in YMM0 or ZMM0.
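
To make the difference explicit, the behaviour described in this comment can be written down as a small lookup. This is only a toy model of the observations reported here, not compiler source code. The names `ReturnLoc`, `gcc_observed`, `others_observed` and `abi_mismatch` are made up for illustration, and sizes other than 16, 32 and 64 bytes are deliberately not modelled.

```cpp
// Toy model of the Win64 return-value behaviour reported in this comment.
// All identifiers are hypothetical; nothing here is taken from a compiler.
#include <cstddef>

enum class ReturnLoc { XMM0, YMM0, ZMM0, HiddenPointer, NotModelled };

// Observed with gcc 9.2.0 (and earlier) for Win64 targets:
// 128-bit vectors come back in XMM0, 256-bit and 512-bit vectors
// are written through a pointer supplied by the caller.
constexpr ReturnLoc gcc_observed(std::size_t vector_bytes) {
    return vector_bytes == 16 ? ReturnLoc::XMM0
         : (vector_bytes == 32 || vector_bytes == 64) ? ReturnLoc::HiddenPointer
         : ReturnLoc::NotModelled;
}

// Observed with Clang, MS and Intel compilers:
// every intrinsic vector type is returned in a register.
constexpr ReturnLoc others_observed(std::size_t vector_bytes) {
    return vector_bytes == 16 ? ReturnLoc::XMM0
         : vector_bytes == 32 ? ReturnLoc::YMM0
         : vector_bytes == 64 ? ReturnLoc::ZMM0
         : ReturnLoc::NotModelled;
}

// True when the two groups of compilers disagree for a given vector size,
// i.e. when mixing object files across compilers would break.
constexpr bool abi_mismatch(std::size_t vector_bytes) {
    return gcc_observed(vector_bytes) != others_observed(vector_bytes);
}

// Compile-time checks of the model against the statements in this comment.
static_assert(!abi_mismatch(16), "__m128: both groups use XMM0");
static_assert(abi_mismatch(32),  "__m256: pointer vs. YMM0");
static_assert(abi_mismatch(64),  "__m512: pointer vs. ZMM0");
```

In this model only the 128-bit case is interoperable. A function returning `__m256` or `__m512` that is compiled with gcc cannot be called correctly from code built with one of the other compilers, and vice versa.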

Test case:
```
#include <immintrin.h>

__m128 square_x (__m128 x) {
    return _mm_mul_ps(x, x);
}

__m256 square_y (__m256 y) {
    return _mm256_mul_ps(y, y);
}

__m512 square_z (__m512 z) {
    return _mm512_mul_ps(z, z);
}
```

Disassembly (Intel syntax):
```
_Z8square_xDv4_f:; Function begin
        vmovaps xmm0, oword [rcx]
        vmulps  xmm0, xmm0, xmm0
        ret
; _Z8square_xDv4_f End of function

_Z8square_yDv8_f:; Function begin
        vmovaps ymm0, yword [rdx]
        vmulps  ymm0, ymm0, ymm0
        mov     rax, rcx
        vmovaps yword [rcx], ymm0
        vzeroupper
        ret
; _Z8square_yDv8_f End of function

_Z8square_zDv16_f:; Function begin
        vmovaps zmm0, zword [rdx]
        vmulps  zmm0, zmm0, zmm0
        mov     rax, rcx
        vmovaps zword [rcx], zmm0
        vzeroupper
        ret
; _Z8square_zDv16_f End of function
```

Same, compiled with Clang, MS or Intel compilers:
```
_Z8square_yDv8_f:; Function begin
        vmovaps ymm0, yword [rcx]
        vmulps  ymm0, ymm0, ymm0
        ret
; _Z8square_yDv8_f End of function

_Z8square_zDv16_f:; Function begin
        vmovaps zmm0, zword [rcx]
        vmulps  zmm0, zmm0, zmm0
        ret
; _Z8square_zDv16_f End of function
```
...

And while we are at it: It would be nice if you could support __vectorcall for win64 targets (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89485)