https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89597

Agner Fog <agner at agner dot org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |agner at agner dot org

--- Comment #1 from Agner Fog <agner at agner dot org> ---
I can confirm this. 

When compiling for a Win64 target, GCC version 9.2.0 (and earlier) returns
128-bit intrinsic vectors in XMM0, while 256-bit and 512-bit intrinsic vectors
are returned in memory through a pointer supplied by the caller. Clang, MS and
Intel compilers return all these vectors in registers (XMM0, YMM0 and ZMM0).

The Microsoft Windows documentation for x64 calling convention says:

"Non-scalar types including floats, doubles, and vector types such as __m128,
__m128i, __m128d are returned in XMM0."
(https://docs.microsoft.com/en-us/cpp/build/x64-calling-convention?view=vs-2019#return-values)

Obviously, this document needs to be updated, but the only logical
interpretation is that the wording "vector types such as __m128" includes
larger intrinsic vectors, which must necessarily be returned in YMM0 or ZMM0.

Test case:
```
// Build for a Win64 target with AVX-512 enabled (e.g. -mavx512f).
#include <immintrin.h>

__m128 square_x (__m128 x) {
    return _mm_mul_ps(x, x);
}

__m256 square_y (__m256 y) {
    return _mm256_mul_ps(y, y);
}

__m512 square_z (__m512 z) {
    return _mm512_mul_ps(z, z);
}
```

Disassembly (Intel syntax):
```
_Z8square_xDv4_f:; Function begin
        vmovaps xmm0, oword [rcx]
        vmulps  xmm0, xmm0, xmm0 
        ret                      
; _Z8square_xDv4_f End of function


_Z8square_yDv8_f:; Function begin
        vmovaps ymm0, yword [rdx]
        vmulps  ymm0, ymm0, ymm0 
        mov     rax, rcx         
        vmovaps yword [rcx], ymm0
        vzeroupper               
        ret                      
; _Z8square_yDv8_f End of function


_Z8square_zDv16_f:; Function begin
        vmovaps zmm0, zword [rdx]
        vmulps  zmm0, zmm0, zmm0 
        mov     rax, rcx         
        vmovaps zword [rcx], zmm0
        vzeroupper               
        ret                      
; _Z8square_zDv16_f End of function

```
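
To make the difference easier to see, here is a rough C++ model of the two return conventions for square_y. It is only a sketch: Vec8f is a hypothetical stand-in for __m256 so that it builds without AVX flags, both function names are made up for illustration, and the register names in the comments are read off the disassembly in this comment rather than enforced by the code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for __m256: eight floats, 32-byte aligned.
struct Vec8f {
    alignas(32) float lane[8];
};

// Models the GCC output for square_y: the caller passes the address of a
// result buffer (RCX) and the address of the argument (RDX). The function
// stores the result there and hands the same address back (RAX).
Vec8f* square_y_via_pointer(Vec8f* result, const Vec8f* y) {
    assert(result != nullptr && y != nullptr);
    for (std::size_t i = 0; i < 8; ++i) {
        result->lane[i] = y->lane[i] * y->lane[i];
    }
    return result;
}

// Models the Clang/MS/Intel output: the argument is still read through a
// pointer (RCX), but the result comes back by value (YMM0 in the real code).
Vec8f square_y_in_register(const Vec8f* y) {
    assert(y != nullptr);
    Vec8f r;
    for (std::size_t i = 0; i < 8; ++i) {
        r.lane[i] = y->lane[i] * y->lane[i];
    }
    return r;
}
```

A caller built for one convention and a callee built for the other disagree about where the result lives and about which register holds the argument pointer, which is why objects from GCC and from the other compilers cannot safely be mixed for such functions.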

The same test case (256-bit and 512-bit functions) compiled with Clang, MS or
Intel compilers:

```
_Z8square_yDv8_f:; Function begin
        vmovaps ymm0, yword [rcx]
        vmulps  ymm0, ymm0, ymm0 
        ret                      
; _Z8square_yDv8_f End of function


_Z8square_zDv16_f:; Function begin
        vmovaps zmm0, zword [rcx]
        vmulps  zmm0, zmm0, zmm0 
        ret                      
; _Z8square_zDv16_f End of function
```

And while we are at it: it would be nice if GCC could support __vectorcall
for Win64 targets (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89485).
