[Bug target/110921] Relax _tzcnt_u32 support x86, all x86 arch support for this instruction

2023-08-07 Thread luoyonggang at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110921

--- Comment #10 from 罗勇刚(Yonggang Luo)  ---
(In reply to Hongtao.liu from comment #9)

> > Without `-mbmi` option, gcc can not compile and all other three compiler
> > can compile.
> 
> As long as it keeps semantics(respect zero input), I think this is
> acceptable.

Yes, it's acceptable, but consistency with Clang/MSVC/ICL would be better.
That would make cross-platform code easier to write. Besides, GCC also targets
WIN32, which requires GCC to be consistent with MSVC.

[Bug target/110921] Relax _tzcnt_u32 support x86, all x86 arch support for this instruction

2023-08-07 Thread luoyonggang at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110921

--- Comment #8 from 罗勇刚(Yonggang Luo)  ---
(In reply to Hongtao.liu from comment #7)
> No, I think what clang does is correct,

Thanks. Yes, according to https://github.com/llvm/llvm-project/issues/64477
I think clang did it well.

GCC also needs to handle the following code properly:

```c
#ifdef _MSC_VER
#include <intrin.h>
__forceinline void
unreachable() {__assume(0);}
#else
#include <x86intrin.h>
inline __attribute__((always_inline)) void
unreachable() {
#if defined(__INTEL_COMPILER)
    __assume(0);
#else
    __builtin_unreachable();
#endif
}
#endif

int f(int a)
{
    /* Tell the compiler that a zero input never reaches _tzcnt_u32. */
    if (a == 0) {
        unreachable();
    }
    return _tzcnt_u32(a);
}
```
According to https://godbolt.org/z/T9axzaPqj, gcc with the `-O2 -mbmi -m32`
options compiles it to:
```asm
f(int):
        xor     eax, eax
        tzcnt   eax, DWORD PTR [esp+4]
        ret
```

There is a redundant xor instruction.
Without the `-mbmi` option, gcc cannot compile this code, while all three
other compilers can.
This issue still makes sense; gcc can fix it in Clang's way.
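
One way to use `_tzcnt_u32` with GCC today without passing `-mbmi` for the whole translation unit might be to enable BMI for a single function with the `target` attribute and pick the path at run time. The following is only a minimal sketch under that assumption. The names `tzcnt_u32_bmi`, `count_trailing_zeros_u32`, and `HAVE_X86_TARGET_ATTR` are hypothetical, and this is not what the attached patch or Clang's headers do.

```c
#include <stdint.h>

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#include <x86intrin.h>
#define HAVE_X86_TARGET_ATTR 1  /* hypothetical feature macro */

/* Only this function is compiled with BMI enabled, so the always_inline
   intrinsic _tzcnt_u32 can be inlined here even without -mbmi. */
__attribute__((target("bmi")))
static uint32_t tzcnt_u32_bmi(uint32_t x)
{
    return _tzcnt_u32(x);
}
#endif

/* Hypothetical wrapper: returns the number of trailing zero bits,
   and 32 for a zero input (the TZCNT convention). */
uint32_t count_trailing_zeros_u32(uint32_t x)
{
#ifdef HAVE_X86_TARGET_ATTR
    /* Run-time check, so the BMI path is taken only on CPUs that have it. */
    if (__builtin_cpu_supports("bmi"))
        return tzcnt_u32_bmi(x);
#endif
    /* Portable fallback with the same result for every input. */
    if (x == 0)
        return 32u;
    uint32_t n = 0;
    while ((x & 1u) == 0) {
        x >>= 1;
        ++n;
    }
    return n;
}
```

The cost of this approach is a call that cannot be inlined into non-BMI callers, which is part of why relaxing the intrinsic itself is attractive.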

[Bug target/110921] Relax _tzcnt_u32 support x86, all x86 arch support for this instruction

2023-08-06 Thread luoyonggang at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110921

--- Comment #6 from 罗勇刚(Yonggang Luo)  ---
MSVC is also added to the comparison. clang seems to have an optimization
issue, but MSVC doesn't have that.


https://godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYTStJg1DIApACYAQuYukl9ZATwDKjdAGFUtAK4sGIMwDMpK4AMngMmAByPgBGmMQg0gAOqAqETgwe3r7%2BQSlpjgJhEdEscQnSdpgOGUIETMQEWT5%2BgbaY9oUMdQ0ExVGx8Ym29Y3NOW0Ko33hA2VDkgCUtqhexMjsHOYBeFRYVADUAPoAskJuRwBq2ABKJhoAgtvhyN5YByYBbuEExOEAdAhPth7k9Ah0lKDngxXl53p83KoABwANh%2BfwYgOBUPBDHwVChjx%2BByoEGJTFIB2JMUWUIA7FZHgdmSzmcRMAQ1gxjgQAF7IQRHLwBMwsiBMWkBRlPOkAEQ4y1onAArLw/BwtKRUJw3NZrAcFKt1pgPoEeKQCJoFcsANYSFH/DQBZUogJ0lEATkkkjpHpRkiCSo4kjVVq1nF4ChAGgtVuWcFgSDQLCSdHi5EoydT9ASADdkEkkkdc1wPUcDARMFMjqp/aQsLm8BtLnhMAB3ADySUYnHNNFoleIUYgMTDMXCDQAnr3eOPmMRJx2YtpqpbuLxk2xBB2GLRpxreFgYl5gG4xLQo%2Bv65gWIZgOID/W8Oyarmq2HMKpql5KzPyIIOjDWg8BiYgpw8LAw1%2BPAWBnZYqAMYAFBbdsux7K9%2BEEEQxHYKQZEERQVHUR9dDMfQ7xQPVLH0ECo0gZZUCSLpLwAWg7MxeFQN9iD%2BLA6IgZYqhqZwIFccY/C4YI8X6Upyj0fJ0gEcT5NSRSGBkwYEkkoSuh6MZPBaPQdNqaYNPmLSRl6ZTtNM2ZZKGLhBKNDYJEVFVQ0fbUOAOWtJAOFgFHzA4Sw9f4KyrAgDggXBCBIU0Akc3g1y0RZbQkSR/mVDQzC9JENC4FENDyuknX0TgQ1IdVNS8yNo1jA94xgRAUFQFM0zICgICzdqUAMIwjl%2BLwGBta9G2bVtO27dU%2BzoQdh1HR85ynP8loXJcVwcP9N0YAgdz3MMjxPM9aAvP8sFvIwH01fAX0cN9L01T9v1/K8fkAx9gNA8CME2TVoNg9d4MQ5CJrQ6beEw4RRHEPDIcItQw10SS%2BuMKibE%2B/iGKYjJWI7AJOO43j33o9pOgyFw8WsqT0DMuTJIUroqYZjJaYc0nVxMqyDJybSOg5gQ9JmEpNKM6YqamXpWa0py1hcxyyo4VVKrDLyfNRFj/QOV47yiwbhsWKKYqIYh4sS%2BqUrSswHTMLgNDpJEzEkKRlQK91pCDCqqs4iNbDq5KFVIBNmu6nMMy61rsyGJtkDMbLSxjBsm0wFDJvQmaB3ieaxwnBcVpzxdl1XLbWq3Xbd33a7MGPU9z0vc1zrvK7D2fDn7o/L9kB/TZzTeoNNU%2BsCFwg36kr%2BAGeCBpgkJTsG/0h7CYekOGlARkiQEDCjTEsawaJiTGtWxgRcY4rVCbwPj4EEvnhL8UTKe5iTqallSCgyJnVK6Z/ebJgWxYfozr66VssLcyosubZEfhLRoX8ZbGlcgrJWXtwzeWRCiDWflo4HFjjbf4oUNCG3wMbU2iwkpxjSlwJE/xUTKh9MqP0eUzD2yRArT2KsfZRhjP7Rq8AIBIFWAQJIP4w4h3iJEVgmxUHoIOMAZAyBdbECGjaEhwRCEkHPnoee0NcJL1kPDYimpSKkDbGBJIcEEEeWqpwDsP5BGRVQIcXy/lApyJCmFJglYphRQ8G1HM8UzDKK4WlOk/w6QUMkEiD0jtspmCRG6MiHsLHew4LVThcZA5NV4S1Hx6ZOoiISMALgIpRpJxnlNOes1M6UAWpqVa%2B5zS1PWkXK821tzlwOlXI6tczo3kbiPJ8t08Bt0fE9TuL0e4AT7rwAe31IKPn%2BmYhCU8QaoTKRhWQC9t
H4XkCvfROhhgo0otvaiGNL4H2YpwNiooWJlBYHgAm8Qib72MiJMS/9JKhDsiLemH835vNIMzIonzQHf35t0P%2BECAE/zBZLIFdNLL6QhTZGFIC6awLlm5RWiTkFqzQZrGRcjCn/FFNFVRJtthm0CaQO0BV/gojpfShlDL3blSxTVX2qSGoYpPkgtllLuJpGcJIIAA%3D

The demo code is:

#ifdef _MSC_VER
#include <intrin.h>
#else
#include <x86intrin.h>
#endif

int f(int a, int b)
{
    return _tzcnt_u32(a);
}

[Bug target/110921] New: Relax _tzcnt_u32 support x86, all x86 arch support for this instruction

2023-08-06 Thread luoyonggang at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110921

Bug ID: 110921
   Summary: Relax _tzcnt_u32 support x86, all x86 arch support for
this instruction
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: luoyonggang at gmail dot com
  Target Milestone: ---

Created attachment 55695
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55695&action=edit
A patch that fixes this issue in a brute-force way, just for demonstration

The compile error:

In file included from
C:/CI-Tools/msys64/mingw64/lib/gcc/x86_64-w64-mingw32/13.1.0/include/x86gprintrin.h:41,
 from
C:/CI-Tools/msys64/mingw64/lib/gcc/x86_64-w64-mingw32/13.1.0/include/x86intrin.h:27,
 from C:/CI-Tools/msys64/mingw64/include/intrin.h:69,
 from ../../src/amd/addrlib/src/core/addrcommon.h:33,
 from ../../src/amd/addrlib/src/core/addrobject.h:21,
 from ../../src/amd/addrlib/src/core/addrlib.h:21,
 from ../../src/amd/addrlib/src/core/addrlib2.h:20,
 from ../../src/amd/addrlib/src/gfx10/gfx10addrlib.h:19,
 from ../../src/amd/addrlib/src/gfx10/gfx10addrlib.cpp:16:
C:/CI-Tools/msys64/mingw64/lib/gcc/x86_64-w64-mingw32/13.1.0/include/bmiintrin.h:
In function 'unsigned int Addr::BitScanForward(unsigned int)':
C:/CI-Tools/msys64/mingw64/lib/gcc/x86_64-w64-mingw32/13.1.0/include/bmiintrin.h:116:1:
error: inlining failed in call to 'always_inline' 'unsigned int
_tzcnt_u32(unsigned int)': target specific option mismatch
  116 | _tzcnt_u32 (unsigned int __X)
  | ^~
../../src/amd/addrlib/src/core/addrcommon.h:343:23: note: called from here
  343 | out = ::_tzcnt_u32(mask);
  |   ^~
[113/] Compiling C++ object
src/amd/compiler/libaco.a.p/aco_insert_NOPs.cpp.obj

This code can be compiled with MSVC and Clang.

Reason:

Since the TZCNT instruction behaves as BSF on non-BMI targets, there is code
that expects to use it as a potentially faster version of BSF.

This is for alignment with Clang and MSVC.

Also, _mm_tzcnt_32 and _mm_tzcnt_64 are added for consistency with Clang and
MSVC.
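
The two instructions differ for a zero input: BSF leaves its destination undefined, while TZCNT returns the operand width (32). Code that treats them as interchangeable therefore has to handle zero itself. A minimal sketch of such a wrapper follows; `portable_tzcnt_u32` is a hypothetical name, not part of any compiler header.

```c
#include <stdint.h>

/* Hypothetical helper: trailing-zero count with TZCNT semantics,
   independent of whether the CPU or the compiler flags provide BMI. */
static inline uint32_t portable_tzcnt_u32(uint32_t x)
{
    if (x == 0)
        return 32u;  /* TZCNT convention: operand width for a zero input */
#if defined(__GNUC__) || defined(__clang__)
    /* __builtin_ctz is undefined for 0, which is handled above. */
    return (uint32_t)__builtin_ctz(x);
#else
    /* Plain loop for compilers without the builtin. */
    uint32_t n = 0;
    while ((x & 1u) == 0) {
        x >>= 1;
        ++n;
    }
    return n;
#endif
}
```

For example, `portable_tzcnt_u32(8)` yields 3 and `portable_tzcnt_u32(0)` yields 32 on every target.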

[Bug target/108191] Add support to usage of *intrin.h without -mavx512f -mavx512cd

2022-12-21 Thread luoyonggang at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108191

--- Comment #6 from 罗勇刚(Yonggang Luo)  ---
Is the following a valid usage? It compiles properly.

```

// compile args:  -fPIC -O2 -D__SSE3__=1 -D__SSSE3__=1 -D__SSE4_1__=1
// -D__SSE4_2__=1 -D__SSE4A__=1 -D__POPCNT__=1 -D__XSAVE__=1 -D__CRC32__=1
// -D__AVX__=1 -D__AVX2__=1 -D__FP_FAST_FMAF32=1 -D__FP_FAST_FMAF64=1
// -D__FP_FAST_FMAF=1 -D__FP_FAST_FMAF32x=1 -D__AVX512F__=1 -D__AVX512CD__=1
#include 

#pragma GCC push_options
#pragma GCC target("avx512f")
#pragma GCC target("avx512cd")
#pragma GCC target("sse4a")

#if defined(_MSC_VER)
#include <intrin.h>
#else
#include <x86intrin.h>
#endif

#pragma GCC pop_options


#pragma GCC push_options
#pragma GCC target("avx512f")
#pragma GCC target("avx512cd")
#pragma GCC target("sse4a")

void util_fadd_512(float *a, float *b, float *c) {
    /* c = a + b */
    __m512 av = _mm512_load_ps(a);
    __m512 bv = _mm512_load_ps(b);
    __m512 cv = _mm512_add_ps(av, bv);
    _mm512_store_ps(c, cv);
}
static inline int
util_iround(float f)
{
   __m128 m = _mm_set_ss(f);
   return _mm_cvtss_i32(m);
}

#pragma GCC pop_options

int util_iround_outside(int x, float y) {
   return x + util_iround(y);
}
float util_fadd(float a, float b) {
   return a + b;
}
```

[Bug target/108191] Add support to usage of *intrin.h without -mavx512f -mavx512cd

2022-12-20 Thread luoyonggang at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108191

--- Comment #4 from 罗勇刚(Yonggang Luo)  ---
(In reply to Richard Biener from comment #3)
> I suppose the issue will be that __attribute__((target)) isn't supported by
> MSVC?  But indeed this isn't something we are going to support.  Note
> another way is to put the functions into different translation units.

Supporting gcc is enough; there is no need to care about MSVC. MSVC supports
this without the attribute, and we can use a macro to deal with that.

[Bug target/108191] Add support to usage of *intrin.h without -mavx512f -mavx512cd

2022-12-20 Thread luoyonggang at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108191

--- Comment #2 from 罗勇刚(Yonggang Luo)  ---
(In reply to Jakub Jelinek from comment #1)
> You are lying to the compiler, don't.  In GCC you can #include 
> with SSE2 only and later in say __attribute__((target ("avx512cd")))
> function use avx512f/avx512cd intrinsics, no need to do the what you show
> above.

Can you be more specific and show me the code? Thanks :)
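
The quoted suggestion could look roughly like the sketch below: the intrinsics header is included while compiling for the baseline ISA, and only one function carries `__attribute__((target("avx512f")))`. This is an illustration under stated assumptions, not code from the bug report. The names `add16_avx512`, `add16_scalar`, `add16`, and `HAVE_AVX512_PATH` are hypothetical, and the run-time dispatch through `__builtin_cpu_supports` is one possible choice among several.

```c
#include <stddef.h>

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define HAVE_AVX512_PATH 1  /* hypothetical feature macro */

/* Only this function is compiled with AVX-512F enabled; the rest of the
   file keeps the baseline ISA chosen on the command line. */
__attribute__((target("avx512f")))
static void add16_avx512(const float *a, const float *b, float *out)
{
    __m512 av = _mm512_loadu_ps(a);  /* unaligned loads: no alignment assumed */
    __m512 bv = _mm512_loadu_ps(b);
    _mm512_storeu_ps(out, _mm512_add_ps(av, bv));
}
#endif

/* Baseline version, usable on any CPU. */
static void add16_scalar(const float *a, const float *b, float *out)
{
    for (size_t i = 0; i < 16; ++i)
        out[i] = a[i] + b[i];
}

/* out[i] = a[i] + b[i] for 16 floats, choosing the path at run time. */
void add16(const float *a, const float *b, float *out)
{
#ifdef HAVE_AVX512_PATH
    if (__builtin_cpu_supports("avx512f")) {
        add16_avx512(a, b, out);
        return;
    }
#endif
    add16_scalar(a, b, out);
}
```

With this layout no `-D__AVX512F__=1` style defines are needed, since the compiler itself knows which ISA each function targets.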

[Bug c/108191] New: Add support to usage of *intrin.h without -mavx512f -mavx512cd

2022-12-20 Thread luoyonggang at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108191

Bug ID: 108191
   Summary: Add support to usage of *intrin.h without -mavx512f
-mavx512cd
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: luoyonggang at gmail dot com
  Target Milestone: ---

This is to get the following command to work:
```
gcc -fPIC -O2 -D__SSE3__=1 -D__SSSE3__=1 \
-D__SSE4_1__=1 -D__SSE4_2__=1 -D__SSE4A__=1 \
-D__POPCNT__=1 -D__XSAVE__=1 -D__CRC32__=1 \
-D__AVX__=1 -D__AVX2__=1 \
-D__FP_FAST_FMAF32=1 \
-D__FP_FAST_FMAF64=1 \
-D__FP_FAST_FMAF=1 \
-D__FP_FAST_FMAF32x=1 \
-D__AVX512F__=1 -D__AVX512CD__=1 test.c
```
That is, code is generated for SSE2 only, and we can still use
#include 
by checking runtime flags.

Indeed, MSVC can already do that. If gcc also supported it, we could reduce
the use of inline assembly, which MSVC (x64) doesn't support, and so reduce
the code complexity.

The content of test.c is:
```
#if defined(_MSC_VER)
#include <intrin.h>
#else
#include <x86intrin.h>
#endif

#include 

static inline int
util_iround(float f)
{
   __m128 m = _mm_set_ss(f);
   return _mm_cvtss_i32(m);
}

int util_iround_outside(int x, float y) {
   return x + util_iround(y);
}
```

The compile error is something like:
```
In file included from
C:/CI-Tools/msys64/mingw64/lib/gcc/x86_64-w64-mingw32/12.2.0/include/immintrin.h:35,
 from
C:/CI-Tools/msys64/mingw64/lib/gcc/x86_64-w64-mingw32/12.2.0/include/x86intrin.h:32,
 from test.c:4:
C:/CI-Tools/msys64/mingw64/lib/gcc/x86_64-w64-mingw32/12.2.0/include/pmmintrin.h:
In function '_mm_addsub_ps':
C:/CI-Tools/msys64/mingw64/lib/gcc/x86_64-w64-mingw32/12.2.0/include/pmmintrin.h:53:3:
error: cannot convert a value of type 'int' to vector type '__vector(4) float'
which has different size
   53 |   return (__m128) __builtin_ia32_addsubps ((__v4sf)__X, (__v4sf)__Y);
  |   ^~
C:/CI-Tools/msys64/mingw64/lib/gcc/x86_64-w64-mingw32/12.2.0/include/pmmintrin.h:
In function '_mm_hadd_ps':
C:/CI-Tools/msys64/mingw64/lib/gcc/x86_64-w64-mingw32/12.2.0/include/pmmintrin.h:59:3:
error: cannot convert a value of type 'int' to vector type '__vector(4) float'
which has different size
   59 |   return (__m128) __builtin_ia32_haddps ((__v4sf)__X, (__v4sf)__Y);
  |   ^~
C:/CI-Tools/msys64/mingw64/lib/gcc/x86_64-w64-mingw32/12.2.0/include/pmmintrin.h:
In function '_mm_hsub_ps':
C:/CI-Tools/msys64/mingw64/lib/gcc/x86_64-w64-mingw32/12.2.0/include/pmmintrin.h:65:3:
error: cannot convert a value of type 'int' to vector type '__vector(4) float'
which has different size
   65 |   return (__m128) __builtin_ia32_hsubps ((__v4sf)__X, (__v4sf)__Y);
```