[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics

2024-05-25 Thread slyfox at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161

--- Comment #24 from Sergei Trofimovich  ---
Thank you, Alexander! Tricky `REG_EQUAL` makes sense.

[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics

2024-05-25 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161

--- Comment #23 from Alexander Monakov  ---
(In reply to Sergei Trofimovich from comment #22)
> Here `pcmpeqd %xmm2,%xmm1` is a problematic instruction. Why does `gcc` use
> `%xmm2` (result of `cvttps2dq`) instead of, say `%xmm0` which contains
> `0x` pattern?

%xmm0 contains the first all-ones argument to assert_eq. The comparison needs
to happen against a register that contains 0x7fff_ (x4), and GCC "thinks"
that %xmm2 contains that value. At -O1, the optimization is done by the RTL CSE
pass (you can confirm by switching it off with -fdisable-rtl-cse1 along -O1),
it sees

(insn 6 5 7 2 (set (reg:V4SF 97)
(mem/u/c:V4SF (symbol_ref/u:DI ("*.LC1") [flags 0x2]) [0  S16 A128]))
"/usr/lib/gcc/x86_64-pc-linux-gnu/10.2.0/include/xmmintrin.h":386:19 1335
{movv4sf_internal}
 (expr_list:REG_EQUAL (const_vector:V4SF [
(const_double:SF 4.294967296e+9 [0x0.8p+33]) repeated x4
])
(nil)))

(insn 10 9 11 2 (set (reg:V4SI 98)
(fix:V4SI (reg:V4SF 97)))
"/usr/lib/gcc/x86_64-pc-linux-gnu/10.2.0/include/emmintrin.h":832:19 2635
{fix_truncv4sfv4si2}
 (expr_list:REG_EQUAL (const_vector:V4SI [
(const_int 2147483647 [0x7fff]) repeated x4
])
(expr_list:REG_DEAD (reg:V4SF 99)
(nil

Note the incorrect REG_EQUAL caused by inappropriate use of fix:V4SI where the
source code had cvttps2dq.

[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics

2024-05-24 Thread slyfox at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161

--- Comment #22 from Sergei Trofimovich  ---
(In reply to Sergei Trofimovich from comment #21)

gcc generates the following code for this C code:

> int main() {
>   const __m128i su = _mm_set1_epi32(0x4f80);
>   const __m128  sf = _mm_castsi128_ps(su);
> 
>   const __m128  overflow_mask_f32 = _mm_cmpge_ps(sf,
> _mm_set1_ps(2147483648.0f));
>   const __m128i overflow_mask = _mm_castps_si128(overflow_mask_f32);
> 
>   const __m128i conv = _mm_cvttps_epi32(sf); // overflows
>   const __m128i yes = _mm_set1_epi32(INT32_MAX);
> 
>   const __m128i a = _mm_and_si128(overflow_mask, yes);
>   const __m128i na = _mm_andnot_si128(overflow_mask, conv);
> 
>   const __m128i conv_masked = _mm_or_si128(a, na);

Dump of assembler code for function main:
   0x00401020 <+0>: sub$0x8,%rsp
   0x00401024 <+4>: movss  0xfdc(%rip),%xmm2# 0x402008
   0x0040102c <+12>:movss  0xfd0(%rip),%xmm0# 0x402004
   0x00401034 <+20>:movss  0xfd0(%rip),%xmm3# 0x40200c
   0x0040103c <+28>:shufps $0x0,%xmm2,%xmm2
   0x00401040 <+32>:shufps $0x0,%xmm0,%xmm0
   0x00401044 <+36>:cmpleps %xmm2,%xmm0
   0x00401048 <+40>:cvttps2dq %xmm2,%xmm2
   0x0040104c <+44>:shufps $0x0,%xmm3,%xmm3
   0x00401050 <+48>:movdqa %xmm0,%xmm1
   0x00401054 <+52>:andps  %xmm3,%xmm0
   0x00401057 <+55>:pandn  %xmm2,%xmm1
   0x0040105b <+59>:por%xmm0,%xmm1

All of this all looks fine.

>   const __m128i actual = _mm_cmpeq_epi32(conv_masked,
> _mm_set1_epi32(INT32_MAX));
>   const __m128i expected = _mm_set1_epi32(-1);

   0x0040105f <+63>:pcmpeqd %xmm0,%xmm0
   0x00401063 <+67>:pcmpeqd %xmm2,%xmm1
   0x00401067 <+71>:call   0x401160 <_ZL9assert_eqDv2_xS_>

Here `pcmpeqd %xmm2,%xmm1` is a problematic instruction. Why does `gcc` use
`%xmm2` (result of `cvttps2dq`) instead of, say `%xmm0` which contains
`0x` pattern?

>   assert_eq(expected, actual);
> }
> ```

[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics

2024-05-23 Thread slyfox at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161

--- Comment #21 from Sergei Trofimovich  ---
Shrunk the example down to a single simpler function while preserving the
original masking intent:

```c
cat bug.cc
#include 
#include 
#include 

__attribute__((noipa))
static void assert_eq_p(void * l, void * r) {
char lb[16];
char rb[16];

__builtin_memcpy(lb, l, 16);
__builtin_memcpy(rb, r, 16);

if (__builtin_memcmp(lb, rb, 16) != 0) __builtin_trap();
}

__attribute__((noipa))
static void assert_eq(__m128i l, __m128i r) { assert_eq_p(&l, &r); }

int main() {
  const __m128i su = _mm_set1_epi32(0x4f80);
  const __m128  sf = _mm_castsi128_ps(su);

  const __m128  overflow_mask_f32 = _mm_cmpge_ps(sf,
_mm_set1_ps(2147483648.0f));
  const __m128i overflow_mask = _mm_castps_si128(overflow_mask_f32);

  const __m128i conv = _mm_cvttps_epi32(sf); // overflows
  const __m128i yes = _mm_set1_epi32(INT32_MAX);

  const __m128i a = _mm_and_si128(overflow_mask, yes);
  const __m128i na = _mm_andnot_si128(overflow_mask, conv);

  const __m128i conv_masked = _mm_or_si128(a, na);

  const __m128i actual = _mm_cmpeq_epi32(conv_masked,
_mm_set1_epi32(INT32_MAX));
  const __m128i expected = _mm_set1_epi32(-1);

  assert_eq(expected, actual);
}
```

The discrepancy:

Ok:
$ /tmp/gb/gcc/xg++ -Wall -B/tmp/gb/gcc bug.cc -o bug -O0 && ./bug
Bad:
$ /tmp/gb/gcc/xg++ -Wall -B/tmp/gb/gcc bug.cc -o bug -O2 && ./bug
Illegal instruction (core dumped)

[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics

2024-05-22 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161

--- Comment #20 from Alexander Monakov  ---
(In reply to Jakub Jelinek from comment #19)
> If we guarantee that we never constant fold FIX/UNSIGNED_FIX with
> -ftrapping-math (we shouldn't, as the exceptions should be raised), then
> using FIX/UNSIGNED_FIX is ok in
> flag_trapping_math guarded patterns even if the intrinsics have some
> specific value they want in those cases, because they will never be folded.

No, this sounds illogical. The compiler may fold out-of-range conversions just
fine, it just needs to preserve the original operation (with its result now
dead), or use any other opportunity to ensure that exceptions are still raised.
Specifically, it is okay to transform

float f;
int i;

f = 0x0.8p+33f;
i = (int) f;

to

f = 0x0.8p+33f;
dummy = (int) f;
i = INT_MAX; // or whatever

The "known" value of 'i' may then facilitate further folding.

[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics

2024-05-22 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161

--- Comment #19 from Jakub Jelinek  ---
(In reply to Alexander Monakov from comment #18)
> No, allowing value-changing transformations under -ftrapping-math is really
> not appropriate. Invoking the intrinsic on a large floating-point value is
> not UB.

If we guarantee that we never constant fold FIX/UNSIGNED_FIX with
-ftrapping-math (we shouldn't, as the exceptions should be raised), then using
FIX/UNSIGNED_FIX is ok in
flag_trapping_math guarded patterns even if the intrinsics have some specific
value they want in those cases, because they will never be folded.
While for !flag_trapping_math, we should use UNSPEC because FIX/UNSIGNED_FIX
can be folded and would be folded to whatever the generic code decides, which
might disagree to what the intrinsics need.

[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics

2024-05-22 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161

--- Comment #18 from Alexander Monakov  ---
No, allowing value-changing transformations under -ftrapping-math is really not
appropriate. Invoking the intrinsic on a large floating-point value is not UB.

[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics

2024-05-22 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161

--- Comment #17 from Jakub Jelinek  ---
I don't think the cost of using UNSPEC would be significant if the backend
tried to constant fold more target builtins.  Anyway, with the proposed changes
perhaps you could keep using FIX/UNSIGNED_FIX for flag_trapping_math case even
for the intrinsics and use UNSPECs only for !flag_trapping_math.

[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics

2024-05-22 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161

--- Comment #16 from Hongtao Liu  ---

> 
> That said, this change really won't help the backend which supposedly should
> have the same behavior regardless of -fno-trapping-math, because in that
> case it is the value
> of the result (which is unspecified by the standards) rather than whether an
> exception is triggered or not.
First, I agree with you, they're 2 separate issues.
What I proposed is just trying to find a balance that makes it possible not to
refine all cvtt* instructions to UNSPEC, because that would lose a lot of
optimization opportunities.
If it can be restricted under flag_trapping_math, at least those intrinsics are
fine at O2/O3.

[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics

2024-05-22 Thread slyfox at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161

--- Comment #15 from Sergei Trofimovich  ---
> I shrunk bug.cc slightly further into shorter-bug.cc and now it fails
> equally on gcc-13 and gcc-15. I agree that gcc-15 just got more constant
> folds available now, but otherwise it's behaviour did not change.
> 
> But is it correct?

Specifically -O0 works:

  $ /tmp/gb/gcc/xg++ -Wall -B/tmp/gb/gcc bug.cc -o bug -O0 && ./bug
overflow(i): 0x 0x

And -O2 crashes:

  $ /tmp/gb/gcc/xg++ -Wall -B/tmp/gb/gcc bug.cc -o bug -O2 && ./bug
overflow(i): 0x 0x
  Illegal instruction (core dumped)

[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics

2024-05-22 Thread slyfox at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161

--- Comment #14 from Sergei Trofimovich  ---
Created attachment 58265
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58265&action=edit
shorter-bug.cc

I shrunk bug.cc slightly further into shorter-bug.cc and now it fails equally
on gcc-13 and gcc-15. I agree that gcc-15 just got more constant folds
available now, but otherwise it's behaviour did not change.

But is it correct?

[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics

2024-05-22 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161

Jakub Jelinek  changed:

   What|Removed |Added

 CC||jsm28 at gcc dot gnu.org

--- Comment #13 from Jakub Jelinek  ---
(In reply to Hongtao Liu from comment #11)
> (In reply to Jakub Jelinek from comment #10)
> > Any of the floating point to integer intrinsics if they have out of range
> > value (haven't checked whether floating point to unsigned intrinsic is a
> > problem too or not).
> > No matter if it is float or double (dunno if _Float16 too, or __bf16), and
> > no matter if it is scalar intrinsic (ss/sd etc.) or vector and how many
> > vector elements.
> > But, this isn't really a regression, GCC has always behaved that way, the
> > only thing that actually changed is that perhaps we can constant fold more
> > than we used to do in the past.
> > When not using intrinsics, IMNSHO we should keep doing what we did before.
> 
> Can we restrict them under flag_trapping_math?

If we do that, we should do it on the fold-const.cc side as well.  There we
fold it but add TREE_OVERFLOW flag in the overflow cases to the resulting tree,
whether it prevents folding or not during GIMPLE passes would need to be
figured out.

C23 in F.4 says:
If the integer type is bool, 6.3.1.2 applies and the conversion raises no
floating-point exceptions if the floating-point value is not a signaling NaN.
Otherwise, if the floating value is infinite or NaN or if the integral part of
the floating value exceeds the range of the integer type, then the "invalid"
floating-point exception is raised and the resulting value is unspecified.
Otherwise, the resulting value is determined by 6.3.1.4. Conversion of an
integral floating value that does not exceed the range of the integer type
raises no floating-point exceptions; whether conversion of a non-integral
floating value raises the "inexact" floating-point exception is unspecified.

That said, this change really won't help the backend which supposedly should
have the same behavior regardless of -fno-trapping-math, because in that case
it is the value
of the result (which is unspecified by the standards) rather than whether an
exception is triggered or not.

[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics

2024-05-21 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161

Andrew Pinski  changed:

   What|Removed |Added

   See Also||https://gcc.gnu.org/bugzill
   ||a/show_bug.cgi?id=11666

--- Comment #12 from Andrew Pinski  ---
So GCC behavior changed in 2003 to match more of what Java requires ...

[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics

2024-05-21 Thread liuhongt at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161

--- Comment #11 from Hongtao Liu  ---
(In reply to Jakub Jelinek from comment #10)
> Any of the floating point to integer intrinsics if they have out of range
> value (haven't checked whether floating point to unsigned intrinsic is a
> problem too or not).
> No matter if it is float or double (dunno if _Float16 too, or __bf16), and
> no matter if it is scalar intrinsic (ss/sd etc.) or vector and how many
> vector elements.
> But, this isn't really a regression, GCC has always behaved that way, the
> only thing that actually changed is that perhaps we can constant fold more
> than we used to do in the past.
> When not using intrinsics, IMNSHO we should keep doing what we did before.

Can we restrict them under flag_trapping_math?

.i.e

diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc
index 53f54d1d392..b7a770dad60 100644
--- a/gcc/simplify-rtx.cc
+++ b/gcc/simplify-rtx.cc
@@ -2256,14 +2256,25 @@ simplify_const_unary_operation (enum rtx_code code,
machine_mode mode,
   switch (code)
{
case FIX:
+ /* According to IEEE standard, for conversions from floating point to
+integer. When a NaN or infinite operand cannot be represented in
+the destination format and this cannot otherwise be indicated, the
+invalid operation exception shall be signaled. When a numeric
+operand would convert to an integer outside the range of the
+destination format, the invalid operation exception shall be
+signaled if this situation cannot otherwise be indicated.  */
  if (REAL_VALUE_ISNAN (*x))
-   return const0_rtx;
+   return flag_trapping_math ? NULL_RTX : const0_rtx;
+
+ if (REAL_VALUE_ISINF (*x) && flag_trapping_math)
+   return NULL_RTX;

  /* Test against the signed upper bound.  */
  wmax = wi::max_value (width, SIGNED);
  real_from_integer (&t, VOIDmode, wmax, SIGNED);
  if (real_less (&t, x))
-   return immed_wide_int_const (wmax, mode);
+   return (flag_trapping_math
+   ? NULL_RTX : immed_wide_int_const (wmax, mode));

  /* Test against the signed lower bound.  */
  wmin = wi::min_value (width, SIGNED);
@@ -2276,13 +2287,17 @@ simplify_const_unary_operation (enum rtx_code code,
machine_mode mode,

case UNSIGNED_FIX:
  if (REAL_VALUE_ISNAN (*x) || REAL_VALUE_NEGATIVE (*x))
-   return const0_rtx;
+   return flag_trapping_math ? NULL_RTX : const0_rtx;
+
+ if (REAL_VALUE_ISINF (*x) && flag_trapping_math)
+   return NULL_RTX;

  /* Test against the unsigned upper bound.  */
  wmax = wi::max_value (width, UNSIGNED);
  real_from_integer (&t, VOIDmode, wmax, UNSIGNED);
  if (real_less (&t, x))
-   return immed_wide_int_const (wmax, mode);
+   return (flag_trapping_math
+   ? NULL_RTX : immed_wide_int_const (wmax, mode));

  return immed_wide_int_const (real_to_integer (x, &fail, width),

[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics

2024-05-21 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161

--- Comment #10 from Jakub Jelinek  ---
Any of the floating point to integer intrinsics if they have out of range value
(haven't checked whether floating point to unsigned intrinsic is a problem too
or not).
No matter if it is float or double (dunno if _Float16 too, or __bf16), and no
matter if it is scalar intrinsic (ss/sd etc.) or vector and how many vector
elements.
But, this isn't really a regression, GCC has always behaved that way, the only
thing that actually changed is that perhaps we can constant fold more than we
used to do in the past.
When not using intrinsics, IMNSHO we should keep doing what we did before.

[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics

2024-05-21 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161

--- Comment #9 from Jakub Jelinek  ---
In that case we should separate *.md patterns which are used for C conversions
from the patterns used by the intrinsics, keep using what we are right now for
the former and either use UNSPEC (the quickest but not best code generating
variant) or something more complex.  If UNSPEC is used, we could get some
constant folding back by adding gimple_fold handling for those and the like of
__builtin_ia32_psubusb128 or __builtin_ia32_pminub128.
For __builtin_ia32_cvttps2dq etc. obviously the folding should either punt
folding if some argument is out of range or fold those to what the hw does.

[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics

2024-05-21 Thread slyfox at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161

--- Comment #8 from Sergei Trofimovich  ---
Thank you, Jakub! 

> The reason the testcase FAILs is the same as in the other PRs, it is trying 
> to convert {0x0.8p+33f, 0x0.8p+33f, 0x0.8p+33f, 0x0.8p+33f} V4SFmode vector 
> to V4SImode, and because the backend sees the constant operand of the fix, it 
> folds it to the unspecified value as with scalar conversion.

To be super-clear: the problem is the out-of-range value, not just any
V4SFmode->V4SImode, right?

Specifically, float32{INT32_MAX} -> int32_t should be fine, right?

I was trying to extract the following example (and likely failed):

It tries very hard not to pass anything outside float32{INT32_MAX} to
(a different) `PromoteTo()` at the end of the function from
https://github.com/google/highway/blob/2270e77d0d0ccc1d6bc7393f0ebb0b6352ddfd00/hwy/ops/x86_128-inl.h#L10275

 HWY_API VFromD PromoteTo(D di64, VFromD> v) {
  const Rebind di32;
  const RebindToFloat df32;
  const RebindToUnsigned du32;
  const Repartition du32_as_du8;

  const auto exponent_adj = BitCast(
  du32,
  Min(SaturatedSub(BitCast(du32_as_du8, ShiftRight<23>(BitCast(du32, v))),
   BitCast(du32_as_du8, Set(du32, uint32_t{157}))),
  BitCast(du32_as_du8, Set(du32, uint32_t{32};
  const auto adj_v =
  BitCast(df32, BitCast(du32, v) - ShiftLeft<23>(exponent_adj));

  const auto f32_to_i32_result = ConvertTo(di32, adj_v);
  const auto lo64_or_mask = PromoteTo(
  di64,
  BitCast(du32, VecFromMask(di32, Eq(f32_to_i32_result,
 Set(di32, LimitsMax());

  return Or(PromoteTo(di64, BitCast(di32, f32_to_i32_result))
<< PromoteTo(di64, exponent_adj),
lo64_or_mask);
 }

Specifically `const auto f32_to_i32_result = ConvertTo(di32, adj_v);` hits
overflow
and the masking below should account for that (I tried to preserve masking in
the original sample):

https://github.com/google/highway/blob/2270e77d0d0ccc1d6bc7393f0ebb0b6352ddfd00/hwy/ops/x86_128-inl.h#L10870

  template 
  HWY_API VFromD ConvertTo(D di, VFromD> v) {
const RebindToFloat df;
// See comment at the first occurrence of "IfThenElse(overflow,".
const MFromD overflow = RebindMask(di, Ge(v, Set(df, 2147483648.0f)));
return IfThenElse(overflow, Set(di, LimitsMax()),
  ConvertInRangeTo(di, v));
  }

Is it obvious from the minimized C code where I got it into overflow condition?
Or constant folding propagates through masks here?

I'll try to re-minimize it again.

[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics

2024-05-21 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #7 from Alexander Monakov  ---
> the question is if the intrinsic must behave the same or if those invalid 
> conversions are still unspecified.

I'd say it must match, the whole point of intrinsics is offering portable
interface to specific instructions. If I wanted a conversion with C semantics I
wouldn't need an intrinsic in the first place, it's more clearly expressed in
plain C.

[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics

2024-05-21 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161

--- Comment #6 from Jakub Jelinek  ---
The standard GCC behavior is that out of range floating conversions to integers
result in signed integer maximum if the floating point value sign is clear and
signed integer minimum otherwise (including infinities/nans).

[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics

2024-05-21 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161

--- Comment #5 from Jakub Jelinek  ---
Trying
#include 

int
main ()
{
  float f = 0x0.8p+33f;
  float __attribute__((vector_size (16))) vf = { 0x0.8p+33f, 0x0.8p+33f,
0x0.8p+33f, 0x0.8p+33f };
  int a = f;
  int __attribute__((vector_size (16))) vi = __builtin_convertvector (vf, int
__attribute__((vector_size (16;
  int __attribute__((vector_size (16))) vi2 = (int __attribute__((vector_size
(16 _mm_cvttps_epi32 ((__m128) vf);
  __builtin_printf ("%d\n", a);
  __builtin_printf ("{%d, %d, %d, %d}\n", vi[0], vi[1], vi[2], vi[3]);
  __builtin_printf ("{%d, %d, %d, %d}\n", vi2[0], vi2[1], vi2[2], vi2[3]);
}
with gcc and clang both with -O0 and -O2, this prints
-2147483648
{-2147483648, -2147483648, -2147483648, -2147483648}
{-2147483648, -2147483648, -2147483648, -2147483648}
with both compilers at -O0, clang at -O2 prints
-1720305736
{8524448, 0, 1, 135169}
{-2147483648, -2147483648, -2147483648, -2147483648}
i.e. complete garbage for the non-intrinsic cases, but what the HW does for the
intrinsic cases, while gcc at -O2 prints
2147483647
{2147483647, 2147483647, 2147483647, 2147483647}
{2147483647, 2147483647, 2147483647, 2147483647}
i.e. treats intrinsics and non-intrinsics the same.
GCC behaves this way consistently since GCC 9 (before __builtin_convertvector
hasn't been supported), but if the vi related lines are commented out, even GCC
4.7 works that way.

[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics

2024-05-21 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161

Jakub Jelinek  changed:

   What|Removed |Added

 CC||jakub at gcc dot gnu.org

--- Comment #4 from Jakub Jelinek  ---
(In reply to Sergei Trofimovich from comment #3)
> Looking at -O2's bug.cc.265t.optimized tree optimizations come up with
> unfolded saturated sub8:
> 
>   _12 = __builtin_ia32_psubusb128 ({ -65, 0, 0, 0, -65, 0, 0, 0, -65, 0, 0,
> 0, -65, 0, 0, 0 }, { -99, 0, 0, 0, -99, 0, 0, 0, -99, 0, 0, 0, -99, 0, 0, 0
> });
>   _13 = __builtin_ia32_pminub128 (_12, { 32, 0, 0, 0, 32, 0, 0, 0, 32, 0, 0,
> 0, 32, 0, 0, 0 });
>   ...
> 
> 
> bug.cc.272r.cse1 still has that subtraction:
> 
> 5: r119:V16QI=[`*.LC0']
>   REG_EQUAL const_vector
> 6: r120:V16QI=[`*.LC1']
>   REG_EQUAL const_vector
> 7: r118:V16QI=us_minus(r119:V16QI,r120:V16QI)
> 
> bug.cc.273r.fwprop1 does not anymore:
> 
> 3: NOTE_INSN_BASIC_BLOCK 2
> 2: NOTE_INSN_FUNCTION_BEG
> 9: r122:V16QI=[`*.LC2']
>   REG_EQUAL const_vector
>13: r123:V4SI=r122:V16QI#0<<0x17
>   REG_EQUAL const_vector
>16: r128:SI=0x5f80
>15: r127:V4SI=vec_duplicate(r128:SI)
> 
> Could it be that constant folder "forgot" to generate anything for
> unsupported saturated-sub instead of leaving it as is?

No.  It is normal constant folding on RTL (not done on GIMPLE because
the i386 backend doesn't try to gimple fold __builtin_ia32_psubusb128
or __builtin_ia32_psubusb128).  0xbf - 0x9d is 0x22, so the us_minus works
actually in this case exactly like minus and because 0x20 is smaller than that,
the minimum is a vector with 0x20 elements (plus min (0 - 0, 0) = 0 elements).

The reason the testcase FAILs is the same as in the other PRs, it is trying to
convert
{0x0.8p+33f, 0x0.8p+33f, 0x0.8p+33f, 0x0.8p+33f}
V4SFmode vector to V4SImode, and because the backend sees the constant operand
of the
fix, it folds it to the unspecified value as with scalar conversion.

Consider:
int
main ()
{
  volatile float f = 0x0.8p+33f;
  volatile float __attribute__((vector_size (16))) vf = { 0x0.8p+33f,
0x0.8p+33f, 0x0.8p+33f, 0x0.8p+33f };
  int a = f;
  int __attribute__((vector_size (16))) vi = __builtin_convertvector (vf, int
__attribute__((vector_size (16;
  __builtin_printf ("%d\n", a);
  __builtin_printf ("{%d, %d, %d, %d}\n", vi[0], vi[1], vi[2], vi[3]);
}
This prints
-2147483648
{-2147483648, -2147483648, -2147483648, -2147483648}
at -O0 or -O2, but with -O2 -Dvolatile= prints
2147483647
{2147483647, 2147483647, 2147483647, 2147483647}
instead.
Either is IMHO fine, the C standard doesn't specify what should be the result
of the conversion.
Now, whether for _mm_cvttps_epi32 etc. such cases are also unspecified or not
is debatable.  The Intel spec obviously specifies what the CPU instructions do
even in those otherwise unspecified cases, the question is if the intrinsic
must behave the same or if those invalid conversions are still unspecified.
If they'd be well defined when using the intrinsics, arguably the backend
shouldn't use FIX RTL but some UNSPEC, or should use the FIX RTL conditionally
(if_then_else:SI (argument_is_in_bounds) (fix arg) (const_int 0x800)).

[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics

2024-05-20 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|--- |15.0
 Target||x86_64-*-*
   Keywords||wrong-code
Version|14.0|15.0

[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics

2024-05-20 Thread slyfox at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161

--- Comment #3 from Sergei Trofimovich  ---
Looking at -O2's bug.cc.265t.optimized tree optimizations come up with unfolded
saturated sub8:

  _12 = __builtin_ia32_psubusb128 ({ -65, 0, 0, 0, -65, 0, 0, 0, -65, 0, 0, 0,
-65, 0, 0, 0 }, { -99, 0, 0, 0, -99, 0, 0, 0, -99, 0, 0, 0, -99, 0, 0, 0 });
  _13 = __builtin_ia32_pminub128 (_12, { 32, 0, 0, 0, 32, 0, 0, 0, 32, 0, 0, 0,
32, 0, 0, 0 });
  ...


bug.cc.272r.cse1 still has that subtraction:

5: r119:V16QI=[`*.LC0']
  REG_EQUAL const_vector
6: r120:V16QI=[`*.LC1']
  REG_EQUAL const_vector
7: r118:V16QI=us_minus(r119:V16QI,r120:V16QI)

bug.cc.273r.fwprop1 does not anymore:

3: NOTE_INSN_BASIC_BLOCK 2
2: NOTE_INSN_FUNCTION_BEG
9: r122:V16QI=[`*.LC2']
  REG_EQUAL const_vector
   13: r123:V4SI=r122:V16QI#0<<0x17
  REG_EQUAL const_vector
   16: r128:SI=0x5f80
   15: r127:V4SI=vec_duplicate(r128:SI)

Could it be that constant folder "forgot" to generate anything for unsupported
saturated-sub instead of leaving it as is?

[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics

2024-05-20 Thread roger at nextmovesoftware dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161

Roger Sayle  changed:

   What|Removed |Added

 Ever confirmed|0   |1
 Status|UNCONFIRMED |NEW
   Last reconfirmed||2024-05-20

--- Comment #2 from Roger Sayle  ---
I can confirm that I can reproduce this and see the same thing.
Adding 
  vi tmp1 = Set_i32(INT32_MAX);
  d_i("tmp1",tmp1.raw);
at multiple places in bug.cc, reveals that sometimes the result is the correct
[0x7ff x 4], and at other places is the incorrect [0x8000 x 4], even
though this affected snippet doesn't involve binary operation simplification.

[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics

2024-05-20 Thread slyfox at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161

Sergei Trofimovich  changed:

   What|Removed |Added

 CC||roger at nextmovesoftware dot 
com,
   ||sayle at gcc dot gnu.org

--- Comment #1 from Sergei Trofimovich  ---
Bisect landed on r15-352-gf2449b55fb2d32

  commit f2449b55fb2d32fc4200667ba79847db31f6530d
  Author: Roger Sayle 
  Date:   Thu May 9 22:45:54 2024 +0100

Constant fold {-1,-1} << 1 in simplify-rtx.cc

But I'm not sure if it's the real cause or the code moved around enough. The
sample is somewhat fragile and resisted minimization.