[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161 --- Comment #24 from Sergei Trofimovich --- Thank you, Alexander! Tricky `REG_EQUAL` makes sense.
[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161 --- Comment #23 from Alexander Monakov --- (In reply to Sergei Trofimovich from comment #22) > Here `pcmpeqd %xmm2,%xmm1` is a problematic instruction. Why does `gcc` use > `%xmm2` (result of `cvttps2dq`) instead of, say `%xmm0` which contains > `0x` pattern? %xmm0 contains the first all-ones argument to assert_eq. The comparison needs to happen against a register that contains 0x7fff_ (x4), and GCC "thinks" that %xmm2 contains that value. At -O1, the optimization is done by the RTL CSE pass (you can confirm by switching it off with -fdisable-rtl-cse1 along -O1), it sees (insn 6 5 7 2 (set (reg:V4SF 97) (mem/u/c:V4SF (symbol_ref/u:DI ("*.LC1") [flags 0x2]) [0 S16 A128])) "/usr/lib/gcc/x86_64-pc-linux-gnu/10.2.0/include/xmmintrin.h":386:19 1335 {movv4sf_internal} (expr_list:REG_EQUAL (const_vector:V4SF [ (const_double:SF 4.294967296e+9 [0x0.8p+33]) repeated x4 ]) (nil))) (insn 10 9 11 2 (set (reg:V4SI 98) (fix:V4SI (reg:V4SF 97))) "/usr/lib/gcc/x86_64-pc-linux-gnu/10.2.0/include/emmintrin.h":832:19 2635 {fix_truncv4sfv4si2} (expr_list:REG_EQUAL (const_vector:V4SI [ (const_int 2147483647 [0x7fff]) repeated x4 ]) (expr_list:REG_DEAD (reg:V4SF 99) (nil Note the incorrect REG_EQUAL caused by inappropriate use of fix:V4SI where the source code had cvttps2dq.
[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161 --- Comment #22 from Sergei Trofimovich --- (In reply to Sergei Trofimovich from comment #21) gcc generates the following code for this C code: > int main() { > const __m128i su = _mm_set1_epi32(0x4f80); > const __m128 sf = _mm_castsi128_ps(su); > > const __m128 overflow_mask_f32 = _mm_cmpge_ps(sf, > _mm_set1_ps(2147483648.0f)); > const __m128i overflow_mask = _mm_castps_si128(overflow_mask_f32); > > const __m128i conv = _mm_cvttps_epi32(sf); // overflows > const __m128i yes = _mm_set1_epi32(INT32_MAX); > > const __m128i a = _mm_and_si128(overflow_mask, yes); > const __m128i na = _mm_andnot_si128(overflow_mask, conv); > > const __m128i conv_masked = _mm_or_si128(a, na); Dump of assembler code for function main: 0x00401020 <+0>: sub$0x8,%rsp 0x00401024 <+4>: movss 0xfdc(%rip),%xmm2# 0x402008 0x0040102c <+12>:movss 0xfd0(%rip),%xmm0# 0x402004 0x00401034 <+20>:movss 0xfd0(%rip),%xmm3# 0x40200c 0x0040103c <+28>:shufps $0x0,%xmm2,%xmm2 0x00401040 <+32>:shufps $0x0,%xmm0,%xmm0 0x00401044 <+36>:cmpleps %xmm2,%xmm0 0x00401048 <+40>:cvttps2dq %xmm2,%xmm2 0x0040104c <+44>:shufps $0x0,%xmm3,%xmm3 0x00401050 <+48>:movdqa %xmm0,%xmm1 0x00401054 <+52>:andps %xmm3,%xmm0 0x00401057 <+55>:pandn %xmm2,%xmm1 0x0040105b <+59>:por%xmm0,%xmm1 All of this all looks fine. > const __m128i actual = _mm_cmpeq_epi32(conv_masked, > _mm_set1_epi32(INT32_MAX)); > const __m128i expected = _mm_set1_epi32(-1); 0x0040105f <+63>:pcmpeqd %xmm0,%xmm0 0x00401063 <+67>:pcmpeqd %xmm2,%xmm1 0x00401067 <+71>:call 0x401160 <_ZL9assert_eqDv2_xS_> Here `pcmpeqd %xmm2,%xmm1` is a problematic instruction. Why does `gcc` use `%xmm2` (result of `cvttps2dq`) instead of, say `%xmm0` which contains `0x` pattern? > assert_eq(expected, actual); > } > ```
[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161 --- Comment #21 from Sergei Trofimovich --- Shrunk the example down to a single simpler function while preserving the original masking intent: ```c cat bug.cc #include #include #include __attribute__((noipa)) static void assert_eq_p(void * l, void * r) { char lb[16]; char rb[16]; __builtin_memcpy(lb, l, 16); __builtin_memcpy(rb, r, 16); if (__builtin_memcmp(lb, rb, 16) != 0) __builtin_trap(); } __attribute__((noipa)) static void assert_eq(__m128i l, __m128i r) { assert_eq_p(&l, &r); } int main() { const __m128i su = _mm_set1_epi32(0x4f80); const __m128 sf = _mm_castsi128_ps(su); const __m128 overflow_mask_f32 = _mm_cmpge_ps(sf, _mm_set1_ps(2147483648.0f)); const __m128i overflow_mask = _mm_castps_si128(overflow_mask_f32); const __m128i conv = _mm_cvttps_epi32(sf); // overflows const __m128i yes = _mm_set1_epi32(INT32_MAX); const __m128i a = _mm_and_si128(overflow_mask, yes); const __m128i na = _mm_andnot_si128(overflow_mask, conv); const __m128i conv_masked = _mm_or_si128(a, na); const __m128i actual = _mm_cmpeq_epi32(conv_masked, _mm_set1_epi32(INT32_MAX)); const __m128i expected = _mm_set1_epi32(-1); assert_eq(expected, actual); } ``` The discrepancy: Ok: $ /tmp/gb/gcc/xg++ -Wall -B/tmp/gb/gcc bug.cc -o bug -O0 && ./bug Bad: $ /tmp/gb/gcc/xg++ -Wall -B/tmp/gb/gcc bug.cc -o bug -O2 && ./bug Illegal instruction (core dumped)
[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161 --- Comment #20 from Alexander Monakov --- (In reply to Jakub Jelinek from comment #19) > If we guarantee that we never constant fold FIX/UNSIGNED_FIX with > -ftrapping-math (we shouldn't, as the exceptions should be raised), then > using FIX/UNSIGNED_FIX is ok in > flag_trapping_math guarded patterns even if the intrinsics have some > specific value they want in those cases, because they will never be folded. No, this sounds illogical. The compiler may fold out-of-range conversions just fine, it just needs to preserve the original operation (with its result now dead), or use any other opportunity to ensure that exceptions are still raised. Specifically, it is okay to transform float f; int i; f = 0x0.8p+33f; i = (int) f; to f = 0x0.8p+33f; dummy = (int) f; i = INT_MAX; // or whatever The "known" value of 'i' may then facilitate further folding.
[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161 --- Comment #19 from Jakub Jelinek --- (In reply to Alexander Monakov from comment #18) > No, allowing value-changing transformations under -ftrapping-math is really > not appropriate. Invoking the intrinsic on a large floating-point value is > not UB. If we guarantee that we never constant fold FIX/UNSIGNED_FIX with -ftrapping-math (we shouldn't, as the exceptions should be raised), then using FIX/UNSIGNED_FIX is ok in flag_trapping_math guarded patterns even if the intrinsics have some specific value they want in those cases, because they will never be folded. While for !flag_trapping_math, we should use UNSPEC because FIX/UNSIGNED_FIX can be folded and would be folded to whatever the generic code decides, which might disagree to what the intrinsics need.
[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161 --- Comment #18 from Alexander Monakov --- No, allowing value-changing transformations under -ftrapping-math is really not appropriate. Invoking the intrinsic on a large floating-point value is not UB.
[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161 --- Comment #17 from Jakub Jelinek --- I don't think the cost of using UNSPEC would be significant if the backend tried to constant fold more target builtins. Anyway, with the proposed changes perhaps you could keep using FIX/UNSIGNED_FIX for flag_trapping_math case even for the intrinsics and use UNSPECs only for !flag_trapping_math.
[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161 --- Comment #16 from Hongtao Liu --- > > That said, this change really won't help the backend which supposedly should > have the same behavior regardless of -fno-trapping-math, because in that > case it is the value > of the result (which is unspecified by the standards) rather than whether an > exception is triggered or not. First, I agree with you, they're 2 separate issues. What I proposed is just trying to find a balance that makes it possible not to refine all cvtt* instructions to UNSPEC, because that would lose a lot of optimization opportunities. If it can be restricted under flag_trapping_math, at least those intrinsics are fine at O2/O3.
[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161 --- Comment #15 from Sergei Trofimovich --- > I shrunk bug.cc slightly further into shorter-bug.cc and now it fails > equally on gcc-13 and gcc-15. I agree that gcc-15 just got more constant > folds available now, but otherwise it's behaviour did not change. > > But is it correct? Specifically -O0 works: $ /tmp/gb/gcc/xg++ -Wall -B/tmp/gb/gcc bug.cc -o bug -O0 && ./bug overflow(i): 0x 0x And -O2 crashes: $ /tmp/gb/gcc/xg++ -Wall -B/tmp/gb/gcc bug.cc -o bug -O2 && ./bug overflow(i): 0x 0x Illegal instruction (core dumped)
[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161 --- Comment #14 from Sergei Trofimovich --- Created attachment 58265 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=58265&action=edit shorter-bug.cc I shrunk bug.cc slightly further into shorter-bug.cc and now it fails equally on gcc-13 and gcc-15. I agree that gcc-15 just got more constant folds available now, but otherwise it's behaviour did not change. But is it correct?
[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161 Jakub Jelinek changed: What|Removed |Added CC||jsm28 at gcc dot gnu.org --- Comment #13 from Jakub Jelinek --- (In reply to Hongtao Liu from comment #11) > (In reply to Jakub Jelinek from comment #10) > > Any of the floating point to integer intrinsics if they have out of range > > value (haven't checked whether floating point to unsigned intrinsic is a > > problem too or not). > > No matter if it is float or double (dunno if _Float16 too, or __bf16), and > > no matter if it is scalar intrinsic (ss/sd etc.) or vector and how many > > vector elements. > > But, this isn't really a regression, GCC has always behaved that way, the > > only thing that actually changed is that perhaps we can constant fold more > > than we used to do in the past. > > When not using intrinsics, IMNSHO we should keep doing what we did before. > > Can we restrict them under flag_trapping_math? If we do that, we should do it on the fold-const.cc side as well. There we fold it but add TREE_OVERFLOW flag in the overflow cases to the resulting tree, whether it prevents folding or not during GIMPLE passes would need to be figured out. C23 in F.4 says: If the integer type is bool, 6.3.1.2 applies and the conversion raises no floating-point exceptions if the floating-point value is not a signaling NaN. Otherwise, if the floating value is infinite or NaN or if the integral part of the floating value exceeds the range of the integer type, then the "invalid" floating-point exception is raised and the resulting value is unspecified. Otherwise, the resulting value is determined by 6.3.1.4. Conversion of an integral floating value that does not exceed the range of the integer type raises no floating-point exceptions; whether conversion of a non-integral floating value raises the "inexact" floating-point exception is unspecified. That said, this change really won't help the backend which supposedly should have the same behavior regardless of -fno-trapping-math, because in that case it is the value of the result (which is unspecified by the standards) rather than whether an exception is triggered or not.
[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161 Andrew Pinski changed: What|Removed |Added See Also||https://gcc.gnu.org/bugzill ||a/show_bug.cgi?id=11666 --- Comment #12 from Andrew Pinski --- So GCC behavior changed in 2003 to match more of what Java requires ...
[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161 --- Comment #11 from Hongtao Liu --- (In reply to Jakub Jelinek from comment #10) > Any of the floating point to integer intrinsics if they have out of range > value (haven't checked whether floating point to unsigned intrinsic is a > problem too or not). > No matter if it is float or double (dunno if _Float16 too, or __bf16), and > no matter if it is scalar intrinsic (ss/sd etc.) or vector and how many > vector elements. > But, this isn't really a regression, GCC has always behaved that way, the > only thing that actually changed is that perhaps we can constant fold more > than we used to do in the past. > When not using intrinsics, IMNSHO we should keep doing what we did before. Can we restrict them under flag_trapping_math? .i.e diff --git a/gcc/simplify-rtx.cc b/gcc/simplify-rtx.cc index 53f54d1d392..b7a770dad60 100644 --- a/gcc/simplify-rtx.cc +++ b/gcc/simplify-rtx.cc @@ -2256,14 +2256,25 @@ simplify_const_unary_operation (enum rtx_code code, machine_mode mode, switch (code) { case FIX: + /* According to IEEE standard, for conversions from floating point to +integer. When a NaN or infinite operand cannot be represented in +the destination format and this cannot otherwise be indicated, the +invalid operation exception shall be signaled. When a numeric +operand would convert to an integer outside the range of the +destination format, the invalid operation exception shall be +signaled if this situation cannot otherwise be indicated. */ if (REAL_VALUE_ISNAN (*x)) - return const0_rtx; + return flag_trapping_math ? NULL_RTX : const0_rtx; + + if (REAL_VALUE_ISINF (*x) && flag_trapping_math) + return NULL_RTX; /* Test against the signed upper bound. */ wmax = wi::max_value (width, SIGNED); real_from_integer (&t, VOIDmode, wmax, SIGNED); if (real_less (&t, x)) - return immed_wide_int_const (wmax, mode); + return (flag_trapping_math + ? NULL_RTX : immed_wide_int_const (wmax, mode)); /* Test against the signed lower bound. */ wmin = wi::min_value (width, SIGNED); @@ -2276,13 +2287,17 @@ simplify_const_unary_operation (enum rtx_code code, machine_mode mode, case UNSIGNED_FIX: if (REAL_VALUE_ISNAN (*x) || REAL_VALUE_NEGATIVE (*x)) - return const0_rtx; + return flag_trapping_math ? NULL_RTX : const0_rtx; + + if (REAL_VALUE_ISINF (*x) && flag_trapping_math) + return NULL_RTX; /* Test against the unsigned upper bound. */ wmax = wi::max_value (width, UNSIGNED); real_from_integer (&t, VOIDmode, wmax, UNSIGNED); if (real_less (&t, x)) - return immed_wide_int_const (wmax, mode); + return (flag_trapping_math + ? NULL_RTX : immed_wide_int_const (wmax, mode)); return immed_wide_int_const (real_to_integer (x, &fail, width),
[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161 --- Comment #10 from Jakub Jelinek --- Any of the floating point to integer intrinsics if they have out of range value (haven't checked whether floating point to unsigned intrinsic is a problem too or not). No matter if it is float or double (dunno if _Float16 too, or __bf16), and no matter if it is scalar intrinsic (ss/sd etc.) or vector and how many vector elements. But, this isn't really a regression, GCC has always behaved that way, the only thing that actually changed is that perhaps we can constant fold more than we used to do in the past. When not using intrinsics, IMNSHO we should keep doing what we did before.
[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161 --- Comment #9 from Jakub Jelinek --- In that case we should separate *.md patterns which are used for C conversions from the patterns used by the intrinsics, keep using what we are right now for the former and either use UNSPEC (the quickest but not best code generating variant) or something more complex. If UNSPEC is used, we could get some constant folding back by adding gimple_fold handling for those and the like of __builtin_ia32_psubusb128 or __builtin_ia32_pminub128. For __builtin_ia32_cvttps2dq etc. obviously the folding should either punt folding if some argument is out of range or fold those to what the hw does.
[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161 --- Comment #8 from Sergei Trofimovich --- Thank you, Jakub! > The reason the testcase FAILs is the same as in the other PRs, it is trying > to convert {0x0.8p+33f, 0x0.8p+33f, 0x0.8p+33f, 0x0.8p+33f} V4SFmode vector > to V4SImode, and because the backend sees the constant operand of the fix, it > folds it to the unspecified value as with scalar conversion. To be super-clear: the problem is the out-of-range value, not just any V4SFmode->V4SImode, right? Specifically, float32{INT32_MAX} -> int32_t should be fine, right? I was trying to extract the following example (and likely failed): It tries very hard not to pass anything outside float32{INT32_MAX} to (a different) `PromoteTo()` at the end of the function from https://github.com/google/highway/blob/2270e77d0d0ccc1d6bc7393f0ebb0b6352ddfd00/hwy/ops/x86_128-inl.h#L10275 HWY_API VFromD PromoteTo(D di64, VFromD> v) { const Rebind di32; const RebindToFloat df32; const RebindToUnsigned du32; const Repartition du32_as_du8; const auto exponent_adj = BitCast( du32, Min(SaturatedSub(BitCast(du32_as_du8, ShiftRight<23>(BitCast(du32, v))), BitCast(du32_as_du8, Set(du32, uint32_t{157}))), BitCast(du32_as_du8, Set(du32, uint32_t{32}; const auto adj_v = BitCast(df32, BitCast(du32, v) - ShiftLeft<23>(exponent_adj)); const auto f32_to_i32_result = ConvertTo(di32, adj_v); const auto lo64_or_mask = PromoteTo( di64, BitCast(du32, VecFromMask(di32, Eq(f32_to_i32_result, Set(di32, LimitsMax()); return Or(PromoteTo(di64, BitCast(di32, f32_to_i32_result)) << PromoteTo(di64, exponent_adj), lo64_or_mask); } Specifically `const auto f32_to_i32_result = ConvertTo(di32, adj_v);` hits overflow and the masking below should account for that (I tried to preserve masking in the original sample): https://github.com/google/highway/blob/2270e77d0d0ccc1d6bc7393f0ebb0b6352ddfd00/hwy/ops/x86_128-inl.h#L10870 template HWY_API VFromD ConvertTo(D di, VFromD> v) { const RebindToFloat df; // See comment at the first occurrence of "IfThenElse(overflow,". const MFromD overflow = RebindMask(di, Ge(v, Set(df, 2147483648.0f))); return IfThenElse(overflow, Set(di, LimitsMax()), ConvertInRangeTo(di, v)); } Is it obvious from the minimized C code where I got it into overflow condition? Or constant folding propagates through masks here? I'll try to re-minimize it again.
[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #7 from Alexander Monakov --- > the question is if the intrinsic must behave the same or if those invalid > conversions are still unspecified. I'd say it must match, the whole point of intrinsics is offering portable interface to specific instructions. If I wanted a conversion with C semantics I wouldn't need an intrinsic in the first place, it's more clearly expressed in plain C.
[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161 --- Comment #6 from Jakub Jelinek --- The standard GCC behavior is that out of range floating conversions to integers result in signed integer maximum if the floating point value sign is clear and signed integer minimum otherwise (including infinities/nans).
[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161 --- Comment #5 from Jakub Jelinek --- Trying #include int main () { float f = 0x0.8p+33f; float __attribute__((vector_size (16))) vf = { 0x0.8p+33f, 0x0.8p+33f, 0x0.8p+33f, 0x0.8p+33f }; int a = f; int __attribute__((vector_size (16))) vi = __builtin_convertvector (vf, int __attribute__((vector_size (16; int __attribute__((vector_size (16))) vi2 = (int __attribute__((vector_size (16 _mm_cvttps_epi32 ((__m128) vf); __builtin_printf ("%d\n", a); __builtin_printf ("{%d, %d, %d, %d}\n", vi[0], vi[1], vi[2], vi[3]); __builtin_printf ("{%d, %d, %d, %d}\n", vi2[0], vi2[1], vi2[2], vi2[3]); } with gcc and clang both with -O0 and -O2, this prints -2147483648 {-2147483648, -2147483648, -2147483648, -2147483648} {-2147483648, -2147483648, -2147483648, -2147483648} with both compilers at -O0, clang at -O2 prints -1720305736 {8524448, 0, 1, 135169} {-2147483648, -2147483648, -2147483648, -2147483648} i.e. complete garbage for the non-intrinsic cases, but what the HW does for the intrinsic cases, while gcc at -O2 prints 2147483647 {2147483647, 2147483647, 2147483647, 2147483647} {2147483647, 2147483647, 2147483647, 2147483647} i.e. treats intrinsics and non-intrinsics the same. GCC behaves this way consistently since GCC 9 (before __builtin_convertvector hasn't been supported), but if the vi related lines are commented out, even GCC 4.7 works that way.
[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161 Jakub Jelinek changed: What|Removed |Added CC||jakub at gcc dot gnu.org --- Comment #4 from Jakub Jelinek --- (In reply to Sergei Trofimovich from comment #3) > Looking at -O2's bug.cc.265t.optimized tree optimizations come up with > unfolded saturated sub8: > > _12 = __builtin_ia32_psubusb128 ({ -65, 0, 0, 0, -65, 0, 0, 0, -65, 0, 0, > 0, -65, 0, 0, 0 }, { -99, 0, 0, 0, -99, 0, 0, 0, -99, 0, 0, 0, -99, 0, 0, 0 > }); > _13 = __builtin_ia32_pminub128 (_12, { 32, 0, 0, 0, 32, 0, 0, 0, 32, 0, 0, > 0, 32, 0, 0, 0 }); > ... > > > bug.cc.272r.cse1 still has that subtraction: > > 5: r119:V16QI=[`*.LC0'] > REG_EQUAL const_vector > 6: r120:V16QI=[`*.LC1'] > REG_EQUAL const_vector > 7: r118:V16QI=us_minus(r119:V16QI,r120:V16QI) > > bug.cc.273r.fwprop1 does not anymore: > > 3: NOTE_INSN_BASIC_BLOCK 2 > 2: NOTE_INSN_FUNCTION_BEG > 9: r122:V16QI=[`*.LC2'] > REG_EQUAL const_vector >13: r123:V4SI=r122:V16QI#0<<0x17 > REG_EQUAL const_vector >16: r128:SI=0x5f80 >15: r127:V4SI=vec_duplicate(r128:SI) > > Could it be that constant folder "forgot" to generate anything for > unsupported saturated-sub instead of leaving it as is? No. It is normal constant folding on RTL (not done on GIMPLE because the i386 backend doesn't try to gimple fold __builtin_ia32_psubusb128 or __builtin_ia32_psubusb128). 0xbf - 0x9d is 0x22, so the us_minus works actually in this case exactly like minus and because 0x20 is smaller than that, the minimum is a vector with 0x20 elements (plus min (0 - 0, 0) = 0 elements). The reason the testcase FAILs is the same as in the other PRs, it is trying to convert {0x0.8p+33f, 0x0.8p+33f, 0x0.8p+33f, 0x0.8p+33f} V4SFmode vector to V4SImode, and because the backend sees the constant operand of the fix, it folds it to the unspecified value as with scalar conversion. Consider: int main () { volatile float f = 0x0.8p+33f; volatile float __attribute__((vector_size (16))) vf = { 0x0.8p+33f, 0x0.8p+33f, 0x0.8p+33f, 0x0.8p+33f }; int a = f; int __attribute__((vector_size (16))) vi = __builtin_convertvector (vf, int __attribute__((vector_size (16; __builtin_printf ("%d\n", a); __builtin_printf ("{%d, %d, %d, %d}\n", vi[0], vi[1], vi[2], vi[3]); } This prints -2147483648 {-2147483648, -2147483648, -2147483648, -2147483648} at -O0 or -O2, but with -O2 -Dvolatile= prints 2147483647 {2147483647, 2147483647, 2147483647, 2147483647} instead. Either is IMHO fine, the C standard doesn't specify what should be the result of the conversion. Now, whether for _mm_cvttps_epi32 etc. such cases are also unspecified or not is debatable. The Intel spec obviously specifies what the CPU instructions do even in those otherwise unspecified cases, the question is if the intrinsic must behave the same or if those invalid conversions are still unspecified. If they'd be well defined when using the intrinsics, arguably the backend shouldn't use FIX RTL but some UNSPEC, or should use the FIX RTL conditionally (if_then_else:SI (argument_is_in_bounds) (fix arg) (const_int 0x800)).
[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161 Richard Biener changed: What|Removed |Added Target Milestone|--- |15.0 Target||x86_64-*-* Keywords||wrong-code Version|14.0|15.0
[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161 --- Comment #3 from Sergei Trofimovich --- Looking at -O2's bug.cc.265t.optimized tree optimizations come up with unfolded saturated sub8: _12 = __builtin_ia32_psubusb128 ({ -65, 0, 0, 0, -65, 0, 0, 0, -65, 0, 0, 0, -65, 0, 0, 0 }, { -99, 0, 0, 0, -99, 0, 0, 0, -99, 0, 0, 0, -99, 0, 0, 0 }); _13 = __builtin_ia32_pminub128 (_12, { 32, 0, 0, 0, 32, 0, 0, 0, 32, 0, 0, 0, 32, 0, 0, 0 }); ... bug.cc.272r.cse1 still has that subtraction: 5: r119:V16QI=[`*.LC0'] REG_EQUAL const_vector 6: r120:V16QI=[`*.LC1'] REG_EQUAL const_vector 7: r118:V16QI=us_minus(r119:V16QI,r120:V16QI) bug.cc.273r.fwprop1 does not anymore: 3: NOTE_INSN_BASIC_BLOCK 2 2: NOTE_INSN_FUNCTION_BEG 9: r122:V16QI=[`*.LC2'] REG_EQUAL const_vector 13: r123:V4SI=r122:V16QI#0<<0x17 REG_EQUAL const_vector 16: r128:SI=0x5f80 15: r127:V4SI=vec_duplicate(r128:SI) Could it be that constant folder "forgot" to generate anything for unsupported saturated-sub instead of leaving it as is?
[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161 Roger Sayle changed: What|Removed |Added Ever confirmed|0 |1 Status|UNCONFIRMED |NEW Last reconfirmed||2024-05-20 --- Comment #2 from Roger Sayle --- I can confirm that I can reproduce this and see the same thing. Adding vi tmp1 = Set_i32(INT32_MAX); d_i("tmp1",tmp1.raw); at multiple places in bug.cc, reveals that sometimes the result is the correct [0x7ff x 4], and at other places is the incorrect [0x8000 x 4], even though this affected snippet doesn't involve binary operation simplification.
[Bug target/115161] [15 Regression] highway-1.0.7 miscompilation of some SSE2 intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115161 Sergei Trofimovich changed: What|Removed |Added CC||roger at nextmovesoftware dot com, ||sayle at gcc dot gnu.org --- Comment #1 from Sergei Trofimovich --- Bisect landed on r15-352-gf2449b55fb2d32 commit f2449b55fb2d32fc4200667ba79847db31f6530d Author: Roger Sayle Date: Thu May 9 22:45:54 2024 +0100 Constant fold {-1,-1} << 1 in simplify-rtx.cc But I'm not sure if it's the real cause or the code moved around enough. The sample is somewhat fragile and resisted minimization.