[Bug tree-optimization/114339] [14 regression] Tor miscompiled with -O2 -mavx -fno-vect-cost-model since r14-6822

2024-03-14 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114339

--- Comment #9 from Andrew Pinski  ---
Reduced testcase:
```
#define vect128 __attribute__((vector_size(16)))

[[gnu::noinline]]
vect128 long f(vect128 long a)
{
return a <= (vect128 long){0, 9223372036854775807};
}

int main()
{
  vect128 long t = (vect128 long){0, 0};
  t = f(t);
  if (!t[1])
  __builtin_abort();

}
```

[Bug tree-optimization/114339] [14 regression] Tor miscompiled with -O2 -mavx -fno-vect-cost-model since r14-6822

2024-03-14 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114339

--- Comment #8 from Jakub Jelinek  ---
Slightly simplified/cleaned up testcase:
/* { dg-do run } */
/* { dg-options "-O2 -fno-vect-cost-model" } */
/* { dg-additional-options "-mavx" { target avx_runtime } } */

struct S { int a; long b; int c; } s;

__attribute__((noipa)) struct S *
foo (void)
{
  return 
}

int
main ()
{
  struct S r = *foo ();
  long t = 10412095L - r.b;
  struct { long d; int e; } f[4] = { {}, {}, {}, { __LONG_MAX__, 0 } };
  for (unsigned i = 0; i < 4; ++i)
if (t <= f[i].d)
  return f[i].e;
  __builtin_abort ();
}

[Bug tree-optimization/114339] [14 regression] Tor miscompiled with -O2 -mavx -fno-vect-cost-model since r14-6822

2024-03-14 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114339

--- Comment #7 from Andrew Pinski  ---

This looks wrong:
```
;; mask_patt_17.15_55 = vect_cst__53 <= { 0, 9223372036854775807 };

(insn 21 20 22 (set (reg:V2DI 111)
(mem/u/c:V2DI (symbol_ref/u:DI ("*.LC1") [flags 0x2]) [0  S16 A128]))
-1
 (expr_list:REG_EQUAL (const_vector:V2DI [
(const_int 1 [0x1])
(const_int -9223372036854775808 [0x8000])
])
(nil)))

(insn 22 21 23 (set (reg:V2DI 112)
(gt:V2DI (reg:V2DI 111)
(reg:V2DI 100 [ vect_cst__53 ]))) -1
 (nil))

(insn 23 22 0 (set (reg:V2DI 102 [ mask_patt_17.15D.3121 ])
(reg:V2DI 112)) -1
 (nil))
```

we go from `a <= INT_MAX` to `INT_MIN > a` which is basically going from `true`
to `a != INT_MIN`.

[Bug tree-optimization/114339] [14 regression] Tor miscompiled with -O2 -mavx -fno-vect-cost-model since r14-6822

2024-03-14 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114339

--- Comment #6 from Tamar Christina  ---
vectorizer generates:

  mask_patt_21.19_58 = vect_perm_even_49 >= vect_cst__57;
  mask_patt_21.19_59 = vect_perm_even_55 >= vect_cst__57;
  vexit_reduc_63 = mask_patt_21.19_58 | mask_patt_21.19_59;
  if (vexit_reduc_63 != { 0, 0 })
goto ; [20.00%]
  else
goto ; [80.00%]

This is changed at loopdone into:

  delays[3].nonprimary_delay = 129600;
  vect_cst__57 = {tdiff_6, tdiff_6};
  mask_patt_21.19_58 = vect_cst__57 <= { 0, 0 };
  mask_patt_21.19_59 = vect_cst__57 <= { 0, 0x7FFF };
  vexit_reduc_63 = mask_patt_21.19_58 | mask_patt_21.19_59;
  if (vexit_reduc_63 != { 0, 0 })
goto ; [20.00%]
  else
goto ; [80.00%]

or in other words, if there's any value where the compare succeeds, find it and
return.
This looks correct to me.

It could be that my AVX is rusty but, this generates:

   vmovdqa 0xf9c(%rip),%xmm1# 0x402010 
   mov$0x1,%eax
   vmovq  %rcx,%xmm3
   vmovdqa %xmm0,(%rsp)
   vpunpcklqdq %xmm3,%xmm3,%xmm2
   vmovdqa %xmm0,0x10(%rsp)
   vpcmpgtq %xmm2,%xmm1,%xmm1#
   vmovdqa %xmm0,0x20(%rsp)
   vmovq  %rax,%xmm0
   vpunpcklqdq %xmm0,%xmm0,%xmm0
   movl   $0x1fa40,0x38(%rsp)
   vpcmpgtq %xmm2,%xmm0,%xmm0
   vpor   %xmm1,%xmm0,%xmm0
   vptest %xmm0,%xmm0

which looks off, particularly for the second compare it look like it doesn't do
a load but instead just duplicates the constant 1.
gdb seems to confirm this. At the first compare:

(gdb) p $xmm2.v2_int64
$4 = {10412095, 10412095}
(gdb) p $xmm0.v2_int64
$5 = {0, 0}

which is what's expected, but at the second compare:

(gdb) p $xmm2.v2_int64
$7 = {10412095, 10412095}
(gdb) p $xmm0.v2_int64
$6 = {1, 1}

at the second it's comparing {1, 1} instead of {0, 0x7FFF}.

on AArch64 where it doesn't fail the comparison is:

   moviv29.4s, 0
   add x1, sp, 16
   ldr x5, [x0, 8]
   mov w0, 64064
   movkw0, 0x1, lsl 16
   add x3, sp, 48
   str q29, [sp, 64]
   mov x2, 57407
   mov x4, 9223372036854775807
   str x4, [sp, 64]
   movkx2, 0x9e, lsl 16
   str w0, [sp, 72]
   sub x2, x2, x5
   stp q29, q29, [x1]
   dup v27.2d, x2
   ld2 {v30.2d - v31.2d}, [x1]
   str q29, [sp, 48]
   ld2 {v28.2d - v29.2d}, [x3]
   cmgev30.2d, v30.2d, v27.2d
   cmgev28.2d, v28.2d, v27.2d
   orr v30.16b, v30.16b, v28.16b
   umaxp   v30.4s, v30.4s, v30.4s
   fmovx0, d30
   cbnzx0, .L12

which has v30.2d being {0, 0} and v28.2d being {0, 0x7FFF} as
expected...

On AArch64 we don't inline the constants because whatever is propagating the
constants can't understand the LOAD_LANES:

  mask_patt_19.21_50 = vect__2.16_44 >= vect_cst__49;
  mask_patt_19.21_51 = vect__2.19_47 >= vect_cst__49;
  vexit_reduc_55 = mask_patt_19.21_50 | mask_patt_19.21_51;
  if (vexit_reduc_55 != { 0, 0 })
goto ; [20.00%]
  else
goto ; [80.00%]

so could this be another expansion bug?

Note that a simpler reproducer is this:

---
long tdiff = 10412095;

int main() {
  struct {
long maximum;
int nonprimary_delay;
  } delays[] = {{}, {}, {}, {9223372036854775807, 36 * 60 * 60}};

  for (unsigned i = 0; i < sizeof(delays) / sizeof(delays[0]); ++i)
if (tdiff <= delays[i].maximum)
  return delays[i].nonprimary_delay;

  __builtin_abort();
}
---

the key point is that we're not allowed to constprop tdiff at GIMPLE. If we do,
e.g:

int main() {
  struct {
long maximum;
int nonprimary_delay;
  } delays[] = {{}, {}, {}, {9223372036854775807, 36 * 60 * 60}};
  long tdiff = 10412095;

  for (unsigned i = 0; i < sizeof(delays) / sizeof(delays[0]); ++i)
if (tdiff <= delays[i].maximum)
  return delays[i].nonprimary_delay;

  __builtin_abort();
}

then after vectorization the const prop the entire expression is evaluated at
GIMPLE and it gets the right result.

This makes me believe it's a target expansion bug.

[Bug tree-optimization/114339] [14 regression] Tor miscompiled with -O2 -mavx -fno-vect-cost-model since r14-6822

2024-03-14 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114339

Jakub Jelinek  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
Summary|[14 regression] Tor |[14 regression] Tor
   |miscompiled with -O2 -mavx  |miscompiled with -O2 -mavx
   |-fno-vect-cost-model|-fno-vect-cost-model since
   ||r14-6822
 Ever confirmed|0   |1
   Priority|P3  |P1
   Target Milestone|--- |14.0
   Last reconfirmed||2024-03-14
 CC||jakub at gcc dot gnu.org

--- Comment #5 from Jakub Jelinek  ---
Started with r14-6822-g01f4251b8775c832a92d55e2df57c9ac72eaceef

[Bug tree-optimization/114339] [14 regression] Tor miscompiled with -O2 -mavx -fno-vect-cost-model

2024-03-14 Thread sjames at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114339

--- Comment #4 from Sam James  ---
(In reply to Sam James from comment #3)
> Created attachment 57705 [details]
> larger.i
> 
> Ah, wait, that might be a bad reduction. Let me attach a larger one, then I
> can give the original if needed too.

OK, no, I think https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114339#c2 is fine.

[Bug tree-optimization/114339] [14 regression] Tor miscompiled with -O2 -mavx -fno-vect-cost-model

2024-03-14 Thread sjames at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114339

--- Comment #3 from Sam James  ---
Created attachment 57705
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57705=edit
larger.i

Ah, wait, that might be a bad reduction. Let me attach a larger one, then I can
give the original if needed too.

[Bug tree-optimization/114339] [14 regression] Tor miscompiled with -O2 -mavx -fno-vect-cost-model

2024-03-14 Thread sjames at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114339

Sam James  changed:

   What|Removed |Added

 CC||tnfchris at gcc dot gnu.org

--- Comment #2 from Sam James  ---
I've minimised it to:
```
struct entry_guard_t {
  int is_reachable;
  long failing_since;
  int is_primary;
} *create_guard() {
  struct entry_guard_t *guard = __builtin_malloc(sizeof *guard);
  guard->is_reachable = guard->failing_since = guard->is_primary = 0;
  return guard;
}
int main() {
  struct entry_guard_t guard = *create_guard();
  long tdiff = 10412095 - guard.failing_since;
  struct {
long maximum;
int nonprimary_delay;
  } delays[] = {{}, {}, {}, {9223372036854775807, 36 * 60 * 60}};
  unsigned i = 0;
  for (; i < sizeof(delays) / sizeof(delays[0]); ++i)
if (tdiff <= delays[i].maximum)
  return delays[i].nonprimary_delay;
  __builtin_abort();
}
```

This fails for me with `-O2 -mavx -fno-vect-cost-model`.

[Bug tree-optimization/114339] [14 regression] Tor miscompiled with -O2 -mavx -fno-vect-cost-model

2024-03-14 Thread sjames at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114339

--- Comment #1 from Sam James  ---
The assert is at
https://gitlab.torproject.org/tpo/core/tor/-/blob/tor-0.4.8.10/src/feature/client/entrynodes.c#L2072

```
(gdb) p delays
$3 = {{
maximum = 21600,
primary_delay = 600,
nonprimary_delay = 3600
  }, {
maximum = 345600,
primary_delay = 5400,
nonprimary_delay = 14400
  }, {
maximum = 604800,
primary_delay = 14400,
nonprimary_delay = 64800
  }, {
maximum = 9223372036854775807,
primary_delay = 32400,
nonprimary_delay = 129600
  }}
(gdb)
```

The bad loop (confirmed w/ novector pragma) is:
  for (i = 0; i < ARRAY_LENGTH(delays); ++i) {
if (tdiff <= delays[i].maximum) {
  return is_primary ? delays[i].primary_delay : delays[i].nonprimary_delay;
}
  }