[Bug tree-optimization/113678] SLP misses up vec_concat

2024-02-06 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113678

--- Comment #3 from Andrew Pinski  ---
Note the SLP that happens in connection with the loop vectorizer actually does
a decent job ...

2024-02-06 Thread pinskia at gcc dot gnu.org via Gcc-bugs

--- Comment #2 from Andrew Pinski  ---
Noticed the same with:
```
void f(unsigned char *a, unsigned char *b, unsigned char *c)
{
  unsigned char t[8];
  t[0] = a[0];
  t[1] = a[1];
  t[2] = a[2];
  t[3] = a[3];
  t[4] = b[0];
  t[5] = b[1];
  t[6] = b[2];
  t[7] = b[3];
  c[0] = t[0];
  c[1] = t[1];
  c[2] = t[2];
  c[3] = t[3];
  c[4] = t[4];
  c[5] = t[5];
  c[6] = t[6];
  c[7] = t[7];
}
```

Adding `-fno-tree-vectorize` even gives the best code.
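
For reference, the shape one would hope for from `f` is roughly two 4-byte loads concatenated into a single 8-byte store. The following is a hand-written sketch of that shape, not compiler output. The name `f_concat` is hypothetical, and `memcpy` is used only to keep the unaligned accesses well-defined in C. Like the original, it reads both inputs before writing `c`.

```c
#include <string.h>

/* Hypothetical hand-written equivalent of f(): two 4-byte loads
   concatenated into one 8-byte value, then stored once.  */
void f_concat(const unsigned char *a, const unsigned char *b, unsigned char *c)
{
  unsigned char t[8];
  memcpy(t, a, 4);      /* low half:  a[0..3] */
  memcpy(t + 4, b, 4);  /* high half: b[0..3] */
  memcpy(c, t, 8);      /* one 8-byte store to c[0..7] */
}
```

Whether a given target actually lowers this to a vec_concat depends on the backend; the sketch only spells out the intended data flow.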

2024-01-31 Thread rguenth at gcc dot gnu.org via Gcc-bugs

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2024-01-31
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener  ---
I think the SLP tree we discover is sound:

t2.c:11:14: note:   node 0x5db76f0 (max_nunits=8, refcnt=2) vector(8) char
t2.c:11:14: note:   op template: *a_7(D) = _1;
t2.c:11:14: note:   stmt 0 *a_7(D) = _1;
t2.c:11:14: note:   stmt 1 MEM[(char *)a_7(D) + 1B] = _2;
t2.c:11:14: note:   stmt 2 MEM[(char *)a_7(D) + 2B] = _3;
t2.c:11:14: note:   stmt 3 MEM[(char *)a_7(D) + 3B] = _4;
t2.c:11:14: note:   stmt 4 MEM[(char *)a_7(D) + 4B] = _1;
t2.c:11:14: note:   stmt 5 MEM[(char *)a_7(D) + 5B] = _2;
t2.c:11:14: note:   stmt 6 MEM[(char *)a_7(D) + 6B] = _3;
t2.c:11:14: note:   stmt 7 MEM[(char *)a_7(D) + 7B] = _4;
t2.c:11:14: note:   children 0x5db7778
t2.c:11:14: note:   node 0x5db7778 (max_nunits=8, refcnt=2) vector(8) char
t2.c:11:14: note:   op template: _1 = *b_6(D);
t2.c:11:14: note:   stmt 0 _1 = *b_6(D);
t2.c:11:14: note:   stmt 1 _2 = MEM[(char *)b_6(D) + 1B];
t2.c:11:14: note:   stmt 2 _3 = MEM[(char *)b_6(D) + 2B];
t2.c:11:14: note:   stmt 3 _4 = MEM[(char *)b_6(D) + 3B];
t2.c:11:14: note:   stmt 4 _1 = *b_6(D);
t2.c:11:14: note:   stmt 5 _2 = MEM[(char *)b_6(D) + 1B];
t2.c:11:14: note:   stmt 6 _3 = MEM[(char *)b_6(D) + 2B];
t2.c:11:14: note:   stmt 7 _4 = MEM[(char *)b_6(D) + 3B];
t2.c:11:14: note:   load permutation { 0 1 2 3 0 1 2 3 }

The issue is, as so often:

t2.c:11:14: note:   ==> examining statement: _1 = *b_6(D);
t2.c:11:14: missed:   BB vectorization with gaps at the end of a load is not supported
t2.c:3:19: missed:   not vectorized: relevant stmt not supported: _1 = *b_6(D);
t2.c:11:14: note:   Building vector operands of 0x5db7778 from scalars instead

where we do little beyond ad-hoc handling of those "out-of-bounds"
accesses.  The obvious choice here would be to do a single vector(4)
load instead.
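
Spelled out by hand, that suggestion corresponds to something like the sketch below. It is an illustration based on the dump above (four bytes loaded from `b`, stored to `a[0..7]` with the load permutation { 0 1 2 3 0 1 2 3 }), not the vectorizer's actual output, and the name `g_single_load` is hypothetical.

```c
#include <string.h>

/* Hypothetical sketch: load b[0..3] once (the "vector(4) load"), then
   place those four bytes in both halves of a[0..7].  Nothing past
   b[3] is read, so there is no gap at the end of the load.  */
void g_single_load(char *a, const char *b)
{
  char v[4];
  memcpy(v, b, 4);      /* one 4-byte load */
  memcpy(a, v, 4);      /* lanes 0..3 */
  memcpy(a + 4, v, 4);  /* lanes 4..7 repeat the same four bytes */
}
```

An 8-byte load from `b` followed by a permute would produce the same lanes but would touch `b[4..7]`, which is the access the "gaps at the end of a load" check rejects.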