[PATCH] RFC: Extend SLP permutation optimisations

2022-08-02  Richard Sandiford via Gcc-patches
Currently SLP tries to force permute operations "down" the graph
from loads in the hope of reducing the total number of permutes
needed or (in the best case) removing the need for the permutes
entirely.  This patch tries to extend it as follows:

- Allow loads to take a different permutation from the one they
  started with, rather than choosing between "original permutation"
  and "no permutation".

- Allow changes in both directions, if the target supports the
  reverse permute operation.

- Treat the placement of permute operations as a two-way dataflow
  problem: after propagating information from leaves to roots (as now),
  propagate information back up the graph.

- Take execution frequency into account when optimising for speed,
  so that (for example) permutes inside loops have a higher cost
  than permutes outside loops.

- Try to reduce the total number of permutes when optimising for
  size, even if that increases the number of permutes on a given
  execution path.

See the big block comment above vect_optimize_slp_pass for
a detailed description.
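
As a very rough illustration of the two-way dataflow idea (a toy model for
exposition only, not the code in vect_optimize_slp_pass), the sketch below
gives each node a choice of two layouts, charges each permute the execution
frequency of the block that would contain it, accumulates costs from the
leaves towards the roots, and then commits to layouts on the way back.  The
node numbering, frequencies and two-layout restriction are invented for the
example; they loosely model the f3 loop further down.

/* A toy model of the layout-selection idea, NOT the GCC implementation:
   every node picks one of two layouts, 0 = "in memory order" and
   1 = "reversed".  Converting a value between layouts costs one permute,
   weighted by the execution frequency of the block that would contain it.
   A forward pass (leaves to roots) computes the cheapest way for each
   node to produce each layout; a backward pass (roots to leaves) then
   commits to the cheapest choices.  */

#include <stdio.h>

#define NLAYOUTS 2
#define NNODES 4

/* 0 = hoisted load of a (runs once), 1 = load of c (in the loop),
   2 = add (in the loop), 3 = sunk store to a (runs once).  */
static const double freq[NNODES] = { 1, 100, 100, 1 };

/* pred[n] lists the inputs of node n; -1 means "no input".  */
static const int pred[NNODES][2]
  = { { -1, -1 }, { -1, -1 }, { 0, 1 }, { 2, -1 } };

/* Relative to a's lane order, the load of c arrives reversed (layout 1)
   and the load of a arrives in order (layout 0); supplying the other
   layout costs a permute in the leaf's own block.  */
static double
leaf_cost (int n, int layout)
{
  int natural = (n == 1);
  return layout == natural ? 0 : freq[n];
}

int
main (void)
{
  double cost[NNODES][NLAYOUTS];
  int choice[NNODES];

  /* Forward pass: cheapest cost of making node N produce LAYOUT,
     permuting inputs in N's own block where necessary.  */
  for (int n = 0; n < NNODES; n++)
    for (int layout = 0; layout < NLAYOUTS; layout++)
      {
        double total = 0;
        int is_leaf = 1;
        for (int i = 0; i < 2; i++)
          if (pred[n][i] >= 0)
            {
              int p = pred[n][i];
              double keep = cost[p][layout];
              double convert = cost[p][1 - layout] + freq[n];
              total += keep < convert ? keep : convert;
              is_leaf = 0;
            }
        cost[n][layout] = is_leaf ? leaf_cost (n, layout) : total;
      }

  /* Backward pass: the store must write lanes in memory order, so the
     root is fixed to layout 0; each input then picks whichever layout is
     cheaper once the conversion cost in the consumer's block is added,
     and the choices propagate back towards the loads.  */
  choice[NNODES - 1] = 0;
  for (int n = NNODES - 1; n >= 0; n--)
    for (int i = 0; i < 2; i++)
      if (pred[n][i] >= 0)
        {
          int p = pred[n][i], l = choice[n];
          choice[p] = cost[p][l] <= cost[p][1 - l] + freq[n] ? l : 1 - l;
        }

  for (int n = 0; n < NNODES; n++)
    printf ("node %d: layout %d\n", n, choice[n]);
  return 0;
}

With these frequencies the backward pass keeps the reversed layout for the
loads and the add and only switches back to memory order at the store: two
permutes in once-executed blocks rather than one per loop iteration, which
matches the behaviour described for f3 below.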

A while back I posted a patch to extend the existing optimisation
to non-consecutive loads.  This patch doesn't include that change
(although it's a possible future addition).

The original motivation for doing this was to add a framework that would
allow other layout differences in future.  The two main ones are:

- Make it easier to represent predicated operations, including
  predicated operations with gaps.  E.g.:

     a[0] += 1;
     a[1] += 1;
     a[3] += 1;

  could be a single load/add/store for SVE.  We could handle this
  by representing a layout such as { 0, 1, _, 2 } or { 0, 1, _, 3 }
  (depending on what's being counted).  We might need to move
  elements between lanes at various points, like with permutes.

  (This would first mean adding support for stores with gaps.  A rough
  intrinsics sketch of the idea follows after this list.)

- Make it easier to switch between an even/odd and unpermuted layout
  when switching between wide and narrow elements.  E.g. if a widening
  operation produces an even vector and an odd vector, we should try
  to keep operations on the wide elements in that order rather than
  force them to be permuted back "in order".
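
As a hand-written sketch of the first item above (not something the patch
generates; the function name and the way the predicate is built are purely
illustrative), the a[0]/a[1]/a[3] example could in principle become a single
predicated load/add/store using SVE ACLE intrinsics.  A real implementation
would also need to confine the operation to the four-element group rather
than letting the { 1, 1, 0, 1 } pattern repeat across a longer vector:

#include <arm_sve.h>
#include <stdint.h>

/* a[0] += 1; a[1] += 1; a[3] += 1; with lane 2 left untouched.  */
void
add_one_with_gap (int32_t *a)
{
  /* Predicate with the repeating 4-lane pattern { 1, 1, 0, 1 }.  */
  svbool_t pg = svdupq_n_b32 (1, 1, 0, 1);
  svint32_t va = svld1_s32 (pg, a);   /* inactive lanes read as zero */
  va = svadd_n_s32_m (pg, va, 1);     /* add 1 to the active lanes only */
  svst1_s32 (pg, a, va);              /* inactive lanes are not written */
}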

To give some examples of what the current patch does:

int f1(int *__restrict a, int *__restrict b, int *__restrict c,
       int *__restrict d)
{
  a[0] = (b[1] << c[3]) - d[1];
  a[1] = (b[0] << c[2]) - d[0];
  a[2] = (b[3] << c[1]) - d[3];
  a[3] = (b[2] << c[0]) - d[2];
}

continues to produce the same code as before when optimising for
speed: b, c and d are permuted at load time.  But when optimising
for size we instead permute c into the same order as b+d and then
permute the result of the arithmetic into the same order as a:

        ldr     q1, [x2]
        ldr     q0, [x1]
        ext     v1.16b, v1.16b, v1.16b, #8     // <--
        sshl    v0.4s, v0.4s, v1.4s
        ldr     q1, [x3]
        sub     v0.4s, v0.4s, v1.4s
        rev64   v0.4s, v0.4s                   // <--
        str     q0, [x0]
        ret

The following function:

int f2(int *__restrict a, int *__restrict b, int *__restrict c,
       int *__restrict d)
{
  a[0] = (b[3] << c[3]) - d[3];
  a[1] = (b[2] << c[2]) - d[2];
  a[2] = (b[1] << c[1]) - d[1];
  a[3] = (b[0] << c[0]) - d[0];
}

continues to push the reverse down to just before the store,
like the current code does.

In:

int f3(int *__restrict a, int *__restrict b, int *__restrict c,
       int *__restrict d)
{
  for (int i = 0; i < 100; ++i)
    {
      a[0] = (a[0] + c[3]);
      a[1] = (a[1] + c[2]);
      a[2] = (a[2] + c[1]);
      a[3] = (a[3] + c[0]);
      c += 4;
    }
}

the loads of a are hoisted and the stores of a are sunk, so that
only the load from c happens in the loop.  When optimising for
speed, we prefer to have the loop operate on the reversed layout,
changing on entry and exit from the loop:

        mov     x3, x0
        adrp    x0, .LC0
        add     x1, x2, 1600
        ldr     q2, [x0, #:lo12:.LC0]
        ldr     q0, [x3]
        mov     v1.16b, v0.16b
        tbl     v0.16b, {v0.16b - v1.16b}, v2.16b    // <--
        .p2align 3,,7
.L6:
        ldr     q1, [x2], 16
        add     v0.4s, v0.4s, v1.4s
        cmp     x2, x1
        bne     .L6
        mov     v1.16b, v0.16b
        adrp    x0, .LC0
        ldr     q2, [x0, #:lo12:.LC0]
        tbl     v0.16b, {v0.16b - v1.16b}, v2.16b    // <--
        str     q0, [x3]
        ret

Similarly, for the very artificial testcase:

int f4(int *__restrict a, int *__restrict b, int *__restrict c,
       int *__restrict d)
{
  int a0 = a[0];
  int a1 = a[1];
  int a2 = a[2];
  int a3 = a[3];
  for (int i = 0; i < 100; ++i)
    {
      a0 ^= c[0];
      a1 ^= c[1];
      a2 ^= c[2];
      a3 ^= c[3];
      c += 4;
      for (int j = 0; j < 100; ++j)
        {
          a0 += d[1];
          a1 += d[0];
          a2 += d[3];
          a3 += d[2];
          d += 4;
        }
      b[0] = a0;
      b[1] = a1;
      b[2] = a2;