Re: [PATCH] RFC: Extend SLP permutation optimisations
Richard Biener writes:
> On Tue, 2 Aug 2022, Richard Sandiford wrote:
>> [...]
[PATCH] RFC: Extend SLP permutation optimisations
Currently SLP tries to force permute operations "down" the graph
from loads in the hope of reducing the total number of permutes
needed or (in the best case) removing the need for the permutes
entirely.  This patch tries to extend it as follows:

- Allow loads to take a different permutation from the one they
  started with, rather than choosing between "original permutation"
  and "no permutation".

- Allow changes in both directions, if the target supports the
  reverse permute operation.

- Treat the placement of permute operations as a two-way dataflow
  problem: after propagating information from leaves to roots (as now),
  propagate information back up the graph.

- Take execution frequency into account when optimising for speed,
  so that (for example) permutes inside loops have a higher cost
  than permutes outside loops.

- Try to reduce the total number of permutes when optimising for
  size, even if that increases the number of permutes on a given
  execution path.

See the big block comment above vect_optimize_slp_pass for
a detailed description.

A while back I posted a patch to extend the existing optimisation
to non-consecutive loads.  This patch doesn't include that change
(although it's a possible future addition).

The original motivation for doing this was to add a framework that would
allow other layout differences in future.  The two main ones are:

- Make it easier to represent predicated operations, including
  predicated operations with gaps.  E.g.:

    a[0] += 1;
    a[1] += 1;
    a[3] += 1;

  could be a single load/add/store for SVE.  We could handle this
  by representing a layout such as { 0, 1, _, 2 } or { 0, 1, _, 3 }
  (depending on what's being counted).  We might need to move
  elements between lanes at various points, like with permutes.

  (This would first mean adding support for stores with gaps.)

- Make it easier to switch between an even/odd and unpermuted layout
  when switching between wide and narrow elements.  E.g.
if a widening
operation produces an even vector and an odd vector, we should try
to keep operations on the wide elements in that order rather than
force them to be permuted back "in order".

To give some examples of what the current patch does:

  int f1(int *__restrict a, int *__restrict b, int *__restrict c,
         int *__restrict d)
  {
    a[0] = (b[1] << c[3]) - d[1];
    a[1] = (b[0] << c[2]) - d[0];
    a[2] = (b[3] << c[1]) - d[3];
    a[3] = (b[2] << c[0]) - d[2];
  }

continues to produce the same code as before when optimising for
speed: b, c and d are permuted at load time.  But when optimising
for size we instead permute c into the same order as b+d and then
permute the result of the arithmetic into the same order as a:

    ldr     q1, [x2]
    ldr     q0, [x1]
    ext     v1.16b, v1.16b, v1.16b, #8   // <--
    sshl    v0.4s, v0.4s, v1.4s
    ldr     q1, [x3]
    sub     v0.4s, v0.4s, v1.4s
    rev64   v0.4s, v0.4s                 // <--
    str     q0, [x0]
    ret

The following function:

  int f2(int *__restrict a, int *__restrict b, int *__restrict c,
         int *__restrict d)
  {
    a[0] = (b[3] << c[3]) - d[3];
    a[1] = (b[2] << c[2]) - d[2];
    a[2] = (b[1] << c[1]) - d[1];
    a[3] = (b[0] << c[0]) - d[0];
  }

continues to push the reverse down to just before the store,
like the current code does.

In:

  int f3(int *__restrict a, int *__restrict b, int *__restrict c,
         int *__restrict d)
  {
    for (int i = 0; i < 100; ++i)
      {
        a[0] = (a[0] + c[3]);
        a[1] = (a[1] + c[2]);
        a[2] = (a[2] + c[1]);
        a[3] = (a[3] + c[0]);
        c += 4;
      }
  }

the loads of a are hoisted and the stores of a are sunk, so that
only the load from c happens in the loop.
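The hoist/sink transformation for f3 above can be sketched in scalar C.
This is an illustrative model, not the vectoriser's output: b and d are
unused in f3, so the sketch omits them, and the function names are made up:

```c
/* f3 as written: a is loaded and stored on every iteration.  */
static void f3_ref(int *a, const int *c)
{
  for (int i = 0; i < 100; ++i)
    {
      a[0] = a[0] + c[3];
      a[1] = a[1] + c[2];
      a[2] = a[2] + c[1];
      a[3] = a[3] + c[0];
      c += 4;
    }
}

/* After hoisting the loads of a and sinking the stores: only the
   load from c remains inside the loop.  */
static void f3_hoisted(int *a, const int *c)
{
  int a0 = a[0], a1 = a[1], a2 = a[2], a3 = a[3];
  for (int i = 0; i < 100; ++i)
    {
      a0 += c[3];
      a1 += c[2];
      a2 += c[1];
      a3 += c[0];
      c += 4;
    }
  a[0] = a0; a[1] = a1; a[2] = a2; a[3] = a3;
}
```

Both versions compute the same values; the second just makes explicit
that the reversed access to c is the only per-iteration permute left.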
When optimising for speed, we prefer to have the loop operate on the
reversed layout, changing on entry and exit from the loop:

    mov     x3, x0
    adrp    x0, .LC0
    add     x1, x2, 1600
    ldr     q2, [x0, #:lo12:.LC0]
    ldr     q0, [x3]
    mov     v1.16b, v0.16b
    tbl     v0.16b, {v0.16b - v1.16b}, v2.16b   // <--
    .p2align 3,,7
  .L6:
    ldr     q1, [x2], 16
    add     v0.4s, v0.4s, v1.4s
    cmp     x2, x1
    bne     .L6
    mov     v1.16b, v0.16b
    adrp    x0, .LC0
    ldr     q2, [x0, #:lo12:.LC0]
    tbl     v0.16b, {v0.16b - v1.16b}, v2.16b   // <--
    str     q0, [x3]
    ret

Similarly, for the very artificial testcase:

  int f4(int *__restrict a, int *__restrict b, int *__restrict c,
         int *__restrict d)
  {
    int a0 = a[0];
    int a1 = a[1];
    int a2 = a[2];
    int a3 = a[3];
    for (int i = 0; i < 100; ++i)
      {
        a0 ^= c[0];
        a1 ^= c[1];
        a2 ^= c[2];
        a3 ^= c[3];
        c += 4;
        for (int j = 0; j < 100; ++j)
          {
            a0 += d[1];
            a1 += d[0];
            a2 += d[3];
            a3 += d[2];
            d += 4;
          }
        b[0] = a0;
        b[1] = a1;
        b[2] = a2;
        b[3] = a3;
        b += 4;
      }
  }
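The payoff in f4 comes from execution frequency: the inner loop consumes d
in pair-swapped order, but if the accumulator is itself kept pair-swapped,
the permutes move to the much colder outer loop (one on c, one on the
stores).  A scalar sketch of that equivalence follows; it is a model only,
with loop counts scaled down, the truncated tail of f4 completed in the
obvious way, and illustrative function names:

```c
#define OUTER 3   /* scaled down from 100 for the test */
#define INNER 4   /* scaled down from 100 */

/* f4 as written: d is consumed in pair-swapped order on every
   inner-loop iteration.  */
static void f4_ref(const int *a, int *b, const int *c, const int *d)
{
  int a0 = a[0], a1 = a[1], a2 = a[2], a3 = a[3];
  for (int i = 0; i < OUTER; ++i)
    {
      a0 ^= c[0]; a1 ^= c[1]; a2 ^= c[2]; a3 ^= c[3];
      c += 4;
      for (int j = 0; j < INNER; ++j)
        {
          a0 += d[1]; a1 += d[0]; a2 += d[3]; a3 += d[2];
          d += 4;
        }
      b[0] = a0; b[1] = a1; b[2] = a2; b[3] = a3;
      b += 4;
    }
}

/* Layout-aware model: keep the accumulator pair-swapped, i.e.
   acc_l[k] == acc[perm[k]] with perm its own inverse.  In that layout
   the inner loop consumes d unpermuted; only c and the stores, both
   in the outer loop, still need a permute.  */
static void f4_layout(const int *a, int *b, const int *c, const int *d)
{
  static const int perm[4] = { 1, 0, 3, 2 };
  int acc_l[4];
  for (int k = 0; k < 4; ++k)
    acc_l[k] = a[perm[k]];
  for (int i = 0; i < OUTER; ++i)
    {
      for (int k = 0; k < 4; ++k)
        acc_l[k] ^= c[perm[k]];        /* one permute per outer iter */
      c += 4;
      for (int j = 0; j < INNER; ++j)
        {
          for (int k = 0; k < 4; ++k)
            acc_l[k] += d[k];          /* no permute in the hot loop */
          d += 4;
        }
      for (int k = 0; k < 4; ++k)
        b[k] = acc_l[perm[k]];         /* permute back for the store */
      b += 4;
    }
}
```

The two functions produce identical stores to b, which is exactly the
equivalence the pass exploits when it weighs permute placement by
execution frequency.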