https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98772
Bug ID: 98772
Summary: Widening patterns causing missed vectorization
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: enhancement
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: joelh at gcc dot gnu.org
Target Milestone: ---
Disabling the widening patterns (widening_mult, widening_plus, widening_minus)
allows some testcases to be vectorized better. Currently a mix of scalar and
vector code is produced, because the patterns are recognized and substituted
but vectorization then fails with 'no optab'. When a pattern is recognized for
16 bytes -> 16 shorts, the use of a pair of 8 byte -> 8 short instructions is
presumed; the datatypes chosen in 'vectorizable_conversion' are 'vectype_in'
8 bytes and 'vectype_out' 8 shorts. This causes scalar code to be emitted where
these patterns were recognized.
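As an illustration of the "pair of 8 byte -> 8 short" shape described above, here is a minimal scalar model in plain C. It is only a sketch: the helper names widen_mult_8 and widen_mult_16 are hypothetical and do not correspond to anything in GCC; they just show a 16-lane widening multiply split into a low-half and a high-half 8-lane operation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one 8-lane widening multiply
   (8 x uint8_t -> 8 x uint16_t), similar in shape to a single
   8 byte -> 8 short vector instruction.  */
static void widen_mult_8 (uint16_t out[8], const uint8_t a[8],
                          const uint8_t b[8])
{
  for (size_t i = 0; i < 8; i++)
    out[i] = (uint16_t) (a[i] * b[i]);
}

/* Hypothetical model of the 16 byte -> 16 short case, expressed as
   a pair of 8-lane operations on the low and high halves.  */
static void widen_mult_16 (uint16_t out[16], const uint8_t a[16],
                           const uint8_t b[16])
{
  widen_mult_8 (out, a, b);             /* lanes 0..7  */
  widen_mult_8 (out + 8, a + 8, b + 8); /* lanes 8..15 */
}
```

Each half reads 8 bytes and writes 8 shorts, so the 16-lane operation needs both halves to cover the full input.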
For the following testcase, compiled with: gcc -O3
#include <stdint.h>
extern void wdiff (int16_t d[16], uint8_t *restrict pix1,
                   uint8_t *restrict pix2)
{
  for (int y = 0; y < 4; y++)
    {
      for (int x = 0; x < 4; x++)
        d[x + y * 4] = pix1[x] * pix2[x];
      pix1 += 16;
      pix2 += 16;
    }
}
The following output is seen: 8 of the 16 elements are processed using scalar
instructions and the other 8 using vector instructions.
wdiff:
.LFB0:
.cfi_startproc
ldrb w3, [x1, 32]
ldrb w6, [x2, 32]
ldrb w8, [x1, 33]
ldrb w5, [x2, 33]
ldrb w4, [x1, 34]
mul w3, w3, w6
ldrb w7, [x1, 35]
fmov s0, w3
ldrb w3, [x2, 34]
mul w8, w8, w5
ldrb w9, [x2, 35]
ldrb w6, [x2, 48]
ldrb w5, [x1, 49]
ins v0.h[1], w8
mul w3, w4, w3
mul w7, w7, w9
ldrb w4, [x1, 48]
ldrb w8, [x2, 49]
ldrb w9, [x2, 50]
ins v0.h[2], w3
ldrb w3, [x1, 51]
mul w6, w6, w4
ldrb w4, [x1, 50]
mul w5, w5, w8
ldrb w8, [x2, 51]
ldr d2, [x1]
ins v0.h[3], w7
ldr d1, [x2]
mul w4, w4, w9
ldr d4, [x1, 16]
ldr d3, [x2, 16]
mul w1, w3, w8
ins v0.h[4], w6
zip1 v2.2s, v2.2s, v4.2s
zip1 v1.2s, v1.2s, v3.2s
ins v0.h[5], w5
umull v1.8h, v1.8b, v2.8b
ins v0.h[6], w4
ins v0.h[7], w1
stp q1, q0, [x0]
ret
If the widening multiply pattern is disabled, e.g.:
- { vect_recog_widen_mult_pattern, "widen_mult" },
+ //{ vect_recog_widen_mult_pattern, "widen_mult" },
in tree-vect-patterns.c,
then the same testcase processes all 16 elements using vector
instructions:
wdiff:
.LFB0:
.cfi_startproc
ldr b3, [x1, 33]
ldr b2, [x2, 33]
ldr b1, [x1, 32]
ldr b0, [x2, 32]
ldr b5, [x1, 34]
ins v1.b[1], v3.b[0]
ldr b4, [x2, 34]
ins v0.b[1], v2.b[0]
ldr b3, [x1, 35]
ldr b2, [x2, 35]
ldr b19, [x1, 48]
ins v1.b[2], v5.b[0]
ldr b17, [x2, 48]
ins v0.b[2], v4.b[0]
ldr b18, [x1, 49]
ldr b16, [x2, 49]
ldr b7, [x1, 50]
ins v1.b[3], v3.b[0]
ldr b6, [x2, 50]
ins v0.b[3], v2.b[0]
ldr b5, [x1, 51]
ldr b4, [x2, 51]
ldr d3, [x1]
ins v1.b[4], v19.b[0]
ldr d2, [x2]
ins v0.b[4], v17.b[0]
ldr d19, [x1, 16]
ldr d17, [x2, 16]
ins v1.b[5], v18.b[0]
zip1 v3.2s, v3.2s, v19.2s
ins v0.b[5], v16.b[0]
zip1 v2.2s, v2.2s, v17.2s
ins v1.b[6], v7.b[0]
umull v2.8h, v2.8b, v3.8b
ins v0.b[6], v6.b[0]
ins v1.b[7], v5.b[0]
ins v0.b[7], v4.b[0]
umull v0.8h, v0.8b, v1.8b
stp q2, q0, [x0]
ret
.cfi_endproc
Note the use of 2 umull instructions.
The same can be seen for widening plus and widening minus.
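For reference, analogous testcases for the plus and minus patterns might look like the following. These are a sketch only: the function names wplus and wminus are hypothetical, the code is not taken from this report, and it simply assumes the same loop shape as wdiff with the operator changed.

```c
#include <stdint.h>

/* Hypothetical widening-plus variant of wdiff.  */
void wplus (int16_t d[16], uint8_t *restrict pix1, uint8_t *restrict pix2)
{
  for (int y = 0; y < 4; y++)
    {
      for (int x = 0; x < 4; x++)
        d[x + y * 4] = pix1[x] + pix2[x];
      pix1 += 16;
      pix2 += 16;
    }
}

/* Hypothetical widening-minus variant of wdiff.  */
void wminus (int16_t d[16], uint8_t *restrict pix1, uint8_t *restrict pix2)
{
  for (int y = 0; y < 4; y++)
    {
      for (int x = 0; x < 4; x++)
        d[x + y * 4] = pix1[x] - pix2[x];
      pix1 += 16;
      pix2 += 16;
    }
}
```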
It appears to be due to the way that vectype_in is chosen in
vectorizable_conversion.
In vectorizable_conversion (tree-vect-stmts.c:4626), vect_is_simple_use fills
the &vectype1_in parameter, which in turn is used to fill vectype_in.
During SLP vectorization, vect_is_simple_use uses the SLP tree vectype:
tree-vect-stmts.c:11369:
  if (slp_node)
    {
      slp_tree child = SLP_TREE_CHILDREN (slp_node)[operand];
      *slp_def = child;
      *vectype = SLP_TREE_VECTYPE (child);
      if (SLP_TREE_DEF_TYPE (child) == vect_internal_def)
        {
          *op = gimple_get_lhs (SLP_TREE_REPRESENTATIVE (child)->stmt);
          return vect_is_simple_use (*op, vinfo, dt, def_stmt_info_out);
        }
For 'vect' vectorization, the def_stmt_info is used instead.