[Bug tree-optimization/98772] New: Widening patterns causing missed vectorization

joelh at gcc dot gnu.org via Gcc-bugs Wed, 20 Jan 2021 07:27:03 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98772


            Bug ID: 98772
           Summary: Widening patterns causing missed vectorization
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: joelh at gcc dot gnu.org
  Target Milestone: ---

Disabling the widening patterns (widening_mult, widening_plus, widening_minus)
allows some testcases to be vectorized better. Currently a mix of scalar and
vector code is produced, because the patterns are recognized and substituted
but vectorization then fails with 'no optab'. When a pattern is recognized for
16 bytes -> 16 shorts, a pair of 8 byte -> 8 short instructions is presumed;
the datatypes chosen in 'vectorizable_conversion' are 'vectype_in' 8 bytes and
'vectype_out' 8 shorts. This causes scalar code to be emitted where these
patterns were recognized.
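
To make the "pair of 8 byte -> 8 short" description concrete, here is a minimal scalar sketch in C of a 16 byte -> 16 short widening multiply split into a low and a high half. The function names widen_mult_8 and widen_mult_16 are hypothetical and only model the arithmetic; they are not GCC internals and not the code the vectorizer emits.

```c
#include <stdint.h>

/* Scalar model of one 8 byte -> 8 short widening multiply, i.e. the
   arithmetic a single umull on .8b operands performs.  */
void widen_mult_8 (const uint8_t *a, const uint8_t *b, uint16_t *out)
{
  for (int i = 0; i < 8; i++)
    out[i] = (uint16_t) (a[i] * b[i]);   /* 255 * 255 still fits in 16 bits */
}

/* 16 bytes -> 16 shorts expressed as a low/high pair of 8 -> 8 operations.  */
void widen_mult_16 (const uint8_t *a, const uint8_t *b, uint16_t *out)
{
  widen_mult_8 (a, b, out);               /* elements 0..7  */
  widen_mult_8 (a + 8, b + 8, out + 8);   /* elements 8..15 */
}
```

In this model each half consumes 8 input bytes and produces 8 shorts, which matches the 'vectype_in' / 'vectype_out' pair described above.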


For the following testcases with: gcc -O3

#include <stdint.h>
extern void wdiff( int16_t d[16], uint8_t *restrict pix1,
                   uint8_t *restrict pix2)
{
  for( int y = 0; y < 4; y++ )
  {
    for( int x = 0; x < 4; x++ )
      d[x + y*4] = pix1[x] * pix2[x];
    pix1 += 16;
    pix2 += 16;
  }
}

The following output is seen: 8 of the 16 elements are processed using scalar
instructions and the other 8 using vector instructions (a single umull).

wdiff:
.LFB0:
        .cfi_startproc
        ldrb    w3, [x1, 32]
        ldrb    w6, [x2, 32]
        ldrb    w8, [x1, 33]
        ldrb    w5, [x2, 33]
        ldrb    w4, [x1, 34]
        mul     w3, w3, w6
        ldrb    w7, [x1, 35]
        fmov    s0, w3
        ldrb    w3, [x2, 34]
        mul     w8, w8, w5
        ldrb    w9, [x2, 35]
        ldrb    w6, [x2, 48]
        ldrb    w5, [x1, 49]
        ins     v0.h[1], w8
        mul     w3, w4, w3
        mul     w7, w7, w9
        ldrb    w4, [x1, 48]
        ldrb    w8, [x2, 49]
        ldrb    w9, [x2, 50]
        ins     v0.h[2], w3
        ldrb    w3, [x1, 51]
        mul     w6, w6, w4
        ldrb    w4, [x1, 50]
        mul     w5, w5, w8
        ldrb    w8, [x2, 51]
        ldr     d2, [x1]
        ins     v0.h[3], w7
        ldr     d1, [x2]
        mul     w4, w4, w9
        ldr     d4, [x1, 16]
        ldr     d3, [x2, 16]
        mul     w1, w3, w8
        ins     v0.h[4], w6
        zip1    v2.2s, v2.2s, v4.2s
        zip1    v1.2s, v1.2s, v3.2s
        ins     v0.h[5], w5
        umull   v1.8h, v1.8b, v2.8b
        ins     v0.h[6], w4
        ins     v0.h[7], w1
        stp     q1, q0, [x0]
        ret


If the widening multiply pattern is disabled, e.g.:

-  { vect_recog_widen_mult_pattern, "widen_mult" },
+  //{ vect_recog_widen_mult_pattern, "widen_mult" },
in tree-vect-patterns.c

then the same testcase is able to process all 16 elements using vector
instructions.

wdiff:
.LFB0:
        .cfi_startproc
        ldr     b3, [x1, 33]
        ldr     b2, [x2, 33]
        ldr     b1, [x1, 32]
        ldr     b0, [x2, 32]
        ldr     b5, [x1, 34]
        ins     v1.b[1], v3.b[0]
        ldr     b4, [x2, 34]
        ins     v0.b[1], v2.b[0]
        ldr     b3, [x1, 35]
        ldr     b2, [x2, 35]
        ldr     b19, [x1, 48]
        ins     v1.b[2], v5.b[0]
        ldr     b17, [x2, 48]
        ins     v0.b[2], v4.b[0]
        ldr     b18, [x1, 49]
        ldr     b16, [x2, 49]
        ldr     b7, [x1, 50]
        ins     v1.b[3], v3.b[0]
        ldr     b6, [x2, 50]
        ins     v0.b[3], v2.b[0]
        ldr     b5, [x1, 51]
        ldr     b4, [x2, 51]
        ldr     d3, [x1]
        ins     v1.b[4], v19.b[0]
        ldr     d2, [x2]
        ins     v0.b[4], v17.b[0]
        ldr     d19, [x1, 16]
        ldr     d17, [x2, 16]
        ins     v1.b[5], v18.b[0]
        zip1    v3.2s, v3.2s, v19.2s
        ins     v0.b[5], v16.b[0]
        zip1    v2.2s, v2.2s, v17.2s
        ins     v1.b[6], v7.b[0]
        umull   v2.8h, v2.8b, v3.8b
        ins     v0.b[6], v6.b[0]
        ins     v1.b[7], v5.b[0]
        ins     v0.b[7], v4.b[0]
        umull   v0.8h, v0.8b, v1.8b
        stp     q2, q0, [x0]
        ret
        .cfi_endproc

Note the use of two umull instructions.



The same can be seen for widening plus and widening minus.
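
The report does not include testcases for the plus and minus cases. As a sketch, assuming they mirror the multiply testcase above with only the operator changed, they might look like the following. The names wplus and wminus are hypothetical, and whether these exact functions show the same mixed scalar/vector output has not been verified here.

```c
#include <stdint.h>

/* Assumed variant of the testcase above: widening plus.  */
void wplus (int16_t d[16], uint8_t *restrict pix1, uint8_t *restrict pix2)
{
  for (int y = 0; y < 4; y++)
    {
      for (int x = 0; x < 4; x++)
        d[x + y * 4] = pix1[x] + pix2[x];
      pix1 += 16;
      pix2 += 16;
    }
}

/* Assumed variant of the testcase above: widening minus.  */
void wminus (int16_t d[16], uint8_t *restrict pix1, uint8_t *restrict pix2)
{
  for (int y = 0; y < 4; y++)
    {
      for (int x = 0; x < 4; x++)
        d[x + y * 4] = pix1[x] - pix2[x];
      pix1 += 16;
      pix2 += 16;
    }
}
```

As with the original testcase, these would be compiled with gcc -O3 and the generated assembly inspected for scalar ldrb/ins sequences next to the vector instructions.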

It appears to be due to the way that vectype_in is chosen in
vectorizable_conversion.

In vectorizable_conversion (tree-vect-stmts.c:4626), vect_is_simple_use fills
the &vectype1_in parameter, which in turn fills vectype_in.



During SLP vectorization, vect_is_simple_use uses the SLP tree vectype:

tree-vect-stmts.c:11369:
  if (slp_node)
    {
      slp_tree child = SLP_TREE_CHILDREN (slp_node)[operand];
      *slp_def = child;
      *vectype = SLP_TREE_VECTYPE (child);
      if (SLP_TREE_DEF_TYPE (child) == vect_internal_def)
        {
          *op = gimple_get_lhs (SLP_TREE_REPRESENTATIVE (child)->stmt);
          return vect_is_simple_use (*op, vinfo, dt, def_stmt_info_out);
        }



For 'vect' vectorization, the def_stmt_info is used instead.
