[Bug tree-optimization/125303] vector operation before shuffle produces unvectorized shuffle

tnfchris at gcc dot gnu.org via Gcc-bugs Mon, 18 May 2026 23:59:15 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125303


--- Comment #7 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
(In reply to [email protected] from comment #6)
> On Mon, 18 May 2026, tnfchris at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125303
> >
> > --- Comment #5 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
> > (In reply to Richard Biener from comment #2)
> >> (In reply to Drea Pinski from comment #1)
> >>> Confirmed.
> >>>
> >>> The difference comes from who does the expansion/combining.
> >>>
> >>> So vec_shuf is handled by vectorizer SLP pass.
> >>>
> >>> While vec_xor_shuf is handled by forwprop1 and veclowering pass.
> >>>
> >>> The veclowering pass does not handle:
> >>>   _15 = VEC_PERM_EXPR <_2, _2, { 4, 0, 5, 1, 6, 2, 7, 3 }>;
> >>>
> >>> in a partial wise and only handles scalar wise at this stage. Nobody has
> >>> improved it yet to try to use partial vector sizes.
> >>
> >> The idea is that the re-vectorization patches from Tamar should address 
> >> this
> >> (we'll have to trigger extra BB SLP seeds of course)
> >
> > I've been looking into this today to see what's required here, as my current
> > patches
> > allow growing of the VF but not shrinking. The extra seed is easy enough to 
> > do:
> >
> > @@ -10000,6 +10003,24 @@ vect_slp_check_for_roots (bb_vec_info bb_vinfo)
> >                }
> >            }
> >        }
> > +#if 1
> > +       else if (gimple_vdef (assign)
> > +              && VECTOR_TYPE_P (TREE_TYPE (rhs))
> > +              && ((op = optab_for_tree_code (code, TREE_TYPE (rhs),
> > optab_default)) == unknown_optab
> > +                  || !can_implement_p (op, TYPE_MODE (TREE_TYPE (rhs)))))
> > +       {
> > +         vec<stmt_vec_info> roots = vNULL;
> > +         auto stmt_vinfo = bb_vinfo->lookup_stmt (assign);
> > +         roots.safe_push (stmt_vinfo);
> > +         vec<stmt_vec_info> stmts;
> > +         auto num_entries = TYPE_VECTOR_SUBPARTS (TREE_TYPE 
> > (rhs)).to_constant
> > ();
> > +         stmts.create (num_entries);
> > +         for (int i = 0; i < num_entries; i++)
> > +           stmts.quick_push (stmt_vinfo);
> > +         bb_vinfo->roots.safe_push (slp_root (slp_inst_kind_store,
> > +                                              stmts, roots));
> > +       }
> > +#endif
> >     }
> > }
> >
> > Which triggers a change in data_ref analysis since we don't allow loads 
> > groups
> > with zero step during BB vectorization.
> > Updating that hits the largest road block and as expected that's build_slp
> > analysis.
> >
> > Because the group size is larger than any supported vector type it fails.
> > I changed it to pick the largest possible vectype and then if analysis 
> > succeeds
> > would set unroll fact.
> 
> I'm a bit confused, it should be fine to pick a smaller vector type if the
> group size is a multiple of it.  The unroll factor (VF) is always 1 in
> BB vectorization, but yes, "ncopies" can be > 1 iff the vector type is
> smaller than the group size.
> 

Yeah I came to the same conclusion and did that below.  Which I think failed
because I hacked it in and the query for vectype was inconsistent.


> >
> > However there are way too many things doing analysis as we expected so it 
> > still
> > fails:
> >
> > note:   === vect_analyze_data_ref_accesses ===
> > note:   === vect_determine_precisions ===
> > note:   === vect_pattern_recog ===
> > note:   === vect_analyze_slp ===
> > note:   Starting SLP discovery for
> > note:     *x_5(D) = _15;
> > note:     *x_5(D) = _15;
> > note:     *x_5(D) = _15;
> > note:     *x_5(D) = _15;
> > note:     *x_5(D) = _15;
> > note:     *x_5(D) = _15;
> > note:     *x_5(D) = _15;
> > note:     *x_5(D) = _15;
> > note:   starting SLP discovery for node 0x5eba780
> > note:   get vectype for scalar type (group size 8): unsigned int
> > note:   vectype: vector(4) unsigned int
> > note:   get vectype for smallest scalar type: vec256
> > note:   nunits vectype: vector(4) unsigned int
> > note:   nunits = 4
> > note:   Build SLP for *x_5(D) = _15;
> > note:   Build SLP for *x_5(D) = _15;
> > note:   Build SLP for *x_5(D) = _15;
> > note:   Build SLP for *x_5(D) = _15;
> > note:   Build SLP for *x_5(D) = _15;
> > note:   Build SLP for *x_5(D) = _15;
> > note:   Build SLP for *x_5(D) = _15;
> > note:   Build SLP for *x_5(D) = _15;
> > note:   SLP discovery for node 0x5eba780 failed
> > note:   SLP discovery failed
> >
> > So I'm not sure what the cleanest solution is he.. hmm maybe I'm looking at
> > this wrong...
> >
> > Since the tree is supposed to already be vectorized, perhaps I should 
> > instead
> > build the initial
> > groups using the largest supported group size and then set the unroll based 
> > on
> > that..
> 
> I think the group size has to match the vector type used in the IL
> (or be larger if we combine two vector typed operations).  But as I didn't
> yet see your re-vectorization patches I can only guess what you did to
> representation in the SLP graph ...
> 

Yeah the design is actually quite simple, which is why I was checking if it
would
adapt to the above. I'm cleaning it up now and will send an RFC end of next
week (bank
holidays this and next week).

I needed to get enough done to see how to change if-convert etc so I can
explain
the scope we need.  I think I have enough now.


> 
> create lanes according to the vector type in the IL.  Rely on
> "ncopies" for code generation.  It should work the same as you'd
> have scalar code with 16 lanes but vector type with 4 lanes.
> 
> > Note that trying to change unrolling so it matches the old VF seems to 
> > trigger
> > node splitting:
> >
> > note:   SLP discovery succeeded but node needs splitting
> >
> > which then ICEs.  But haven't looked into that yet in case this is a bad
> > approach.
> 
> That said - I think we can followup with the details but should make
> sure the design of re-vectorization is sound to handle this case since
> IMO it's trivially the "same".
> 

Ack, I'll send up a write up and RFC next week, I think I know how things
should
be handled. Well everything but data-ref analysis which I think needs a
target hook for describing intrinsics loads.

Thanks,
Tamar

> Richard.

[Bug tree-optimization/125303] vector operation before shuffle produces unvectorized shuffle

Reply via email to