https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125567

--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Zhongyao Chen from comment #6)
> Created attachment 64621 [details]
> proposed patch
> 
> Proposed patch attached. If it looks good, I can post it to the mailing list.
> 
> My local tests show no regression for vect.exp, only a few for rvv.exp, 
> but those are reasonable, just need update test expectations.

Yes, something like that was what I had in mind.  It might in theory result
in less BB vectorization in the case we reject a scalar build at a distance > 1
from the operation we can swap operands in case the swap doesn't help but we'd
fail at a shorter distance then and thus end up with less vector operations and
earlier scalar build.  This is because we generally do not fully explore the
search space.  It's going to be "bad luck".  One could argue the testcase at
hand is just "bad luck" as well, of course.

That said, I don't quite understand vect_slp_commutative_retry_depth which
you increment on seeing a swappable operation, but all that matter is whether
there's one somewhere up?  So if we were to make this a boolean,

 bool old = swappable_op_upthread;
..
 if (can_swap_nonmatching)
   swappable_op_upthread = true;
 child = vect_build_slp_tree (...);
 swappable_op_upthread = old;

the behavior would be identical?  If we want to track a depth I'd say we
would want to track the distance to the most upthread swappable operation?
Or to the least upthread swappable operation?  In case we want to complicate
the heuristic because of the consideration raised above?  Aka, sth like

+         /* Skip building vector operands from scalars while operand
+            discovery may still be fixed by retrying with swapped operands. 
*/
+         && (least_upthread_swappable_operation_distance < 3
+             /* A first scalar stmt mismatch signals a fatal mismatch
+                that the parent commutative retry cannot recover.  */
+             || !matches[0])

with least_upthread_swappable_operation_distance being -1U when there is none
such?

But tracking the number of swappable operations upthread isn't of much
heuristic use?  Generally you'll easily run into the SLP discovery limit
if you re-visit too much of the SLP graph given the budget for BB
vectorization is in the order of the number of scalar stmts, so we only
get extra budget due to group_size being > 1.  When doing the swapping
runs into the discovery limit the result will be bad.  So maybe even
only consider least_upthread_swappable_operation_distance == 1 (does this
help in your case?)

Reply via email to