https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125567
--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> --- (In reply to Zhongyao Chen from comment #6) > Created attachment 64621 [details] > proposed patch > > Proposed patch attached. If it looks good, I can post it to the mailing list. > > My local tests show no regression for vect.exp, only a few for rvv.exp, > but those are reasonable, just need update test expectations. Yes, something like that was what I had in mind. It might in theory result in less BB vectorization in the case we reject a scalar build at a distance > 1 from the operation we can swap operands in case the swap doesn't help but we'd fail at a shorter distance then and thus end up with less vector operations and earlier scalar build. This is because we generally do not fully explore the search space. It's going to be "bad luck". One could argue the testcase at hand is just "bad luck" as well, of course. That said, I don't quite understand vect_slp_commutative_retry_depth which you increment on seeing a swappable operation, but all that matter is whether there's one somewhere up? So if we were to make this a boolean, bool old = swappable_op_upthread; .. if (can_swap_nonmatching) swappable_op_upthread = true; child = vect_build_slp_tree (...); swappable_op_upthread = old; the behavior would be identical? If we want to track a depth I'd say we would want to track the distance to the most upthread swappable operation? Or to the least upthread swappable operation? In case we want to complicate the heuristic because of the consideration raised above? Aka, sth like + /* Skip building vector operands from scalars while operand + discovery may still be fixed by retrying with swapped operands. */ + && (least_upthread_swappable_operation_distance < 3 + /* A first scalar stmt mismatch signals a fatal mismatch + that the parent commutative retry cannot recover. */ + || !matches[0]) with least_upthread_swappable_operation_distance being -1U when there is none such? But tracking the number of swappable operations upthread isn't of much heuristic use? Generally you'll easily run into the SLP discovery limit if you re-visit too much of the SLP graph given the budget for BB vectorization is in the order of the number of scalar stmts, so we only get extra budget due to group_size being > 1. When doing the swapping runs into the discovery limit the result will be bad. So maybe even only consider least_upthread_swappable_operation_distance == 1 (does this help in your case?)
