https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120687
--- Comment #11 from Richard Biener <rguenth at gcc dot gnu.org> ---
So I have a patch now that works on x86, but with RVV I see
t.c:5:12: note: node 0x319f58a0 (max_nunits=4, refcnt=2) vector([4,4]) int
t.c:5:12: note: op template: _14 = MEM[(int *)p_25 + 28B];
t.c:5:12: note: stmt 0 _14 = MEM[(int *)p_25 + 28B];
t.c:5:12: note: stmt 1 _12 = MEM[(int *)p_25 + 24B];
t.c:5:12: note: stmt 2 _10 = MEM[(int *)p_25 + 20B];
t.c:5:12: note: stmt 3 _8 = MEM[(int *)p_25 + 16B];
t.c:5:12: note: stmt 4 _6 = MEM[(int *)p_25 + 12B];
t.c:5:12: note: stmt 5 _4 = MEM[(int *)p_25 + 8B];
t.c:5:12: note: stmt 6 _1 = *p_25;
t.c:5:12: note: stmt 7 _2 = MEM[(int *)p_25 + 4B];
t.c:5:12: note: load permutation { 7 6 5 4 3 2 0 1 }
...
t.c:5:12: note: SLP optimize permutations:
t.c:5:12: note: 1: { 7, 6, 5, 4, 3, 2, 0, 1 }
t.c:5:12: note: SLP optimize partitions:
t.c:5:12: note: -------------
t.c:5:12: note: partition 0 (layout 0):
t.c:5:12: note: nodes:
t.c:5:12: note: - 0x319f58a0:
t.c:5:12: note: weight: 8.090909
t.c:5:12: note: out weight: 8.090909 (degree 1)
t.c:5:12: note: op template: _14 = MEM[(int *)p_25 + 28B];
t.c:5:12: note: edges:
t.c:5:12: note: - 0x319f58a0 --> [1] 0x319f5a00
t.c:5:12: note: layout 0: (*)
t.c:5:12: note: {depth: 0.000000, total: 0.000000}
t.c:5:12: note: + {depth: 8.090909, total: 8.090909}
t.c:5:12: note: + {depth: 0.000000, total: 0.000000}
t.c:5:12: note: = {depth: 8.090909, total: 8.090909}
t.c:5:12: note: layout 1: rejected
t.c:5:12: note: -------------
t.c:5:12: note: partition 1 (layout 0):
t.c:5:12: note: nodes:
t.c:5:12: note: - 0x319f5a00:
t.c:5:12: note: weight: 8.090909
t.c:5:12: note: op template: sum_21 = _15 + sum_26;
t.c:5:12: note: - 0x319f5950:
t.c:5:12: note: weight: 8.090909
t.c:5:12: note: op template: sum_26 = PHI <sum_21(6), 0(5)>
t.c:5:12: note: edges:
t.c:5:12: note: - 0x319f58a0 [0] --> 0x319f5a00
t.c:5:12: note: layout 0: (*)
t.c:5:12: note: {depth: 8.090909, total: 8.090909}
t.c:5:12: note: + {depth: 0.000000, total: 0.000000}
t.c:5:12: note: + {depth: 0.000000, total: 0.000000}
t.c:5:12: note: = {depth: 8.090909, total: 8.090909}
t.c:5:12: note: layout 1: rejected
...
t.c:5:12: note: ==> examining statement: _14 = MEM[(int *)p_25 + 28B];
t.c:5:12: missed: permutation not supported, using elementwise access
...
t.c:5:12: note: re-trying with single-lane SLP
so the permute optimization on RVV cannot elide the load permutation.
It seems this is because vect_optimize_slp_pass::change_layout_cost
in one way or another locally computes whether the weird permutation
can be code generated. This already happens during the forward pass:
it seems we require the permute to be materializable at each node, instead
of only at the final point where it can be absorbed by the reduction
operation itself or, in the more general case, by another permute that
turns it into something supportable? I guess this is what the comment
already says:
  /* Reject the layout if it would make layout 0 impossible
     for later partitions.  This amounts to testing that the
     target supports reversing the layout change on edges
     to later partitions.

     In principle, it might be possible to push a layout
     change all the way down a graph, so that it never
     needs to be reversed and so that the target doesn't
     need to support the reverse operation.  But it would
     be awkward to bail out if we hit a partition that
     does not support the new layout, especially since
     we are not dealing with a lattice.  */
  is_possible &= edge_layout_cost (ud, other_node_i, 0,
                                   layout_i).is_possible ();
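
To illustrate why the reduction should be able to absorb the load
permutation from the dump above, here is a minimal standalone sketch. It is
not GCC code: apply_load_permutation, reduce_add and dump_permutation are
hypothetical names made up for this example, and the sketch assumes that
lane i of the permuted vector holds group element perm[i], which matches
stmt 0 loading from p_25 + 28B and stmt 6 loading *p_25.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <numeric>

// Hypothetical helper, not a GCC function: model an SLP load permutation
// by placing group element perm[i] into lane i of the result.
template <std::size_t N>
std::array<int, N>
apply_load_permutation (const std::array<int, N> &in,
                        const std::array<std::size_t, N> &perm)
{
  std::array<int, N> out{};
  for (std::size_t i = 0; i < N; ++i)
    {
      assert (perm[i] < N);  // every index must name a valid group element
      out[i] = in[perm[i]];
    }
  return out;
}

// Hypothetical helper: an add reduction over all lanes.  Integer addition
// is commutative and associative, so the lane order does not matter.
template <std::size_t N>
int
reduce_add (const std::array<int, N> &v)
{
  return std::accumulate (v.begin (), v.end (), 0);
}

// The load permutation reported in the dump: { 7 6 5 4 3 2 0 1 }.
const std::array<std::size_t, 8> dump_permutation
  = {{7, 6, 5, 4, 3, 2, 0, 1}};
```

For any input, reduce_add (apply_load_permutation (mem, dump_permutation))
equals reduce_add (mem), so the permute never has to be materialized if the
layout change can be carried down to the reduction. Whether the target can
generate code for the permute at an intermediate node only matters when the
layout has to be reversed there.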