[Bug tree-optimization/120687] RISC-V: very poor vector code gen for LMbench bw_mem test case

rguenth at gcc dot gnu.org via Gcc-bugs Mon, 20 Oct 2025 06:47:46 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120687


--- Comment #11 from Richard Biener <rguenth at gcc dot gnu.org> ---
So I have a patch now that works on x86, but with RVV I see

t.c:5:12: note:   node 0x319f58a0 (max_nunits=4, refcnt=2) vector([4,4]) int
t.c:5:12: note:   op template: _14 = MEM[(int *)p_25 + 28B];
t.c:5:12: note:         stmt 0 _14 = MEM[(int *)p_25 + 28B];
t.c:5:12: note:         stmt 1 _12 = MEM[(int *)p_25 + 24B];
t.c:5:12: note:         stmt 2 _10 = MEM[(int *)p_25 + 20B];
t.c:5:12: note:         stmt 3 _8 = MEM[(int *)p_25 + 16B];
t.c:5:12: note:         stmt 4 _6 = MEM[(int *)p_25 + 12B];
t.c:5:12: note:         stmt 5 _4 = MEM[(int *)p_25 + 8B];
t.c:5:12: note:         stmt 6 _1 = *p_25; 
t.c:5:12: note:         stmt 7 _2 = MEM[(int *)p_25 + 4B];
t.c:5:12: note:         load permutation { 7 6 5 4 3 2 0 1 }
...
t.c:5:12: note:  SLP optimize permutations:
t.c:5:12: note:    1: { 7, 6, 5, 4, 3, 2, 0, 1 }
t.c:5:12: note:  SLP optimize partitions:
t.c:5:12: note:    -------------
t.c:5:12: note:    partition 0 (layout 0):
t.c:5:12: note:      nodes:
t.c:5:12: note:        - 0x319f58a0:
t.c:5:12: note:            weight: 8.090909
t.c:5:12: note:            out weight: 8.090909 (degree 1)
t.c:5:12: note:            op template: _14 = MEM[(int *)p_25 + 28B];
t.c:5:12: note:      edges:
t.c:5:12: note:        - 0x319f58a0 --> [1] 0x319f5a00
t.c:5:12: note:      layout 0: (*)
t.c:5:12: note:          {depth: 0.000000, total: 0.000000}
t.c:5:12: note:        + {depth: 8.090909, total: 8.090909}
t.c:5:12: note:        + {depth: 0.000000, total: 0.000000}
t.c:5:12: note:        = {depth: 8.090909, total: 8.090909}
t.c:5:12: note:      layout 1: rejected
t.c:5:12: note:    -------------
t.c:5:12: note:    partition 1 (layout 0):
t.c:5:12: note:      nodes:
t.c:5:12: note:        - 0x319f5a00:
t.c:5:12: note:            weight: 8.090909
t.c:5:12: note:            op template: sum_21 = _15 + sum_26;
t.c:5:12: note:        - 0x319f5950:
t.c:5:12: note:            weight: 8.090909
t.c:5:12: note:            op template: sum_26 = PHI <sum_21(6), 0(5)>
t.c:5:12: note:      edges:
t.c:5:12: note:        - 0x319f58a0 [0] --> 0x319f5a00
t.c:5:12: note:      layout 0: (*)
t.c:5:12: note:          {depth: 8.090909, total: 8.090909}
t.c:5:12: note:        + {depth: 0.000000, total: 0.000000}
t.c:5:12: note:        + {depth: 0.000000, total: 0.000000}
t.c:5:12: note:        = {depth: 8.090909, total: 8.090909}
t.c:5:12: note:      layout 1: rejected
...
t.c:5:12: note:   ==> examining statement: _14 = MEM[(int *)p_25 + 28B];
t.c:5:12: missed:   permutation not supported, using elementwise access
...
t.c:5:12: note:  re-trying with single-lane SLP

so permute optimization on RVV cannot elide the load permutation.
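
To make the dump concrete, here is a small C sketch (not part of the bug report or of GCC) that recomputes the load permutation from the byte offsets of the eight loads and checks that a plain sum over those lanes does not depend on lane order. The offsets are copied from the dump above; the helper names and the 4-byte element size are assumptions made for illustration.

```c
#define LANES 8
#define ELEM_SIZE 4   /* Assumed size of int, matching the 4B steps in the dump.  */

/* Byte offsets of the loads in SLP statement order (stmt 0..7),
   copied from the dump.  */
static const int byte_offset[LANES] = { 28, 24, 20, 16, 12, 8, 0, 4 };

/* Lane i of the SLP node reads memory element byte_offset[i] / ELEM_SIZE,
   which gives the load permutation.  */
static void
load_permutation (int perm[LANES])
{
  for (int i = 0; i < LANES; i++)
    perm[i] = byte_offset[i] / ELEM_SIZE;
}

/* Sum the eight elements in the lane order given by PERM.  */
static int
sum_permuted (const int *p, const int perm[LANES])
{
  int sum = 0;
  for (int i = 0; i < LANES; i++)
    sum += p[perm[i]];
  return sum;
}

/* Sum the eight elements in memory order, i.e. without any permutation.  */
static int
sum_in_order (const int *p)
{
  int sum = 0;
  for (int i = 0; i < LANES; i++)
    sum += p[i];
  return sum;
}
```

load_permutation fills perm with { 7, 6, 5, 4, 3, 2, 0, 1 }, and for p = { 1, 2, ..., 8 } both sum functions return 36. That order independence is what would let a reduction absorb the permutation instead of materializing it.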

This seems to be because vect_optimize_slp_pass::change_layout_cost
computes locally, in one way or another, whether the odd permutation
can be code generated.  That already happens during the forward pass:
it seems we require the permute to be materializable at each node, rather
than only at the final point where it can be absorbed by the reduction
operation itself or, in the more general case, by another permute that
turns it into something supportable?  I guess this is what the comment
already says:

                    /* Reject the layout if it would make layout 0 impossible
                       for later partitions.  This amounts to testing that the
                       target supports reversing the layout change on edges
                       to later partitions.

                       In principle, it might be possible to push a layout
                       change all the way down a graph, so that it never
                       needs to be reversed and so that the target doesn't
                       need to support the reverse operation.  But it would
                       be awkward to bail out if we hit a partition that
                       does not support the new layout, especially since
                       we are not dealing with a lattice.  */
                    is_possible &= edge_layout_cost (ud, other_node_i, 0,
                                                     layout_i).is_possible ();
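
As a toy illustration of the difference between the two strategies (this is not GCC code, and every name in it is hypothetical), the sketch below contrasts the local rule from the quoted check with a speculative push-down rule along a simple chain of later partitions. It assumes each later partition can be described by two flags, which is far simpler than the real partition graph.

```c
#include <stdbool.h>

/* Hypothetical summary of one later partition in a linear chain.  */
struct toy_partition
{
  bool supports_layout;     /* Can operate on the candidate layout.  */
  bool absorbs_any_layout;  /* E.g. a reduction: lane order is irrelevant.  */
};

/* Local rule, loosely mirroring the quoted check: the candidate layout is
   possible only if the target can reverse the layout change on the edge
   to the next partition.  */
bool
local_rule (bool target_can_reverse)
{
  return target_can_reverse;
}

/* Speculative push-down rule: if the change cannot be reversed, walk the
   later partitions and accept the layout only if an absorbing partition
   is reached before one that does not support the layout.  */
bool
pushdown_rule (const struct toy_partition *later, int n,
               bool target_can_reverse)
{
  if (target_can_reverse)
    return true;
  for (int i = 0; i < n; i++)
    {
      if (!later[i].supports_layout)
        return false;   /* Would need a reversal here: bail out.  */
      if (later[i].absorbs_any_layout)
        return true;    /* Layout change never needs to be reversed.  */
    }
  return false;         /* Reached the end without an absorber.  */
}
```

With a single later partition that both supports and absorbs the layout (roughly the reduction in partition 1 of the dump) and a target that cannot reverse the permute, local_rule rejects the layout while pushdown_rule accepts it. The early return on an unsupporting partition is the "awkward to bail out" case the quoted comment mentions.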
