https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101296
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Samples: 884K of event 'cycles:u', Event count (approx.): 967510000841

  Overhead  Samples  Command          Shared Object             Symbol
    13.76%   119196  milc_peak.amd64  milc_peak.amd64-m64-mine  [.] u_shift_fermion
    10.08%    87085  milc_base.amd64  milc_base.amd64-m64-mine  [.] add_force_to_mom
     9.93%    85891  milc_base.amd64  milc_base.amd64-m64-mine  [.] u_shift_fermion
     9.38%    81331  milc_peak.amd64  milc_peak.amd64-m64-mine  [.] add_force_to_mom
     9.03%    82570  milc_peak.amd64  milc_peak.amd64-m64-mine  [.] mult_su3_na
     8.55%    77803  milc_base.amd64  milc_base.amd64-m64-mine  [.] mult_su3_na
     7.41%    65641  milc_peak.amd64  milc_peak.amd64-m64-mine  [.] mult_su3_nn
     6.26%    55314  milc_base.amd64  milc_base.amd64-m64-mine  [.] mult_su3_nn
     1.48%    12876  milc_peak.amd64  milc_peak.amd64-m64-mine  [.] mult_su3_an
     1.42%    12625  milc_base.amd64  milc_base.amd64-m64-mine  [.] imp_gauge_force.constprop.0
     1.18%    10602  milc_peak.amd64  milc_peak.amd64-m64-mine  [.] imp_gauge_force.constprop.0
     1.00%     8853  milc_base.amd64  milc_base.amd64-m64-mine  [.] mult_su3_mat_vec_sum_4dir
     0.94%     8343  milc_peak.amd64  milc_peak.amd64-m64-mine  [.] mult_su3_mat_vec_sum_4dir
     0.94%     8156  milc_base.amd64  milc_base.amd64-m64-mine  [.] mult_su3_an

The odd thing is that, for example, mult_su3_an reports a vastly different
number of cycles even though the assembly is 1:1 identical.  There are in
total 16 vaddsubpd instructions in the new variant, in the symbols
add_force_to_mom (1) and mult_su3_nn (15), but that doesn't explain the
difference seen above.  More ADDSUB patterns are detected but they do not
materialize in the end; still there's some effect on RA and scheduling in
functions like u_shift_fermion.  The vectorizer dumps do not reveal anything
interesting for this example either.
I was using the following to disable the added pattern:

diff --git a/gcc/tree-vect-slp-patterns.c b/gcc/tree-vect-slp-patterns.c
index 2671f91972d..388b185dc7b 100644
--- a/gcc/tree-vect-slp-patterns.c
+++ b/gcc/tree-vect-slp-patterns.c
@@ -1510,7 +1510,7 @@ addsub_pattern::recognize (slp_tree_to_load_perm_map_t *, slp_tree *node_)
 {
   slp_tree node = *node_;
   if (SLP_TREE_CODE (node) != VEC_PERM_EXPR
-      || SLP_TREE_CHILDREN (node).length () != 2)
+      || SLP_TREE_CHILDREN (node).length () != 2 || 1)
     return NULL;

   /* Match a blend of a plus and a minus op with the same number of plus and

To sum up - I have no idea why performance has regressed.