https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91154
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|needs-bisection             |
             Status|ASSIGNED                    |NEW
          Component|tree-optimization           |rtl-optimization
           Assignee|rguenth at gcc dot gnu.org  |unassigned at gcc dot gnu.org
            Summary|[10 Regression] 456.hmmer   |[10 Regression] 456.hmmer
                   |regression on Haswell       |regression on Haswell
                   |                            |caused by r272922

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
For the Jul 1st regression the DOM change causes us to recognize MIN/MAX
patterns way earlier (they are exposed by conditional store elimination).
While before the change we had to if-convert 5 loops of P7Viterbi, after
the change only one remains.  This naturally affects the scalar fallback.

perf shows (base is r272921, peak is r272922)

  41.37%        611991  hmmer_peak.amd6  hmmer_peak.amd64-m64-gcc42-nn  [.] P7Viterbi
  29.84%        441484  hmmer_base.amd6  hmmer_base.amd64-m64-gcc42-nn  [.] P7Viterbi
  15.32%        226646  hmmer_base.amd6  hmmer_base.amd64-m64-gcc42-nn  [.] P7Viterbi.cold
   6.74%         99771  hmmer_peak.amd6  hmmer_peak.amd64-m64-gcc42-nn  [.] P7Viterbi.cold

so the cold part is way faster but somehow the hot part quite a bit slower,
I suspect profile changes in the end, sums are 668130 (base) vs. 711762
(peak) samples.

-fopt-info-loop doesn't show any changes, vectorizer cost-model estimates
(and thus runtime niter checks) are the same.

Note the Jul 1st regression is in the range PR90911 was present which
should be the observed Jun 13th regression (r272239).

perf points to a sequence of two cmovs being slow compared to one cmov
and a conditional branch.

Fast:

       │          dc[k] = dc[k-1] + tpdd[k-1];
  7.42 │ fd0:   mov    -0x80(%rbp),%rbx
  0.08 │        add    -0x8(%rbx,%rax,4),%ecx
       │          if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
  0.24 │        mov    -0xa0(%rbp),%rbx
       │          dc[k] = dc[k-1] + tpdd[k-1];
  0.07 │        mov    %ecx,-0x4(%rdx,%rax,4)
       │          if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
  7.41 │        mov    -0x8(%r15,%rax,4),%esi
  2.69 │        add    -0x8(%rbx,%rax,4),%esi
  3.31 │        cmp    %ecx,%esi
  2.85 │        cmovge %esi,%ecx
 12.17 │        mov    %ecx,-0x4(%rdx,%rax,4)
       │          if (dc[k] < -INFTY) dc[k] = -INFTY;
  7.43 │        cmp    $0xc521974f,%ecx
       │        jge    1060
  0.01 │        movl   $0xc521974f,-0x4(%rdx,%rax,4)
       │          for (k = 1; k <= M; k++) {
  0.02 │        mov    %eax,%esi
  0.02 │        inc    %rax
  0.01 │        cmp    %rax,%rdi
       │        je     106e
       │          if (dc[k] < -INFTY) dc[k] = -INFTY;
  0.01 │        mov    $0xc521974f,%ecx
  0.01 │        jmp    fd0
...
  0.00 │1060:   mov    %eax,%esi
  0.03 │        inc    %rax
  0.00 │        cmp    %rdi,%rax
  0.00 │        jne    fd0
  0.06 │106e:   cmp    -0x64(%rbp),%esi
       │        jg     1508

slow:

       │          dc[k] = dc[k-1] + tpdd[k-1];
  0.06 │13b0:   add    -0x8(%r9,%rcx,4),%eax
  0.07 │        mov    %eax,-0x4(%r13,%rcx,4)
       │          if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
  6.81 │        mov    -0x8(%r8,%rcx,4),%esi
  0.04 │        add    -0x8(%rdx,%rcx,4),%esi
  1.46 │        cmp    %eax,%esi
  0.29 │        cmovge %esi,%eax
       │          for (k = 1; k <= M; k++) {
 13.43 │        mov    %ecx,%esi
  0.05 │        cmp    $0xc521974f,%eax
  6.78 │        cmovl  %ebx,%eax
 13.53 │        mov    %eax,-0x4(%r13,%rcx,4)
  6.81 │        inc    %rcx
  0.03 │        cmp    %rcx,%rdi
       │        jne    13b0

where on GIMPLE we have mostly MAX_EXPR <..., -987654321> for the slow
case and a conditional statement in the fast case.  Of course(?) a jump
should be well predicted (and can be speculated [correctly]) while the
data dependence on the earlier access is not.

Thus in the end this is a RTL expansion / if-conversion or backend
costing issue.  The following heuristic^Whack fixes the regression for me:

Index: gcc/expr.c
===================================================================
--- gcc/expr.c  (revision 272922)
+++ gcc/expr.c  (working copy)
@@ -9159,7 +9159,9 @@ expand_expr_real_2 (sepops ops, rtx targ
          }
 
        /* Use a conditional move if possible.  */
-       if (can_conditionally_move_p (mode))
+       if ((TREE_CODE (treeop1) != INTEGER_CST
+            || !wi::eq_p (wi::to_widest (treeop1), -987654321))
+           && can_conditionally_move_p (mode))
          {
            rtx insn;
 
Index: gcc/ifcvt.c
===================================================================
--- gcc/ifcvt.c (revision 272922)
+++ gcc/ifcvt.c (working copy)
@@ -3570,6 +3570,13 @@ noce_process_if_block (struct noce_if_in
        which do a good enough job these days.  */
     return FALSE;
 
+  rtx_insn *earliest;
+  cond = noce_get_alt_condition (if_info, if_info->a, &earliest);
+  if (cond
+      && CONST_INT_P (XEXP (cond, 1))
+      && INTVAL (XEXP (cond, 1)) == -987654321)
+    return FALSE;
+
   if (noce_try_move (if_info))
     goto success;
   if (noce_try_ifelse_collapse (if_info))
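
For readers without the 456.hmmer sources at hand, the two shapes compared in the perf listings can be sketched at the source level as below.  This is only an illustration under stated assumptions: the loop body is taken from the source lines in the perf annotation and INFTY is assumed to be 987654321 (the constant the patch tests for); the function names dc_row_branch, dc_row_max, max_int and dc_variants_agree are made up for this sketch and do not exist in hmmer or GCC.  Which form a given compiler turns into a cmov or a branch is up to its if-conversion and costing, so the sketch shows the semantics only, not a guaranteed code-generation outcome.

```c
#include <assert.h>
#include <string.h>

/* Assumed value; -INFTY is the 0xc521974f immediate seen in the listings. */
#define INFTY 987654321

/* Hypothetical helper: the source-level equivalent of a MAX_EXPR. */
int max_int(int a, int b) { return a > b ? a : b; }

/* Branchy form, as written in the annotated source: the clamp against
   -INFTY is a compare followed by a conditional store. */
void dc_row_branch(int *dc, const int *mc, const int *tpdd,
                   const int *tpmd, int M)
{
    int sc;
    assert(M >= 0);
    for (int k = 1; k <= M; k++) {
        dc[k] = dc[k-1] + tpdd[k-1];
        if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;
        if (dc[k] < -INFTY) dc[k] = -INFTY;
    }
}

/* MAX form: both updates are expressed as max operations, so dc[k]
   depends on dc[k-1] through two back-to-back selects. */
void dc_row_max(int *dc, const int *mc, const int *tpdd,
                const int *tpmd, int M)
{
    assert(M >= 0);
    for (int k = 1; k <= M; k++) {
        int v = max_int(dc[k-1] + tpdd[k-1], mc[k-1] + tpmd[k-1]);
        dc[k] = max_int(v, -INFTY);
    }
}

/* Returns 1 if both forms compute the same row on a small made-up input
   that exercises the clamp (k = 1) and the non-clamped path (k > 1). */
int dc_variants_agree(void)
{
    enum { N = 6 };
    const int mc[N + 1]   = { -INFTY, 5, -INFTY, 40, -3, 7, 0 };
    const int tpdd[N + 1] = { -2, -INFTY, -1, -4, -INFTY, -2, 0 };
    const int tpmd[N + 1] = { -1, -3, -INFTY, -2, -5, -1, 0 };
    int a[N + 1] = { -INFTY };
    int b[N + 1] = { -INFTY };

    dc_row_branch(a, mc, tpdd, tpmd, N);
    dc_row_max(b, mc, tpdd, tpmd, N);
    return memcmp(a, b, sizeof a) == 0;
}
```

Both functions compute the same values; the difference discussed above is purely in how the loop-carried dependence on dc[k-1] is resolved, via a predictable branch in the first form or via a chain of selects in the second.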