[Bug rtl-optimization/91154] [10 Regression] 456.hmmer regression on Haswell caused by r272922

rguenth at gcc dot gnu.org Wed, 17 Jul 2019 03:06:50 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91154


Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|needs-bisection             |
             Status|ASSIGNED                    |NEW
          Component|tree-optimization           |rtl-optimization
           Assignee|rguenth at gcc dot gnu.org  |unassigned at gcc dot gnu.org
            Summary|[10 Regression] 456.hmmer   |[10 Regression] 456.hmmer
                   |regression on Haswell       |regression on Haswell
                   |                            |caused by r272922

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
For the Jul 1st regression, the DOM change causes us to recognize MIN/MAX
patterns much earlier (they are exposed by conditional store elimination).
Before the change we had to if-convert 5 loops of P7Viterbi; after the change
only one remains.  This naturally affects the scalar fallback.  perf shows
(base is r272921, peak is r272922):

  41.37%        611991  hmmer_peak.amd6  hmmer_peak.amd64-m64-gcc42-nn  [.] P7Viterbi
  29.84%        441484  hmmer_base.amd6  hmmer_base.amd64-m64-gcc42-nn  [.] P7Viterbi
  15.32%        226646  hmmer_base.amd6  hmmer_base.amd64-m64-gcc42-nn  [.] P7Viterbi.cold
   6.74%         99771  hmmer_peak.amd6  hmmer_peak.amd64-m64-gcc42-nn  [.] P7Viterbi.cold

So the cold part is much faster, but the hot part is quite a bit slower.
I suspect profile changes in the end; the sums are 668130 (base) vs. 711762
(peak) samples.  -fopt-info-loop doesn't show any changes, and the vectorizer
cost-model estimates (and thus the runtime niter checks) are the same.

Note that the Jul 1st regression falls in the range where PR90911 was present,
which should be the observed Jun 13th regression (r272239).

perf points to a sequence of two cmovs being slow compared to one cmov
and a conditional branch.  Fast:

       │            dc[k] = dc[k-1] + tpdd[k-1];                                
  7.42 │ fd0:   mov    -0x80(%rbp),%rbx                                         
  0.08 │        add    -0x8(%rbx,%rax,4),%ecx                                   
       │            if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;         
  0.24 │        mov    -0xa0(%rbp),%rbx                                         
       │            dc[k] = dc[k-1] + tpdd[k-1];                                
  0.07 │        mov    %ecx,-0x4(%rdx,%rax,4)                                   
       │            if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;         
  7.41 │        mov    -0x8(%r15,%rax,4),%esi                                   
  2.69 │        add    -0x8(%rbx,%rax,4),%esi                                   
  3.31 │        cmp    %ecx,%esi                                                
  2.85 │        cmovge %esi,%ecx                                                
 12.17 │        mov    %ecx,-0x4(%rdx,%rax,4)                                   
       │            if (dc[k] < -INFTY) dc[k] = -INFTY;                         
  7.43 │        cmp    $0xc521974f,%ecx                                         
       │        jge    1060                                                     
  0.01 │        movl   $0xc521974f,-0x4(%rdx,%rax,4)                            
       │          for (k = 1; k <= M; k++) {                                    
  0.02 │        mov    %eax,%esi                                                
  0.02 │        inc    %rax                                                     
  0.01 │        cmp    %rax,%rdi                                                
       │        je     106e                                                     
       │            if (dc[k] < -INFTY) dc[k] = -INFTY;                         
  0.01 │        mov    $0xc521974f,%ecx                                         
  0.01 │        jmp    fd0                                
...
  0.00 │1060:   mov    %eax,%esi                                                
  0.03 │        inc    %rax                                                     
  0.00 │        cmp    %rdi,%rax                                                
  0.00 │        jne    fd0 
  0.06 │106e:   cmp    -0x64(%rbp),%esi                                         
       │        jg     1508                  

Slow:

       │            dc[k] = dc[k-1] + tpdd[k-1];                                
  0.06 │13b0:   add    -0x8(%r9,%rcx,4),%eax                                    
  0.07 │        mov    %eax,-0x4(%r13,%rcx,4)                                   
       │            if ((sc = mc[k-1] + tpmd[k-1]) > dc[k]) dc[k] = sc;         
  6.81 │        mov    -0x8(%r8,%rcx,4),%esi                                    
  0.04 │        add    -0x8(%rdx,%rcx,4),%esi                                   
  1.46 │        cmp    %eax,%esi                                                
  0.29 │        cmovge %esi,%eax                                                
       │          for (k = 1; k <= M; k++) {                                    
 13.43 │        mov    %ecx,%esi                                                
  0.05 │        cmp    $0xc521974f,%eax                                         
  6.78 │        cmovl  %ebx,%eax                                                
 13.53 │        mov    %eax,-0x4(%r13,%rcx,4)                                   
  6.81 │        inc    %rcx                                                     
  0.03 │        cmp    %rcx,%rdi                                                
       │        jne    13b0                                                  

where on GIMPLE we have mostly MAX_EXPR <..., -987654321> for the slow case
and a conditional statement in the fast case.  Of course(?) a jump should
be well predicted (and can be speculated [correctly]) while the data dependence
on the earlier access cannot be.

Thus, in the end, this is an RTL expansion / if-conversion or backend costing
issue.  The following heuristic^Whack fixes the regression for me:

Index: gcc/expr.c
===================================================================
--- gcc/expr.c  (revision 272922)
+++ gcc/expr.c  (working copy)
@@ -9159,7 +9159,9 @@ expand_expr_real_2 (sepops ops, rtx targ
          }

        /* Use a conditional move if possible.  */
-       if (can_conditionally_move_p (mode))
+       if ((TREE_CODE (treeop1) != INTEGER_CST
+            || !wi::eq_p (wi::to_widest (treeop1), -987654321))
+           && can_conditionally_move_p (mode))
          {
            rtx insn;

Index: gcc/ifcvt.c
===================================================================
--- gcc/ifcvt.c (revision 272922)
+++ gcc/ifcvt.c (working copy)
@@ -3570,6 +3570,13 @@ noce_process_if_block (struct noce_if_in
        which do a good enough job these days.  */
     return FALSE;

+  rtx_insn *earliest;
+  cond = noce_get_alt_condition (if_info, if_info->a, &earliest);
+  if (cond
+      && CONST_INT_P (XEXP (cond, 1))
+      && INTVAL (XEXP (cond, 1)) == -987654321)
+    return FALSE;
+
   if (noce_try_move (if_info))
     goto success;
   if (noce_try_ifelse_collapse (if_info))
