#1498: Optimisation: eliminate unnecessary heap check in recursive function
-------------------------------------------+--------------------------------
    Reporter:  simonmar                    |       Owner:                       
  
        Type:  bug                         |      Status:  new                  
  
    Priority:  low                         |   Milestone:  7.6.1                
  
   Component:  Compiler                    |     Version:  6.6.1                
  
    Keywords:                              |          Os:  Unknown/Multiple     
  
Architecture:  Unknown/Multiple            |     Failure:  Runtime performance 
bug
  Difficulty:  Moderate (less than a day)  |    Testcase:                       
  
   Blockedby:  4258                        |    Blocking:                       
  
     Related:                              |  
-------------------------------------------+--------------------------------

Comment(by batterseapower):

 Out of curiosity I changed CgCase to push heap checks into branches in
 more cases, according to the following diff. My patch affects those cases
 with primop scrutinees which have at least 2 alts *and* are which will not
 take a heap check before scrutinee evaluation.

 Such cases are instead treated just like algebraic cases where the alts
 are given a full info tables etc so the heap check can be done in the
 branches, *after* evaluating the scrutinee.

 Nofib results were poor: code size increased by an average of 0.1% (max
 1.7%) and allocations didn't fall perceptibly on average (they fell by
 0.2% in cacheprof though). Runtime rose by 1.8%, with a max increase of
 16.2% in wheel-sieve1.

 It is not really surprising that this costs so much when you consider the
 overhead of spilling variables to a stack-carried closure for the alts,
 and that even if you end up eliminating a heap check, you might end up
 causing a brand new stack check. It doesn't help that the code generated
 by this change tends to be particularly stupid in more obvious ways. Here
 is the inner loop of the example:

 {{{
 [section "data" {
      HC_foo_closure:
          const HC_foo_info;
  },
  sbX_ret()
          { update_frame: <none>
            label: sbX_info
            rep:StackRep [True, True, False, True]
          }
      cca:
          _ccb::I64 = R1 & 7;
          if (_ccb::I64 >= 2) goto ccc;
          _sbR::F64 = F64[I64[Sp + 24] + 16 + (I64[Sp + 16] << 3)];
          _sbU::F64 = %MO_F_Mul_W64(F64[Sp + 8], _sbR::F64);
          _sbV::I64 = I64[Sp + 16] + 1;
          R1 = _sbV::I64;
          D1 = _sbU::F64;
          Sp = Sp + 16;
          jump sbO_info; // [D1, R1]
      ccc:
          Hp = Hp + 16;
          if (Hp > HpLim) goto cci;
          I64[Hp - 8] = ghczmprim_GHCziTypes_Dzh_con_info;
          F64[Hp + 0] = F64[Sp + 8];
          R1 = ghczmprim_GHCziTuple_Z0T_closure+1;
          R2 = Hp - 7;
          Sp = Sp + 40;
          jump (I64[Sp + 0]); // [R2, R1]
      ccj: jump stg_gc_enter_1; // [R1]
      cci:
          HpAlloc = 16;
          goto ccj;
  },
  sbO_ret()
          { update_frame: <none>
            label: sbO_info
            rep:StackRep [False, True]
          }
      ccl:
          I64[Sp + 0] = R1;
          F64[Sp - 8] = D1;
          _ccn::I64 = %MO_S_Ge_W64(R1, I64[Sp + 16]);
          R1 = I64[ghczmprim_GHCziTypes_Bool_closure_tbl + (_ccn::I64 <<
 3)];
          I64[Sp - 16] = sbX_info;
          Sp = Sp - 16;
          jump sbX_info; // [R1]
  },
  HC_foo_info()
          { update_frame: <none>
            label: HC_foo_info
            rep:HeapRep static { Fun {arity: 2 fun_type: ArgSpec 11} }
          }
      ccp:
          if (Sp - 40 < SpLim) goto ccr;
          I64[Sp - 8] = R3;
          I64[Sp - 16] = R2;
          R1 = 0;
          D1 = 0.0 :: W64;
          Sp = Sp - 24;
          jump sbO_info; // [D1, R1]
      ccr:
          R1 = HC_foo_closure;
          jump stg_gc_fun; // [R1, R3, R2]
  }]
 }}}

 Note the totally unnecessary use of the
 ghczmprim_GHCziTypes_Bool_closure_tbl. (It is worth noting this same
 problem afflicts the new codegen even in HEAD.)

 Here is a possible restriction of this optimization that will still hit
 the particular example in this ticket. Note that deferring the heap check
 is a clear win when the heap check can have the GC reenter in such a way
 that it *reevaluates the scrutinee*. This is safe from a work-duplication
 perspective when the scrutinee is cheap, as "i >=# n" is. If we don't mind
 duplicating that work then we can avoid creating a stack entry for the
 alts and just reuse the existing stack entry for the whole case. In the
 CMM above, this would amount to not giving sbX_ret an info table and
 instead having stg_gc just reenter the already-present sbO_ret closure.

 Actually implementing this idea is a problematic, not least because we
 need to know whether all the currently emitted code from the last closure
 onwards is safe to work-duplicate in this way. It is not sufficient to
 just check the scrutinee - it depends on the context of the cgCase.

 For reference, the patch I tested is as follows:

 {{{

 $ git diff
 diff --git a/compiler/codeGen/CgCase.lhs b/compiler/codeGen/CgCase.lhs
 index dd607de..ad7610e 100644
 --- a/compiler/codeGen/CgCase.lhs
 +++ b/compiler/codeGen/CgCase.lhs
 @@ -190,10 +190,15 @@ cgCase (StgOpApp (StgPrimOp SeqOp) [StgVarArg a, _]
 _)
  Special case #3: inline PrimOps and foreign calls.

  \begin{code}
 -cgCase (StgOpApp (StgPrimOp primop) args _)
 -       _live_in_whole_case live_in_alts bndr alt_type alts
 +cgCase e@(StgOpApp (StgPrimOp primop) args _)
 +       live_in_whole_case live_in_alts bndr alt_type alts
    | not (primOpOutOfLine primop)
 -  = cgInlinePrimOp primop args bndr alt_type live_in_alts alts
 +  = do { hp_usage <- getHpUsage
 +       ; if heapHWM hp_usage > heapHWM initHpUsage || length alts <= 1 --
 Reuse existing heap check for alts if it already exists, or if there is no
 way
 +          then cgInlinePrimOp primop args bndr alt_type live_in_alts alts
 +          else cgCase' e live_in_whole_case live_in_alts bndr alt_type
 alts }
 +cgCase expr live_in_whole_case live_in_alts bndr alt_type alts
 +  = cgCase' expr live_in_whole_case live_in_alts bndr alt_type alts
  \end{code}

  TODO: Case-of-case of primop can probably be done inline too (but
 @@ -205,7 +210,7 @@ Special case #4: inline foreign calls: an unsafe
 foreign call can be done
  right here, just like an inline primop.

  \begin{code}
 -cgCase (StgOpApp (StgFCallOp fcall _) args _)
 +cgCase' (StgOpApp (StgFCallOp fcall _) args _)
         _live_in_whole_case live_in_alts _bndr _alt_type alts
    | unsafe_foreign_call
    = ASSERT( isSingleton alts )
 @@ -229,7 +234,7 @@ This can be done a little better than the general
 case, because
  we can reuse/trim the stack slot holding the variable (if it is in one).

  \begin{code}
 -cgCase (StgApp fun args)
 +cgCase' (StgApp fun args)
          _live_in_whole_case live_in_alts bndr alt_type alts
    = do  { fun_info <- getCgIdInfo fun
          ; arg_amodes <- getArgAmodes args
 @@ -268,7 +273,7 @@ deAllocStackTop call is doing above.
  Finally, here is the general case.

  \begin{code}
 -cgCase expr live_in_whole_case live_in_alts bndr alt_type alts
 +cgCase' expr live_in_whole_case live_in_alts bndr alt_type alts
    = do  {       -- Figure out what volatile variables to save
            nukeDeadBindings live_in_whole_case
 }}}

-- 
Ticket URL: <http://hackage.haskell.org/trac/ghc/ticket/1498#comment:12>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler

_______________________________________________
Glasgow-haskell-bugs mailing list
[email protected]
http://www.haskell.org/mailman/listinfo/glasgow-haskell-bugs

Reply via email to