#7429: Unexplained performance boost with +RTS -h
---------------------------------+------------------------------------------
    Reporter:  simonmar          |       Owner:                         
        Type:  bug               |      Status:  new                    
    Priority:  normal            |   Milestone:  7.8.1                  
   Component:  Compiler          |     Version:  7.6.1                  
    Keywords:                    |          Os:  Unknown/Multiple       
Architecture:  Unknown/Multiple  |     Failure:  Runtime performance bug
  Difficulty:  Unknown           |    Testcase:                         
   Blockedby:                    |    Blocking:                         
     Related:                    |  
---------------------------------+------------------------------------------
 In #5505 the following program was reported:

 {{{
 {-# LANGUAGE
         RankNTypes
         ,RecursiveDo
         ,TypeFamilies
         ,GeneralizedNewtypeDeriving
   #-}

 import System.TimeIt
 import qualified Data.Vector as V
 import Control.Monad.State.Strict
 import Control.Monad.ST.Strict
 import Data.STRef

 -- STO Monad is just StateT with an Int stacked on top of a ST monad.
 newtype STO s a = STO { unSTO :: StateT Int (ST s) a }
     deriving (Functor, Monad, MonadFix)

 runSTO :: (forall s. STO s a) -> a
 runSTO x = runST (evalStateT (unSTO x) 0)

 data CircList s = ConsM {value :: {-# UNPACK #-} !Int
                         ,cons  :: {-# UNPACK #-} !(STRef s (CircList s ))
 }

 -- | Defines a circular list of length ns
 clist ns = do
     rec allItems <-
             let next i = nextCons i $ (V.!) allItems ((i+1) `mod` ns)
             in V.generateM ns next
     return $ (V.!) allItems 0
     where nextCons v n = do
                         n' <- switchSTorSTO  $ newSTRef $ n
                         return $ ConsM v n'

 -- | Take one step in the CircList
 oneStep (ConsM v c) = switchSTorSTO $ readSTRef c

 -- | I tie a circular list of size 100 and step through it n times.
 main :: IO ()
 main = do
     let n = 333222111 :: Int
     timeIt . print $ runSTO ( clist 1001 >>= liftM value . iterateM n oneStep)
     --timeIt . print $ runST ( clist 1001 >>= liftM value . iterateM n oneStep)
 --  ****************************  TO SWITCH TO ST: switch between the two sets of definitions above and below this line
 --switchSTorSTO = id
 switchSTorSTO  = STO . lift

 iterateM :: (Monad m) => Int -> (b -> m b) -> b -> m b
 iterateM n f c = go n c where
     go 0 b = return b
     go ns b = f b >>= go (ns-1)
 }}}
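
 The program needs the `timeit` package from Hackage, built with profiling
 libraries.  The exact build flags aren't recorded here, but something along
 these lines reproduces the two configurations (the flags are a guess):

 {{{
 > ghc -O2 -prof -fprof-auto -rtsopts 5505.hs -o 5505
 > ./5505               # plain run
 > ./5505 +RTS -h       # heap-profiled run (also writes 5505.hp)
 }}}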

 On my system, when compiled for profiling, this program runs more than
 twice as fast with `+RTS -h`.  The following is some output from `perf
 stat` for both versions:

 {{{
 > perf stat -e cycles -e instructions -e L1-dcache-load-misses -e stalled-cycles-backend ./5505
 222
 CPU time:   5.34s

  Performance counter stats for './5505':

     10,676,916,633 cycles                    #    0.000 GHz                      [75.00%]
     11,023,609,154 instructions              #    1.03  insns per cycle
                                               #    0.60  stalled cycles per insn [75.01%]
        462,115,593 L1-dcache-load-misses                                         [75.01%]
      6,661,087,690 stalled-cycles-backend    #   62.39% backend cycles idle      [75.01%]

        5.346786849 seconds time elapsed
 }}}

 and for the fast version:

 {{{
 > perf stat -e cycles -e instructions -e L1-dcache-load-misses -e stalled-cycles-backend ./5505 +RTS -h
 222
 CPU time:   2.38s

  Performance counter stats for './5505 +RTS -h':

      4,766,694,528 cycles                    #    0.000 GHz                      [75.02%]
     10,691,687,462 instructions              #    2.24  insns per cycle
                                               #    0.06  stalled cycles per insn [75.02%]
         18,763,412 L1-dcache-load-misses                                         [75.02%]
        673,890,507 stalled-cycles-backend    #   14.14% backend cycles idle      [74.99%]

        2.399953273 seconds time elapsed
 }}}

 There's a huge difference in the number of L1-dcache load misses (462M vs.
 19M), and the slow version has its backend stalled for 62% of its cycles.

 Why is this?  I'm not exactly sure.  The program doesn't allocate anything
 except at the beginning.  The effect of `-h` is to cause a major GC to
 happen every 0.1s or so, which copies some data; if anything, that should
 cause more cache misses, not fewer.
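
 The extra major collections can be seen in the GC summary from `+RTS -s`,
 by comparing runs with and without `-h` (not something recorded above,
 just a way to check):

 {{{
 > ./5505 +RTS -s          # baseline: the usual number of major GCs
 > ./5505 +RTS -h -s       # heap profiling adds a major GC roughly every 0.1s
 }}}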

 I don't know what's going on, so I'm leaving the details in this ticket
 for now.

-- 
Ticket URL: <http://hackage.haskell.org/trac/ghc/ticket/7429>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler
