#7429: Unexplained performance boost with +RTS -h ---------------------------------+------------------------------------------ Reporter: simonmar | Owner: Type: bug | Status: new Priority: normal | Milestone: 7.8.1 Component: Compiler | Version: 7.6.1 Keywords: | Os: Unknown/Multiple Architecture: Unknown/Multiple | Failure: Runtime performance bug Difficulty: Unknown | Testcase: Blockedby: | Blocking: Related: | ---------------------------------+------------------------------------------ In #5505 the following program was reported:
{{{ {-# LANGUAGE RankNTypes ,RecursiveDo ,TypeFamilies ,GeneralizedNewtypeDeriving #-} import System.TimeIt import qualified Data.Vector as V import Control.Monad.State.Strict import Control.Monad.ST.Strict import Data.STRef -- STO Monad is just StateT with an Int stacked on top of a ST monad. newtype STO s a = STO { unSTO :: StateT Int (ST s) a } deriving (Functor, Monad, MonadFix) runSTO :: (forall s. STO s a) -> a runSTO x = runST (evalStateT (unSTO x) 0) data CircList s = ConsM {value :: {-# UNPACK #-} !Int ,cons :: {-# UNPACK #-} !(STRef s (CircList s )) } -- | Defines a circular list of length ns clist ns = do rec allItems <- let next i = nextCons i $ (V.!) allItems ((i+1) `mod` ns) in V.generateM ns next return $ (V.!) allItems 0 where nextCons v n = do n' <- switchSTorSTO $ newSTRef $ n return $ ConsM v n' -- | Take one step in the CircList oneStep (ConsM v c) = switchSTorSTO $ readSTRef c -- | I tie a circular list of size 100 and step through it n times. main :: IO () main = do let n = 333222111 :: Int timeIt . print $ runSTO ( clist 1001 >>= liftM value . iterateM n oneStep) --timeIt . print $ runST ( clist 1001 >>= liftM value . iterateM n oneStep) -- **************************** TO SWITCH TO ST: switch between the two sets of definitions above and below this line --switchSTorSTO = id switchSTorSTO = STO . lift iterateM :: (Monad m) => Int -> (b -> m b) -> b -> m b iterateM n f c = go n c where go 0 b = return b go ns b = f b >>= go (ns-1) }}} Which on my system, when compiled for profiling, goes more than twice as fast with `+RTS -h`. The following is some output from `perf stat` with both versions: {{{ > perf stat -e cycles -e instructions -e L1-dcache-load-misses -e stalled- cycles-backend ./5505 222 CPU time: 5.34s Performance counter stats for './5505': 10,676,916,633 cycles # 0.000 GHz [75.00%] 11,023,609,154 instructions # 1.03 insns per cycle # 0.60 stalled cycles per insn [75.01%] 462,115,593 L1-dcache-load-misses [75.01%] 6,661,087,690 stalled-cycles-backend # 62.39% backend cycles idle [75.01%] 5.346786849 seconds time elapsed }}} and for the fast version: {{{ > perf stat -e cycles -e instructions -e L1-dcache-load-misses -e stalled- cycles-backend ./5505 +RTS -h 222 CPU time: 2.38s Performance counter stats for './5505 +RTS -h': 4,766,694,528 cycles # 0.000 GHz [75.02%] 10,691,687,462 instructions # 2.24 insns per cycle # 0.06 stalled cycles per insn [75.02%] 18,763,412 L1-dcache-load-misses [75.02%] 673,890,507 stalled-cycles-backend # 14.14% backend cycles idle [74.99%] 2.399953273 seconds time elapsed }}} There's a huge difference in the number of L1-dcache misses, and the slow version is stalled for 62% of the time. Why is this? I'm not exactly sure. The program doesn't allocate anything except at the beginning. The effect of `-h` is to cause a major GC to happen every 0.1s or so, which copies some data, which should tend to cause more cache misses, rather than fewer. I don't know what's going on, so I'm leaving the details in this ticket for now. -- Ticket URL: <http://hackage.haskell.org/trac/ghc/ticket/7429> GHC <http://www.haskell.org/ghc/> The Glasgow Haskell Compiler _______________________________________________ Glasgow-haskell-bugs mailing list Glasgow-haskell-bugs@haskell.org http://www.haskell.org/mailman/listinfo/glasgow-haskell-bugs