#2387: Optimizer misses unboxing opportunity
------------------------------------+---------------------------------------
    Reporter:  dolio                |       Owner:          
        Type:  bug                  |      Status:  new     
    Priority:  normal               |   Component:  Compiler
     Version:  6.8.2                |    Severity:  minor   
    Keywords:  optimizer unbox box  |    Testcase:          
Architecture:  x86_64 (amd64)       |          Os:  Linux   
------------------------------------+---------------------------------------
 In my studying of the fannkuch benchmark, I've discovered (I think)
 another missed optimization. A scaled down illustration goes as follows:

 {{{
 {-# LANGUAGE TypeOperators, BangPatterns #-}

 module Main (main) where

 import Control.Monad.ST

 import System.Environment

 data (:*:) a b = !a :*: !b

 whileLoop :: Int -> ST s Int
 whileLoop = go 0
  where
  go !n k
    | k == 0    = return n
    | otherwise = go (n+1) (k-1)
 {-# INLINE whileLoop #-}

 iter :: Int -> Int -> ST s (Bool :*: Int)
 iter k n = do
   k' <- whileLoop 40 >>= \k' -> return $! max k k'
   b <- return (n == 0)

   return $! b :*: k'
 {-# INLINE iter #-}

 mainLoop :: Int -> Int -> ST s Int
 mainLoop k n = do
   done :*: k' <- iter k n

   if done
     then return k'
     else mainLoop k' (n - 1)

 main = print =<< stToIO . mainLoop 0 . read . head =<< getArgs
 }}}

 If we look at the core for whileLoop's worker, we see:

 {{{
 $wpoly_go_r1aE :: forall s_aem.
                   Int# -> Int# -> STRep s_aem Int

 $wpoly_go_r1aE =
   \ (@ s_aem)
     (ww_s18Y :: Int#)
     (ww1_s192 :: Int#)
     (eta_s19w :: State# s_aem) ->
     case ww1_s192 of wild_XF {
       __DEFAULT ->
         $wpoly_go_r1aE @ s_aem (+# ww_s18Y 1) (-# wild_XF 1) eta_s19w;
       0 -> (# eta_s19w, I# ww_s18Y #)
     }
 }}}

 Note, the return type is a boxed Int. The function is only used once, like
 so:

 {{{
     ...
     case $wpoly_go_r1aE @ s_aem 0 40 w_s19f of wild_aFw { (# new_s_aFB,
 r_aFC #) ->
     case r_aFC of wild1_aG9 { I# y_aGb ->
     case <=# ww_s199 y_aGb of wild2_aGd {
       False ->
     ...
 }}}

 In other words, go boxes its results at the end of the loop, and the
 function that uses it immediately looks inside the box for the value. In
 this particular micro-benchmark, the boxed value (wild1_aG9 above) is
 actually used in the case where mainLoop returns the boxed value. However,
 in the larger benchmark I pulled this from, that is not the case in
 several areas (only the unboxed value is used, but it still goes through a
 box). Either way, the boxed value is only used at the end of the loop (and
 not every time there), and on every other iteration, this results in
 superfluous allocation.

 I'll attach a manually unboxed version I wrote (it also has iter and max
 manually inlined to mainLoop, since that makes it easier to write; the
 core for the above shows that they are getting inlined properly, so I
 assume that isn't the issue). Using +RTS -sstderr shows that (running 100
 million iterations here), the manually unboxed version allocates 50
 kilobytes on the heap, and runs in around 15 seconds, whereas the version
 above that doesn't get unboxed does 1.6 gigabytes of heap allocation, and
 takes 18 seconds (in the larger benchmark, such extra boxing would happen
 perhaps 40 million times over the course of the program).

 Thanks for your help.

-- 
Ticket URL: <http://hackage.haskell.org/trac/ghc/ticket/2387>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler
_______________________________________________
Glasgow-haskell-bugs mailing list
[email protected]
http://www.haskell.org/mailman/listinfo/glasgow-haskell-bugs

Reply via email to