Re: More speed please!

Josef Svenningsson Tue, 01 May 2007 08:58:25 -0700

I'm replying to a rather old thread here, about unboxing in functions. Duncan
had a continuation monad which passed around some data type that would be nice
to unbox. You discussed strictness annotations in function types as a potential
solution. I have a different tack on the problem which seems potentially
useful. I've experimented with doing local defunctionalization on the module.
This is a long mail as I will try to explain in some detail what it is that I
have done. Please be patient.


Normal defunctionalization is about replacing the primitive function type
"a -> b" with an algebraic data type which I'll call "Fun a b". Not all
functions will be eliminated as we will see but the program will be first
order after the transformation. The core of the transformation is that every
lambda in the program gives rise to a new constructor in the Fun data type and
whenever we apply a function we instead call a newly created "apply function"
with the following type "Fun a b -> a -> b". This is basically what JHC does.

Defunctionalization is normally a whole program transformation (which is why
JHC is a whole program compiler). But sometimes it can be done on a per module
basis. This is where *local* defunctionalization comes in. The key to local
defunctionalization is that we often can divide the data type Fun into several
disjoint data types. We can do this whenever there are several different
function spaces that never get mixed up. And sometimes we're even so lucky
that a function space is totally contained in one module. Then we can do
local defunctionalization of that particular function space only and
completely within that module without changing it's interface. This case often
comes up when using the continuation monad and Duncan's code is not an
exception.

So, I've manually done local defunctionalization on Duncan's code. It gives
rise to two types which I've called Fun1 and Fun2. They look like follows
(including the Put monad):

\begin{code}
newtype Put a = Put {
       runPut :: Fun2 a
   }

data Fun1 a where
 Bind     :: (a -> Put b) -> Fun1 b -> Fun1 a
 Then     :: Put b  -> Fun1 b -> Fun1 a
 Run      :: Fun1 ()
 FlushOld :: !(Fun1 ()) -> !Int -> !(ForeignPtr Word8) -> !Int -> !Int
           -> Fun1 ()

data Fun2 a where
 Return :: a -> Fun2 a
 Bind2  :: Put a -> (a -> Put b) -> Fun2 b
 Then2  :: Put a -> Put b -> Fun2 b
 Flush  :: Fun2 ()
 Write  :: !Int -> (Ptr Word8 -> IO ()) -> Fun2 ()
\end{code}
Intuitively every constructor corresponds to a closure. I've chosen the name
for the constructor based on which function the closure appears in.

The respective apply functions for these data types acts as interpreters and
executes the corresponding code for each constructor/closure. Their type look
as follow:

\begin{code}
apply1 :: Fun1 a -> a -> Buffer -> [B.ByteString]
apply2 :: Fun2 a -> Fun1 a -> Buffer -> [B.ByteString]
\end{code}

Now, the cool thing is that once GHC starts optimizing away on these apply
functions they will be unboxed and no Buffer will ever be created or passed
around. Here is the core type for apply1:
\begin{core}
$wapply1_r21p :: forall a_aQu.
                 PutMonad.Fun1 a_aQu
                 -> a_aQu
                 -> GHC.Prim.Addr#
                 -> GHC.ForeignPtr.ForeignPtrContents
                 -> GHC.Prim.Int#
                 -> GHC.Prim.Int#
                 -> GHC.Prim.Int#
                 -> [Data.ByteString.Base.ByteString]
\end{core}
This is exactly what Duncan wanted, right? I declare victory :-)

However, things are not all roses. There are some functions that will
not be unboxed as we hope for with this approach, for instance the function
flushOld (see Duncan's code). To achieve the best possible optimization I
think one would have to perform strictness analysis and the worker-wrapper
transformation twice, once before doing local defunctionalization and then
again on the apply functions generated by the defunctionalization process.
This should give the code that Duncan wants I believe.

I think it should be relatively straightforward to implement local
defunctionalization in GHC but it should not be turned on by default as the
number of modules where it is beneficial is rather few.

The complete defunctionalized version of Duncan's module is attached.

I'm sure there are a lot of things that are somewhat unclear in this message.
Feel free to ask and I'll do my best to clarify.

Cheers,

Josef

{-# OPTIONS -fglasgow-exts -fbang-patterns -cpp #-}

module PutMonad (
    -- * The Put type
      Put
    , run     -- :: Put () -> L.ByteString

    -- * Flushing the implicit parse state
    , flush   -- :: Put ()

    -- * Primitives
    , write   -- :: Int -> (Ptr Word8 -> IO ()) -> Put ()
    , word8   -- :: Word8 -> Put ()
  ) where

import Foreign
import qualified Data.ByteString.Base as B (
                   ByteString(PS), LazyByteString(LPS),
                   inlinePerformIO, mallocByteString, nullForeignPtr)
import qualified Data.ByteString.Lazy as L (ByteString)

-- Our internal buffer type
data Buffer = Buffer {-# UNPACK #-} !(ForeignPtr Word8)
                     {-# UNPACK #-} !Int                -- offset
                     {-# UNPACK #-} !Int                -- used bytes
                     {-# UNPACK #-} !Int                -- length left

newtype Put a = Put {
        runPut :: Fun2 a -- Fun1 a -> Buffer -> [B.ByteString]
    }

data Fun1 a where
  Bind     :: (a -> Put b) -> Fun1 b -> Fun1 a
  Then     ::       Put b  -> Fun1 b -> Fun1 a
  Run      ::                           Fun1 ()
  FlushOld :: !(Fun1 ()) -> !Int -> !(ForeignPtr Word8) -> !Int -> !Int 
            -> Fun1 ()

apply1 :: Fun1 a -> a -> Buffer -> [B.ByteString]
apply1 (Bind k c) a b = apply2 (runPut (k a)) c b
apply1 (Then k c) _ b = apply2 (runPut k)     c b
apply1 (Run)      _ (Buffer fp o u l) =
         if u == 0
         then               []
         else B.PS fp o u : []
apply1 (FlushOld c n fp o u) _ (Buffer fp' _ _ _) =
  if u == 0
    then               apply1 c () (Buffer fp' 0 n size)
    else B.PS fp o u : apply1 c () (Buffer fp' 0 n size)

  where size = max n defaultSize - n


data Fun2 a where
  Return :: a -> Fun2 a
  Bind2  :: Put a -> (a -> Put b) -> Fun2 b
  Then2  :: Put a -> Put b -> Fun2 b
  Flush  :: Fun2 ()
  Write  :: !Int -> (Ptr Word8 -> IO ()) -> Fun2 ()

apply2 :: Fun2 a -> Fun1 a -> Buffer -> [B.ByteString]
apply2 (Return a)  c b = apply1 c a b
apply2 (Bind2 m k) c b = apply2 (runPut m) (Bind k c) b
apply2 (Then2 m k) c b = apply2 (runPut m) (Then k c) b
apply2 (Flush)     c buf@(Buffer fp o u l) = 
  if u == 0
    then               apply1 c () buf
    else B.PS fp o u : apply1 c () (Buffer fp (o+u) 0 l)
apply2 (Write n body) c buf@(Buffer fp o u l) =
  if n <= l
    then write' c fp o u l
    else write' (flushOld c n fp o u) (newBuffer c n) 0 0 0
 
  where -- BUG: Aaargh! GHC always inlines write' in both branches
        --      above despite the NOINLINE pragma below.
       
        {-# NOINLINE write' #-}
        write' c !fp !o !u !l =
          -- warning: this is a tad hardcore
          B.inlinePerformIO
            (withForeignPtr fp
              (\p -> body $! (p `plusPtr` (o+u))))
          `seq` apply1 c () (Buffer fp o (u+n) (l-n))

instance Monad Put where
	return = returnPut
	(>>=)  = bindPut
        (>>)   = thenPut
        {-# INLINE (>>) #-}

-- The monad bits

returnPut :: a -> Put a
returnPut a = Put (Return a)
{-# INLINE returnPut #-}

bindPut :: Put a -> (a -> Put b) -> Put b
bindPut m k = Put (Bind2 m k)
{-# INLINE bindPut #-}

thenPut :: Put a -> Put b -> Put b 
thenPut m k = Put (Then2 m k)
{-# INLINE [1] thenPut #-}

-- The rest

run :: Put () -> L.ByteString
run put = B.LPS $
  apply2 (runPut put)
    Run
    (Buffer B.nullForeignPtr 0 0 0)

flush :: Put ()
flush = Put Flush
{-# NOINLINE [1] flush #-}

write :: Int -> (Ptr Word8 -> IO ()) -> Put ()
write !n body = Put (Write n body)

{-# INLINE [1] write #-}

-- note the sneaky unused first parameter to stop newBuffer getting floated out
-- which would mean all bufferes would actually be the same buffer, which would
-- be bad. :-)
newBuffer :: t -> Int -> ForeignPtr a
newBuffer _ n = B.inlinePerformIO (B.mallocByteString (max n defaultSize))
{-# NOINLINE newBuffer #-}

flushOld c !n !fp !o !u = FlushOld c n fp o u
{-
  _ (Buffer fp' _ _ _) =
  if u == 0
    then               apply1 c () (Buffer fp' 0 n size)
    else B.PS fp o u : apply1 c () (Buffer fp' 0 n size)

  where size = max n defaultSize - n
-}
{-# NOINLINE flushOld #-}

{-# RULES

"write/write" forall n m a b.
  write n a `thenPut` write m b = write (n+m) (\p -> a p >> b (p `plusPtr` n))

"flush/flush"
  flush `thenPut` flush = flush

  #-}

--
-- copied from Data.ByteString.Lazy
--
defaultSize :: Int
defaultSize = 32 * k - overhead
    where k = 1024
          overhead = 2 * sizeOf (undefined :: Int)

--
-- Various derived 'putters'
--

word8 :: Word8 -> Put ()
word8 !w = write 1 (pokeWord8 w)
{-# INLINE word8 #-}

pokeWord8 :: Word8 -> Ptr Word8 -> IO ()
pokeWord8 w p = poke p w
{-# INLINE pokeWord8 #-}

_______________________________________________
Glasgow-haskell-users mailing list
[email protected]
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users

Re: More speed please!

Reply via email to