Re: [Haskell-cafe] blas bindings, why are they so much slower the C?
so I had a look at the code. The loops are all fine. replicateM_ isn't a problem, but getDot is decidedly non trivial. Lots of pattern matching on different vector forms, and to top it off ffi calls. With some inlining in the blas library I was able to cut a few seconds off the running time, but getDot looks to be fundamentally a bit complicated in the current implementation. I wonder if you'll get different results with hmatrix? Anyway, this is a library issue. Better take it up with Patrick. Pass on to the library author the C code, the Haskell you think should be compiled identically. -- Don aeyakovenko: > i get the same crappy performance with: > > $ cat htestdot.hs > {-# OPTIONS_GHC -O2 -fexcess-precision -funbox-strict-fields > -fglasgow-exts -fbang-patterns -lcblas#-} > module Main where > > import Data.Vector.Dense.IO > import Control.Monad > > main = do >let size = 10 >let times = 10*1000*1000 >v1::IOVector Int Double <- newListVector size $ replicate size 0.1 >v2::IOVector Int Double <- newListVector size $ replicate size 0.1 >replicateM_ times $ v1 `getDot` v2 > > > > On Fri, Jun 27, 2008 at 7:41 PM, Dan Doel <[EMAIL PROTECTED]> wrote: > > On Friday 27 June 2008, Anatoly Yakovenko wrote: > >> $ cat htestdot.hs > >> {-# OPTIONS_GHC -O2 -fexcess-precision -funbox-strict-fields > >> -fglasgow-exts -fbang-patterns -lcblas#-} > >> module Main where > >> > >> import Data.Vector.Dense.IO > >> import Control.Monad > >> > >> main = do > >>let size = 10 > >>let times = 10*1000*1000 > >>v1::IOVector Int Double <- newListVector size $ replicate size 0.1 > >>v2::IOVector Int Double <- newListVector size $ replicate size 0.1 > >>sum <- foldM (\ ii zz -> do > >> rv <- v1 `getDot` v2 > >> return $ zz + rv > >> ) 0.0 [0..times] > >>print $ sum > > > > Hackage is down for the time being, so I can't install blas and look at the > > core for your program. However, there are still some reasons why this code > > would be slow. > > > > For instance, a brief experiment seems to indicate that foldM is not a good > > consumer in the foldr/build sense, so no deforestation occurs. Your program > > is iterating over a 10-million element lazy list. That's going to add > > overhead. I wrote a simple test program which just adds 0.1 in each > > iteration: > > > > snip > > > > {-# LANGUAGE BangPatterns #-} > > > > module Main (main) where > > > > import Control.Monad > > > > main = do > > let times = 10*1000*1000 > > sum <- foldM (\_ zz -> return $ zz + 0.1) 0.0 [0..times] > > -- sum <- foo 0 times 0.0 > > print $ sum > > > > foo :: Int -> Int -> Double -> IO Double > > foo k m !zz > > | k <= m = foo (k+1) m (zz + 0.1) > > | otherwise = return zz > > > > snip > > > > With foldM, it takes 2.5 seconds on my machine. If you comment that line, > > and > > use foo instead, it takes around .1 seconds. So that's a factor of what, > > 250? > > That loop allows for a lot more unboxing, which allows much better code to > > be > > generated. > > > > When Hackage comes back online, I'll take a look at your code, and see if I > > can make it run faster, but you might want to try it yourself in the time > > being. Strictifying the addition of the accumulator is probably a good idea, > > for instance. > > > > Cheers, > > -- Dan > > > ___ > Haskell-Cafe mailing list > Haskell-Cafe@haskell.org > http://www.haskell.org/mailman/listinfo/haskell-cafe Sat Jun 28 00:33:17 PDT 2008 Don Stewart <[EMAIL PROTECTED]> * Some inlining and unpacking for DVectors New patches: [Some inlining and unpacking for DVectors Don Stewart <[EMAIL PROTECTED]>**20080628073317] { hunk ./BLAS/Internal.hs 115 +{-# INLINE checkVecVecOp #-} hunk ./Data/Vector/Dense/Internal.hs 78 -data DVector t n e = - DV { fptr :: !(ForeignPtr e) -- ^ a pointer to the storage region - , offset :: !Int-- ^ an offset (in elements, not bytes) to the first element in the vector. - , len:: !Int-- ^ the length of the vector - , stride :: !Int-- ^ the stride (in elements, not bytes) between elements. +data DVector t n e = + DV { fptr :: {-# UNPACK #-} !(ForeignPtr e) -- ^ a pointer to the storage region + , offset :: {-# UNPACK #-} !Int-- ^ an offset (in elements, not bytes) to the first element in the vector. + , len:: {-# UNPACK #-} !Int-- ^ the length of the vector + , stride :: {-# UNPACK #-} !Int-- ^ the stride (in elements, not bytes) between elements. hunk ./Data/Vector/Dense/Internal.hs 84 -| C !(DVector t n e)-- ^ a conjugated vector +| C {-# UNPACK #-} !(DVector t n e)-- ^ a conjugated vector hunk ./Data/Vector/Dense/Internal.hs 92 +{-# INLINE coerceVector #-} hunk ./Data/Vector/Dense/Internal.hs 424 + hunk ./Data/Vector/Dense/Operations.hs 144 +{-# INLI
Re: [Haskell-cafe] blas bindings, why are they so much slower the C?
aeyakovenko: > i get the same crappy performance with: > > $ cat htestdot.hs > {-# OPTIONS_GHC -O2 -fexcess-precision -funbox-strict-fields > -fglasgow-exts -fbang-patterns -lcblas#-} > module Main where > > import Data.Vector.Dense.IO > import Control.Monad > > main = do >let size = 10 >let times = 10*1000*1000 >v1::IOVector Int Double <- newListVector size $ replicate size 0.1 >v2::IOVector Int Double <- newListVector size $ replicate size 0.1 >replicateM_ times $ v1 `getDot` v2 replicateM_ is using a list underneath for control as well, replicateM n x= sequence (replicate n x) Try writing a simple recursive loop, as Dan suggested. No list node forcing overhead, so in a very tight loop you'll just want the index in a register. See here for more examples of tight register loops, http://cgi.cse.unsw.edu.au/~dons/blog/2008/05/16#fast In general, if you're chasing C performance for a loop, your best bet is to write a loop first. Then later see if you can get the same kind of code from higher order, lazy, monadic functions. -- Don ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] blas bindings, why are they so much slower the C?
i get the same crappy performance with: $ cat htestdot.hs {-# OPTIONS_GHC -O2 -fexcess-precision -funbox-strict-fields -fglasgow-exts -fbang-patterns -lcblas#-} module Main where import Data.Vector.Dense.IO import Control.Monad main = do let size = 10 let times = 10*1000*1000 v1::IOVector Int Double <- newListVector size $ replicate size 0.1 v2::IOVector Int Double <- newListVector size $ replicate size 0.1 replicateM_ times $ v1 `getDot` v2 On Fri, Jun 27, 2008 at 7:41 PM, Dan Doel <[EMAIL PROTECTED]> wrote: > On Friday 27 June 2008, Anatoly Yakovenko wrote: >> $ cat htestdot.hs >> {-# OPTIONS_GHC -O2 -fexcess-precision -funbox-strict-fields >> -fglasgow-exts -fbang-patterns -lcblas#-} >> module Main where >> >> import Data.Vector.Dense.IO >> import Control.Monad >> >> main = do >>let size = 10 >>let times = 10*1000*1000 >>v1::IOVector Int Double <- newListVector size $ replicate size 0.1 >>v2::IOVector Int Double <- newListVector size $ replicate size 0.1 >>sum <- foldM (\ ii zz -> do >> rv <- v1 `getDot` v2 >> return $ zz + rv >> ) 0.0 [0..times] >>print $ sum > > Hackage is down for the time being, so I can't install blas and look at the > core for your program. However, there are still some reasons why this code > would be slow. > > For instance, a brief experiment seems to indicate that foldM is not a good > consumer in the foldr/build sense, so no deforestation occurs. Your program > is iterating over a 10-million element lazy list. That's going to add > overhead. I wrote a simple test program which just adds 0.1 in each > iteration: > > snip > > {-# LANGUAGE BangPatterns #-} > > module Main (main) where > > import Control.Monad > > main = do > let times = 10*1000*1000 > sum <- foldM (\_ zz -> return $ zz + 0.1) 0.0 [0..times] > -- sum <- foo 0 times 0.0 > print $ sum > > foo :: Int -> Int -> Double -> IO Double > foo k m !zz > | k <= m = foo (k+1) m (zz + 0.1) > | otherwise = return zz > > snip > > With foldM, it takes 2.5 seconds on my machine. If you comment that line, and > use foo instead, it takes around .1 seconds. So that's a factor of what, 250? > That loop allows for a lot more unboxing, which allows much better code to be > generated. > > When Hackage comes back online, I'll take a look at your code, and see if I > can make it run faster, but you might want to try it yourself in the time > being. Strictifying the addition of the accumulator is probably a good idea, > for instance. > > Cheers, > -- Dan > ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] blas bindings, why are they so much slower the C?
On Friday 27 June 2008, Anatoly Yakovenko wrote: > $ cat htestdot.hs > {-# OPTIONS_GHC -O2 -fexcess-precision -funbox-strict-fields > -fglasgow-exts -fbang-patterns -lcblas#-} > module Main where > > import Data.Vector.Dense.IO > import Control.Monad > > main = do >let size = 10 >let times = 10*1000*1000 >v1::IOVector Int Double <- newListVector size $ replicate size 0.1 >v2::IOVector Int Double <- newListVector size $ replicate size 0.1 >sum <- foldM (\ ii zz -> do > rv <- v1 `getDot` v2 > return $ zz + rv > ) 0.0 [0..times] >print $ sum Hackage is down for the time being, so I can't install blas and look at the core for your program. However, there are still some reasons why this code would be slow. For instance, a brief experiment seems to indicate that foldM is not a good consumer in the foldr/build sense, so no deforestation occurs. Your program is iterating over a 10-million element lazy list. That's going to add overhead. I wrote a simple test program which just adds 0.1 in each iteration: snip {-# LANGUAGE BangPatterns #-} module Main (main) where import Control.Monad main = do let times = 10*1000*1000 sum <- foldM (\_ zz -> return $ zz + 0.1) 0.0 [0..times] -- sum <- foo 0 times 0.0 print $ sum foo :: Int -> Int -> Double -> IO Double foo k m !zz | k <= m = foo (k+1) m (zz + 0.1) | otherwise = return zz snip With foldM, it takes 2.5 seconds on my machine. If you comment that line, and use foo instead, it takes around .1 seconds. So that's a factor of what, 250? That loop allows for a lot more unboxing, which allows much better code to be generated. When Hackage comes back online, I'll take a look at your code, and see if I can make it run faster, but you might want to try it yourself in the time being. Strictifying the addition of the accumulator is probably a good idea, for instance. Cheers, -- Dan ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] blas bindings, why are they so much slower the C?
> I suspect that it is your initialization that is the difference. For > one thing, you've initialized the arrays to different values, and in > your C code you've fused what are two separate loops in your Haskell > code. So you've not only given the C compiler an easier loop to run > (since you're initializing the array to a constant rather than to a > sequence of numbers), but you've also manually optimized that > initialization. In fact, this fusion could be precisely the factor of > two. Why not see what happens in Haskell if you create just one > vector and dot it with itself? (of course, that'll also make the blas > call faster, so you'll need to be careful in your interpretation of > your results.) The difference cant be in the initialization. I am calling the dot product a million times, the malloc and init in both cases are insignificant. Also, "fusing" the two loops in C probably wont help, if anything having each loop run separate is likely to be faster and result in less cache misses. In this case, i am using vectors of size 10 only, and calling the loop 10 million times, haskell is far far slower, or 35 times. That's pretty crappy. $ cat htestdot.hs {-# OPTIONS_GHC -O2 -fexcess-precision -funbox-strict-fields -fglasgow-exts -fbang-patterns -lcblas#-} module Main where import Data.Vector.Dense.IO import Control.Monad main = do let size = 10 let times = 10*1000*1000 v1::IOVector Int Double <- newListVector size $ replicate size 0.1 v2::IOVector Int Double <- newListVector size $ replicate size 0.1 sum <- foldM (\ ii zz -> do rv <- v1 `getDot` v2 return $ zz + rv ) 0.0 [0..times] print $ sum $ ghc --make htestdot.hs $ time ./htestdot 1.0001e7 real0m17.328s user0m17.320s sys 0m0.010 $ cat testdot.c #include #include #include #include int main() { int size = 10; int times = 10*1000*1000; int ii = 0; double dd = 0.0; double* v1 = malloc(sizeof(double) * (size)); double* v2 = malloc(sizeof(double) * (size)); for(ii = 0; ii < size; ++ii) { v1[ii] = 0.1; } for(ii = 0; ii < size; ++ii) { v2[ii] = 0.1; } for(ii = 0; ii < times; ++ii) { dd += cblas_ddot(size, v1, 1, v2, 1); } free(v1); free(v2); printf("%f\n", dd); return 0; } $ gcc -O2 testdot.c -lcblas -o testdot $ time ./testdot 99.999839 real0m0.491s user0m0.480s sys 0m0.020s Just to make sure that fold isnt causing the slowdown, i reverted the haskell program to use the mapM_, i still got almost the same performance: $ cat htestdot.hs {-# OPTIONS_GHC -O2 -fexcess-precision -funbox-strict-fields -fglasgow-exts -fbang-patterns -lcblas#-} module Main where import Data.Vector.Dense.IO import Control.Monad main = do let size = 10 let times = 10*1000*1000 v1::IOVector Int Double <- newListVector size $ replicate size 0.1 v2::IOVector Int Double <- newListVector size $ replicate size 0.1 mapM_ (\ ii -> do v1 `getDot` v2) [0..times] $ ghc --make htestdot $ time ./htestdot real0m15.660s user0m15.630s sys 0m0.030s This is what the profiler has to say: $ cat htestdot.prof Fri Jun 27 18:06 2008 Time and Allocation Profiling Report (Final) htestdot +RTS -p -RTS total time = 22.00 secs (1100 ticks @ 20 ms) total alloc = 3,320,010,716 bytes (excludes profiling overheads) COST CENTREMODULE %time %alloc main Main 100.0 100.0 individualinherited COST CENTRE MODULE no.entries %time %alloc %time %alloc MAIN MAIN 1 0 0.00.0 100.0 100.0 mainMain 222 1 93.6 88.093.6 88.0 CAF Main 216 5 0.00.0 6.4 12.0 main Main 223 0 6.4 12.0 6.4 12.0 CAF GHC.Handle 168 3 0.00.0 0.00.0 ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] blas bindings, why are they so much slower the C?
On 19 Jun 2008, at 4:16 am, Anatoly Yakovenko wrote: C doesn't work like that :). functions always get called. Not true. A C compiler must produce the same *effect* as if the function had been called, but if by some means the compiler knows that the function has no effect, it is entitled to skip the call. In particular, the C compiler I normally use offers these pragmas, amongst others: #pragma does_not_write_global_data (funcname [, funcname]) #pragma no_side_effect(funcname[, funcname]) So with a declaration like extern double cblas_ddot( int, double const *, int, double const *, int); #pragma no_side_effect (cblas_ddot) the compiler would be completely within its rights to discard any call to cblas_ddot() whose result was not used. (As it happens, it didn't, but it would have been allowed to.) If using gcc, extern double cblas_ddot( ... as before ...) __attribute__ ((const)); seems to have the same effect, certainly the test case I tried did in fact completely eliminate a call to cblas_ddot() when so declared. Since the malloc() results pointed to uninitialised memory, the C compiler was entitled to do anything it pleased anyway. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] blas bindings, why are they so much slower the C?
On Wed, Jun 18, 2008 at 06:03:42PM +0100, Jules Bean wrote: > Anatoly Yakovenko wrote: > >>>#include > >>>#include > >>> > >>>int main() { > >>> int size = 1024; > >>> int ii = 0; > >>> double* v1 = malloc(sizeof(double) * (size)); > >>> double* v2 = malloc(sizeof(double) * (size)); > >>> for(ii = 0; ii < size*size; ++ii) { > >>> double _dd = cblas_ddot(0, v1, size, v2, size); > >>> } > >>> free(v1); > >>> free(v2); > >>>} > >>Your C compiler sees that you're not using the result of cblas_ddot, > >>so it doesn't even bother to call it. That loop never gets run. All > >>your program does at runtime is call malloc and free twice, which is > >>very fast :-) > > > >C doesn't work like that :). > > C compilers can do what they like ;) > > GCC in particular is pretty good at removing dead code, including entire > loops. However it shouldn't eliminate the call to cblas_ddot unless it > thinks cblas_ddot has no side effects at all, which would be surprising > unless it's inlined somehow. Or unless it's been annotated as pure, which it should be. David ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] blas bindings, why are they so much slower the C?
On Wed, Jun 18, 2008 at 09:16:24AM -0700, Anatoly Yakovenko wrote: > >> #include > >> #include > >> > >> int main() { > >> int size = 1024; > >> int ii = 0; > >> double* v1 = malloc(sizeof(double) * (size)); > >> double* v2 = malloc(sizeof(double) * (size)); > >> for(ii = 0; ii < size*size; ++ii) { > >> double _dd = cblas_ddot(0, v1, size, v2, size); > >> } > >> free(v1); > >> free(v2); > >> } > > > > Your C compiler sees that you're not using the result of cblas_ddot, > > so it doesn't even bother to call it. That loop never gets run. All > > your program does at runtime is call malloc and free twice, which is > > very fast :-) > > C doesn't work like that :). functions always get called. but i did > find a problem with my C code, i am incorrectly calling the dot > production function: See a recent article in lwn on pure and const functions to see how gcc is able to perform dead code elimination and CSE, provided its given annotations on the relevant functions. I'd certainly hope that your blas library is properly annotated! > #include > #include > #include > #include > > int main() { >int size = 1024; >int ii = 0; >double dd = 0.0; >double* v1 = malloc(sizeof(double) * (size)); >double* v2 = malloc(sizeof(double) * (size)); >for(ii = 0; ii < size; ++ii) { > v1[ii] = 0.1; > v2[ii] = 0.1; >} >for(ii = 0; ii < size*size; ++ii) { > dd += cblas_ddot(size, v1, 0, v2, 0); >} >free(v1); >free(v2); >printf("%f\n", dd); >return 0; > } > > time ./testdot > 10737418.240187 > > real0m2.200s > user0m2.190s > sys 0m0.010s > > So C is about twice as fast. I can live with that. I suspect that it is your initialization that is the difference. For one thing, you've initialized the arrays to different values, and in your C code you've fused what are two separate loops in your Haskell code. So you've not only given the C compiler an easier loop to run (since you're initializing the array to a constant rather than to a sequence of numbers), but you've also manually optimized that initialization. In fact, this fusion could be precisely the factor of two. Why not see what happens in Haskell if you create just one vector and dot it with itself? (of course, that'll also make the blas call faster, so you'll need to be careful in your interpretation of your results.) David ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] blas bindings, why are they so much slower the C?
Anatoly Yakovenko wrote: #include #include int main() { int size = 1024; int ii = 0; double* v1 = malloc(sizeof(double) * (size)); double* v2 = malloc(sizeof(double) * (size)); for(ii = 0; ii < size*size; ++ii) { double _dd = cblas_ddot(0, v1, size, v2, size); } free(v1); free(v2); } Your C compiler sees that you're not using the result of cblas_ddot, so it doesn't even bother to call it. That loop never gets run. All your program does at runtime is call malloc and free twice, which is very fast :-) C doesn't work like that :). C compilers can do what they like ;) GCC in particular is pretty good at removing dead code, including entire loops. However it shouldn't eliminate the call to cblas_ddot unless it thinks cblas_ddot has no side effects at all, which would be surprising unless it's inlined somehow. Jules ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] blas bindings, why are they so much slower the C?
On Wed, Jun 18, 2008 at 9:16 AM, Anatoly Yakovenko <[EMAIL PROTECTED]> wrote: > C doesn't work like that :) Yes it can. You would have to check the disassembly to be sure, but C compilers can, and do, perform dead code elimination. AGL -- Adam Langley [EMAIL PROTECTED] http://www.imperialviolet.org ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] blas bindings, why are they so much slower the C?
>> #include >> #include >> >> int main() { >> int size = 1024; >> int ii = 0; >> double* v1 = malloc(sizeof(double) * (size)); >> double* v2 = malloc(sizeof(double) * (size)); >> for(ii = 0; ii < size*size; ++ii) { >> double _dd = cblas_ddot(0, v1, size, v2, size); >> } >> free(v1); >> free(v2); >> } > > Your C compiler sees that you're not using the result of cblas_ddot, > so it doesn't even bother to call it. That loop never gets run. All > your program does at runtime is call malloc and free twice, which is > very fast :-) C doesn't work like that :). functions always get called. but i did find a problem with my C code, i am incorrectly calling the dot production function: #include #include #include #include int main() { int size = 1024; int ii = 0; double dd = 0.0; double* v1 = malloc(sizeof(double) * (size)); double* v2 = malloc(sizeof(double) * (size)); for(ii = 0; ii < size; ++ii) { v1[ii] = 0.1; v2[ii] = 0.1; } for(ii = 0; ii < size*size; ++ii) { dd += cblas_ddot(size, v1, 0, v2, 0); } free(v1); free(v2); printf("%f\n", dd); return 0; } time ./testdot 10737418.240187 real0m2.200s user0m2.190s sys 0m0.010s So C is about twice as fast. I can live with that. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] blas bindings, why are they so much slower the C?
On Tue, Jun 17, 2008 at 9:00 PM, Anatoly Yakovenko <[EMAIL PROTECTED]> wrote: > here is the C: > > #include > #include > > int main() { > int size = 1024; > int ii = 0; > double* v1 = malloc(sizeof(double) * (size)); > double* v2 = malloc(sizeof(double) * (size)); > for(ii = 0; ii < size*size; ++ii) { > double _dd = cblas_ddot(0, v1, size, v2, size); > } > free(v1); > free(v2); > } Your C compiler sees that you're not using the result of cblas_ddot, so it doesn't even bother to call it. That loop never gets run. All your program does at runtime is call malloc and free twice, which is very fast :-) ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
[Haskell-cafe] blas bindings, why are they so much slower the C?
here is the C: #include #include int main() { int size = 1024; int ii = 0; double* v1 = malloc(sizeof(double) * (size)); double* v2 = malloc(sizeof(double) * (size)); for(ii = 0; ii < size*size; ++ii) { double _dd = cblas_ddot(0, v1, size, v2, size); } free(v1); free(v2); } this is the haskell: module Main where import Data.Vector.Dense.IO main = do let size = 1024 v1::IOVector Int Double <- newListVector size [0..] v2::IOVector Int Double <- newListVector size [0..] mapM_ (\ ii -> do v1 `getDot` v2) [0..size*size] time ./testdot real0m0.017s user0m0.010s sys 0m0.010s time ./htestdot real0m4.692s user0m4.670s sys 0m0.030s so like 250x difference htestdot.prof is no help Tue Jun 17 20:46 2008 Time and Allocation Profiling Report (Final) htestdot +RTS -p -RTS total time =3.92 secs (196 ticks @ 20 ms) total alloc = 419,653,032 bytes (excludes profiling overheads) COST CENTREMODULE %time %alloc main Main 88.3 83.0 CAFMain 11.7 17.0 individualinherited COST CENTRE MODULE no.entries %time %alloc %time %alloc MAIN MAIN 1 0 0.00.0 100.0 100.0 CAF Main 216 7 11.7 17.0 100.0 100.0 main Main 222 1 88.3 83.088.3 83.0 CAF GHC.Float 187 1 0.00.0 0.00.0 CAF GHC.Handle 168 3 0.00.0 0.00.0 ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe