Re: [Haskell-cafe] How to improve speed? (MersenneTwister is several times slower than C version)
Hello isto, Thursday, November 2, 2006, 1:16:55 AM, you wrote: I have tried to do different things but now I'm stuck. unsafeRead and unsafeWrite improved a bit the lazy (STUArray-version) and why you think it's a lazy? :) ST monad is just the same as IO monad internally, only types are different (there is also Lazy.ST monad - this is really lazy) 10-20 times difference is typical for GHC programs. i think that better results are observed only when C version is really bound by memory speed (that is rather typical situation) use JHC if you need to generate really fast code. or use ocaml/clean -- Best regards, Bulatmailto:[EMAIL PROTECTED] ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] How to improve speed? (MersenneTwister is several times slower than C version)
bulat.ziganshin: Hello isto, Thursday, November 2, 2006, 1:16:55 AM, you wrote: I have tried to do different things but now I'm stuck. unsafeRead and unsafeWrite improved a bit the lazy (STUArray-version) and why you think it's a lazy? :) ST monad is just the same as IO monad internally, only types are different (there is also Lazy.ST monad - this is really lazy) 10-20 times difference is typical for GHC programs. ! It's really more like 2-4x. Sometimes better than C. Where's this huge figure coming from Bulat? If you have code that behaves like this, you should report it. -- Don ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] How to improve speed? (MersenneTwister is several times slower than C version)
Oh, sorry, I thought your version was a rewritten version of mine. :) The names are so similar, after all. On Nov 2, 2006, at 02:26 , isto wrote: Hi, When writing IO version, I wasn't aware of other twister versions, and the only reason is/was that it was easiest to me and that I knew (believed) that plain lists would have been inefficient. I just wanted to see and learn, how close to C version this can be made. (And still do.) There were some good suggestions on this thread - next I'll try to get grasp on how to apply the suggestions and do something... br, Isto ke, 2006-11-01 kello 22:04 -0500, Lennart Augustsson kirjoitti: The whole point of writing the Mersenne Twister was that I wanted to show how a stateful computation could be encapsulated in the ST monad and none of it showing up outside. This aspect of the code is totally gone now when everything is in the IO monad. Is there some good reason to have it in the IO monad? ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] How to improve speed? (MersenneTwister is several times slower than C version)
Now, this will be hard to get close the the highly tuned C. Possibly its doable. The main tricks are documented here: http://haskell.org/haskellwiki/Performance/GHC#Unboxed_types Inspecting the Core to ensure the math is being inlined and unboxed will be the most crucial issue, I'd imagine. Then again, an FFI binding to mersenne.c is also a good idea :) -- Don isto.aho: Hi all, On HaWiki was an announcement of MersenneTwister made by Lennart Augustsson. On a typical run to find out 1000th rnd num the output is (code shown below): $ time ./testMTla Testing Mersenne Twister. Result is [3063349438] real0m4.925s user0m4.856s I was exercising with the very same algorithm and tried to make it efficient (by using IOUArray): now a typical run looks like (code shown below): $ time ./testMT Testing Mersenne Twister. 3063349438 real0m3.032s user0m3.004s The original C-version (modified so that only the last number is shown) gives typically $ time ./mt19937ar outputs of genrand_int32() 3063349438 real0m0.624s user0m0.616s Results are similar with 64 bit IOUArray against 64 bit C variant. C seems to work about 5 to 10 times faster in this case. I have tried to do different things but now I'm stuck. unsafeRead and unsafeWrite improved a bit the lazy (STUArray-version) and IOUArray-versions but not very much. I took a look of Core file but then, I'm not sure where the boxed values are ok. E.g. should IOUArray Int Word64 be replaced with something else? Any hints and comments on how to improve the efficiency and make everything better will be appreciated a lot! br, Isto - testMTla.hs (MersenneTwister, see HaWiki) module Main where -- ghc -O3 -optc-O3 -optc-ffast-math -fexcess-precision --make testMTla import MersenneTwister main = do putStrLn Testing Mersenne Twister. let mt = mersenneTwister 100 w = take 1 (drop 999 mt) -- w = take 1 (drop 99 mt) putStrLn $ Result is ++ (show w) - - testMT.hs module Main where -- Compile eg with -- ghc -O3 -optc-O3 -optc-ffast-math -fexcess-precision --make testMT import Mersenne genRNums32 :: MT32 - Int - IO (MT32) genRNums32 mt nCnt = gRN mt nCnt where gRN :: MT32 - Int - IO (MT32) gRN mt nCnt | mt `seq` nCnt `seq` False = undefined gRN mt 1= do (r,mt') - next32 mt putStrLn $ (show r) return mt' gRN mt nCnt = do (r,mt') - next32 mt gRN mt' $! (nCnt-1) main = do putStrLn Testing Mersenne Twister. mt32 - initialiseGenerator32 100 genRNums32 mt32 1000 - - Mersenne.hs (sorry for linewraps) module Mersenne where import Data.Bits import Data.Word import Data.Array.Base import Data.Array.MArray import Data.Array.IO -- import System.Random data MT32 = MT32 (IOUArray Int Word32) Int data MT64 = MT64 (IOUArray Int Word64) Int last32bitsof :: Word32 - Word32 last32bitsof a = a .. 0x -- == (2^32-1) lm32 = 0x7fff :: Word32 um32 = 0x8000 :: Word32 mA32 = 0x9908b0df :: Word32 -- == 2567483615 -- Array of length 624. initialiseGenerator32 :: Int - IO MT32 initialiseGenerator32 seed = do let s = last32bitsof (fromIntegral seed)::Word32 mt - newArray (0,623) (0::Word32) unsafeWrite mt 0 s iG mt s 1 mt' - generateNumbers32 mt return (MT32 mt' 0) where iG :: (IOUArray Int Word32) - Word32 - Int - IO (IOUArray Int Word32) iG mt lastNro n | n == 624= return mt | otherwise = do let n1 = lastNro `xor` (shiftR lastNro 30) new = (1812433253 * n1 + (fromIntegral n)::Word32) unsafeWrite mt n new iG mt new (n+1) generateNumbers32 :: (IOUArray Int Word32) - IO (IOUArray Int Word32) generateNumbers32 mt = gLoop 0 mt where gLoop :: Int - (IOUArray Int Word32) - IO (IOUArray Int Word32) gLoop i mt | i==623 = do wL - unsafeRead mt 623 w0 - unsafeRead mt 0 w396 - unsafeRead mt 396 let y = (wL .. um32) .|. (w0 .. lm32) :: Word32 if even y then unsafeWrite mt 623 (w396 `xor` (shiftR y 1)) else unsafeWrite mt 623 (w396 `xor` (shiftR y 1) `xor` mA32) return mt
Re: [Haskell-cafe] How to improve speed? (MersenneTwister is several times slower than C version)
A big problem with the Mersenne Twister is the shifts. As has been noted elsewhere, ghc doesn't do such a great job on those. -- Lennart On Nov 1, 2006, at 20:17 , Donald Bruce Stewart wrote: Now, this will be hard to get close the the highly tuned C. Possibly its doable. The main tricks are documented here: http://haskell.org/haskellwiki/Performance/GHC#Unboxed_types Inspecting the Core to ensure the math is being inlined and unboxed will be the most crucial issue, I'd imagine. Then again, an FFI binding to mersenne.c is also a good idea :) -- Don isto.aho: Hi all, On HaWiki was an announcement of MersenneTwister made by Lennart Augustsson. On a typical run to find out 1000th rnd num the output is (code shown below): $ time ./testMTla Testing Mersenne Twister. Result is [3063349438] real0m4.925s user0m4.856s I was exercising with the very same algorithm and tried to make it efficient (by using IOUArray): now a typical run looks like (code shown below): $ time ./testMT Testing Mersenne Twister. 3063349438 real0m3.032s user0m3.004s The original C-version (modified so that only the last number is shown) gives typically $ time ./mt19937ar outputs of genrand_int32() 3063349438 real0m0.624s user0m0.616s Results are similar with 64 bit IOUArray against 64 bit C variant. C seems to work about 5 to 10 times faster in this case. I have tried to do different things but now I'm stuck. unsafeRead and unsafeWrite improved a bit the lazy (STUArray-version) and IOUArray-versions but not very much. I took a look of Core file but then, I'm not sure where the boxed values are ok. E.g. should IOUArray Int Word64 be replaced with something else? Any hints and comments on how to improve the efficiency and make everything better will be appreciated a lot! br, Isto - testMTla.hs (MersenneTwister, see HaWiki) module Main where -- ghc -O3 -optc-O3 -optc-ffast-math -fexcess-precision --make testMTla import MersenneTwister main = do putStrLn Testing Mersenne Twister. let mt = mersenneTwister 100 w = take 1 (drop 999 mt) -- w = take 1 (drop 99 mt) putStrLn $ Result is ++ (show w) - - testMT.hs module Main where -- Compile eg with -- ghc -O3 -optc-O3 -optc-ffast-math -fexcess-precision --make testMT import Mersenne genRNums32 :: MT32 - Int - IO (MT32) genRNums32 mt nCnt = gRN mt nCnt where gRN :: MT32 - Int - IO (MT32) gRN mt nCnt | mt `seq` nCnt `seq` False = undefined gRN mt 1= do (r,mt') - next32 mt putStrLn $ (show r) return mt' gRN mt nCnt = do (r,mt') - next32 mt gRN mt' $! (nCnt-1) main = do putStrLn Testing Mersenne Twister. mt32 - initialiseGenerator32 100 genRNums32 mt32 1000 - - Mersenne.hs (sorry for linewraps) module Mersenne where import Data.Bits import Data.Word import Data.Array.Base import Data.Array.MArray import Data.Array.IO -- import System.Random data MT32 = MT32 (IOUArray Int Word32) Int data MT64 = MT64 (IOUArray Int Word64) Int last32bitsof :: Word32 - Word32 last32bitsof a = a .. 0x -- == (2^32-1) lm32 = 0x7fff :: Word32 um32 = 0x8000 :: Word32 mA32 = 0x9908b0df :: Word32 -- == 2567483615 -- Array of length 624. initialiseGenerator32 :: Int - IO MT32 initialiseGenerator32 seed = do let s = last32bitsof (fromIntegral seed)::Word32 mt - newArray (0,623) (0::Word32) unsafeWrite mt 0 s iG mt s 1 mt' - generateNumbers32 mt return (MT32 mt' 0) where iG :: (IOUArray Int Word32) - Word32 - Int - IO (IOUArray Int Word32) iG mt lastNro n | n == 624= return mt | otherwise = do let n1 = lastNro `xor` (shiftR lastNro 30) new = (1812433253 * n1 + (fromIntegral n)::Word32) unsafeWrite mt n new iG mt new (n+1) generateNumbers32 :: (IOUArray Int Word32) - IO (IOUArray Int Word32) generateNumbers32 mt = gLoop 0 mt where gLoop :: Int - (IOUArray Int Word32) - IO (IOUArray Int Word32) gLoop i mt | i==623 = do wL - unsafeRead mt 623 w0 - unsafeRead mt 0 w396 - unsafeRead mt 396 let y = (wL .. um32) .|. (w0 .. lm32) :: Word32 if even y then unsafeWrite mt 623 (w396 `xor`
Re: [Haskell-cafe] How to improve speed? (MersenneTwister is several times slower than C version)
On 11/1/06, isto [EMAIL PROTECTED] wrote: Hi all, On HaWiki was an announcement of MersenneTwister made by Lennart Augustsson. On a typical run to find out 1000th rnd num the output is (code shown below): $ time ./testMTla Testing Mersenne Twister. Result is [3063349438] real0m4.925s user0m4.856s I was exercising with the very same algorithm and tried to make it efficient (by using IOUArray): now a typical run looks like (code shown below): $ time ./testMT Testing Mersenne Twister. 3063349438 real0m3.032s user0m3.004s The original C-version (modified so that only the last number is shown) gives typically $ time ./mt19937ar outputs of genrand_int32() 3063349438 real0m0.624s user0m0.616s Results are similar with 64 bit IOUArray against 64 bit C variant. C seems to work about 5 to 10 times faster in this case. I have tried to do different things but now I'm stuck. unsafeRead and unsafeWrite improved a bit the lazy (STUArray-version) and IOUArray-versions but not very much. I took a look of Core file but then, I'm not sure where the boxed values are ok. E.g. should IOUArray Int Word64 be replaced with something else? Any hints and comments on how to improve the efficiency and make everything better will be appreciated a lot! br, Isto Greetings, Applying a few optimations can make it about 3x faster. 1. Hoist the array out of your loops. (See generateNumbers32, initialiseGenerator32 and genRNums). 2. Don't create too many new MT32 boxes. Most of the time is spent in 'next32' and changing its type to 'IOUArray Int Word32 - Int - IO (Word32, Int)' makes it much faster. 3. Demand more inlining. If you're using GHC, -funfolding-use-threshold=16 will substantially improve the performance. Using 'seq' is generally a bad idea. It can worsen the performance if not used carefully and GHCs strictness analyser is usually good enough. I used the profiler and -ddump-simpl to analyse this program. Donald suggested manual unboxing. However, in this case it won't help much (if at all) since GHC is doing such an excellent job on its own. -- Cheers, Lemmih ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] How to improve speed? (MersenneTwister is several times slower than C version)
On 11/2/06, Lennart Augustsson [EMAIL PROTECTED] wrote: A big problem with the Mersenne Twister is the shifts. As has been noted elsewhere, ghc doesn't do such a great job on those. Actually, the shifts are only evaluated once (hurrah for lazy evaluation) and with -funfolding-use-threshold=16 they're all compiled to unchecked primitives (GHC.Prim.uncheckedShiftRL#). -- Cheers, Lemmih ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] How to improve speed? (MersenneTwister is several times slower than C version)
The whole point of writing the Mersenne Twister was that I wanted to show how a stateful computation could be encapsulated in the ST monad and none of it showing up outside. This aspect of the code is totally gone now when everything is in the IO monad. Is there some good reason to have it in the IO monad? -- Lennart On Nov 1, 2006, at 20:51 , Lemmih wrote: On 11/1/06, isto [EMAIL PROTECTED] wrote: Hi all, On HaWiki was an announcement of MersenneTwister made by Lennart Augustsson. On a typical run to find out 1000th rnd num the output is (code shown below): $ time ./testMTla Testing Mersenne Twister. Result is [3063349438] real0m4.925s user0m4.856s I was exercising with the very same algorithm and tried to make it efficient (by using IOUArray): now a typical run looks like (code shown below): $ time ./testMT Testing Mersenne Twister. 3063349438 real0m3.032s user0m3.004s The original C-version (modified so that only the last number is shown) gives typically $ time ./mt19937ar outputs of genrand_int32() 3063349438 real0m0.624s user0m0.616s Results are similar with 64 bit IOUArray against 64 bit C variant. C seems to work about 5 to 10 times faster in this case. I have tried to do different things but now I'm stuck. unsafeRead and unsafeWrite improved a bit the lazy (STUArray-version) and IOUArray-versions but not very much. I took a look of Core file but then, I'm not sure where the boxed values are ok. E.g. should IOUArray Int Word64 be replaced with something else? Any hints and comments on how to improve the efficiency and make everything better will be appreciated a lot! br, Isto Greetings, Applying a few optimations can make it about 3x faster. 1. Hoist the array out of your loops. (See generateNumbers32, initialiseGenerator32 and genRNums). 2. Don't create too many new MT32 boxes. Most of the time is spent in 'next32' and changing its type to 'IOUArray Int Word32 - Int - IO (Word32, Int)' makes it much faster. 3. Demand more inlining. If you're using GHC, -funfolding-use-threshold=16 will substantially improve the performance. Using 'seq' is generally a bad idea. It can worsen the performance if not used carefully and GHCs strictness analyser is usually good enough. I used the profiler and -ddump-simpl to analyse this program. Donald suggested manual unboxing. However, in this case it won't help much (if at all) since GHC is doing such an excellent job on its own. -- Cheers, Lemmih ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] How to improve speed? (MersenneTwister is several times slower than C version)
lemmih: On 11/1/06, isto [EMAIL PROTECTED] wrote: Hi all, On HaWiki was an announcement of MersenneTwister made by Lennart Augustsson. On a typical run to find out 1000th rnd num the output is (code shown below): $ time ./testMTla Testing Mersenne Twister. Result is [3063349438] real0m4.925s user0m4.856s I was exercising with the very same algorithm and tried to make it efficient (by using IOUArray): now a typical run looks like (code shown below): $ time ./testMT Testing Mersenne Twister. 3063349438 real0m3.032s user0m3.004s The original C-version (modified so that only the last number is shown) gives typically $ time ./mt19937ar outputs of genrand_int32() 3063349438 real0m0.624s user0m0.616s Results are similar with 64 bit IOUArray against 64 bit C variant. C seems to work about 5 to 10 times faster in this case. I have tried to do different things but now I'm stuck. unsafeRead and unsafeWrite improved a bit the lazy (STUArray-version) and IOUArray-versions but not very much. I took a look of Core file but then, I'm not sure where the boxed values are ok. E.g. should IOUArray Int Word64 be replaced with something else? Any hints and comments on how to improve the efficiency and make everything better will be appreciated a lot! br, Isto Greetings, Applying a few optimations can make it about 3x faster. 1. Hoist the array out of your loops. (See generateNumbers32, initialiseGenerator32 and genRNums). 2. Don't create too many new MT32 boxes. Most of the time is spent in 'next32' and changing its type to 'IOUArray Int Word32 - Int - IO (Word32, Int)' makes it much faster. 3. Demand more inlining. If you're using GHC, -funfolding-use-threshold=16 will substantially improve the performance. Using 'seq' is generally a bad idea. It can worsen the performance if not used carefully and GHCs strictness analyser is usually good enough. I used the profiler and -ddump-simpl to analyse this program. Donald suggested manual unboxing. However, in this case it won't help much (if at all) since GHC is doing such an excellent job on its own. I wasn't suggesting manual unboxing, more that you should carefully inspect the Core, and tune with bang patterns where necessary. -funfolding-use-threshold=16 is a good idea though. or =100 ;) -- Don ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] How to improve speed? (MersenneTwister is several times slower than C version)
Hi, When writing IO version, I wasn't aware of other twister versions, and the only reason is/was that it was easiest to me and that I knew (believed) that plain lists would have been inefficient. I just wanted to see and learn, how close to C version this can be made. (And still do.) There were some good suggestions on this thread - next I'll try to get grasp on how to apply the suggestions and do something... br, Isto ke, 2006-11-01 kello 22:04 -0500, Lennart Augustsson kirjoitti: The whole point of writing the Mersenne Twister was that I wanted to show how a stateful computation could be encapsulated in the ST monad and none of it showing up outside. This aspect of the code is totally gone now when everything is in the IO monad. Is there some good reason to have it in the IO monad? ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe