Wow that sped it up 5 times.I see that boxed Vector is 25% faster than IOArray.What is the difference and when to use Vector,when IOArray?Thanks! bmaxa@maxa:~/examples$ time ./Cumul +RTS -A1600Mboxed arraylast 262486571 seconds 1.196unboxed arraylast 262486571 seconds 0.748boxed vectorlast 262486571 seconds 0.908unboxed vectorlast 262486571 seconds 0.720 real 0m3.805suser 0m3.428ssys 0m0.372s
> Date: Sat, 1 Dec 2012 12:20:37 -0500 > Subject: Re: [Haskell-cafe] Why is boxed mutable array so slow? > From: [email protected] > To: [email protected] > CC: [email protected] > > The obvious difference between boxed and unboxed arrays is that the > boxed arrays are full of pointers to heap allocated objects. This > means you pay indirection to access the values, much more time in GC > spent chasing pointers (though card marking helps), and generally do > more allocation. > > Compare the GC stats below, for > > * Boxed vector: 88M bytes copied; 75% of time in GC, 0.472s > * Unboxed vector: 11k bytes copied, 1.3% of time in GC, 0.077s > > So there's your main answer. The increased data density of unboxed > arrays also helps a too. > > Now, you can help out the GC signifcantly by hinting at how much > you're going to allocated in the youngest generation (see the > ghc-gc-tune app for a methodical approach to this, though it needs > updating to ghc 7 -- > http://donsbot.wordpress.com/2010/07/05/ghc-gc-tune-tuning-haskell-gc-settings-for-fun-and-profit/ > and > http://stackoverflow.com/questions/3171922/ghcs-rts-options-for-garbage-collection > ). > > Use the +RTS -A flag to set an initial youngest generation heap size > to the size of your array, and watch the GC cost disappear. For our > boxed vector, we'd use +RTS -A50M, resulting in: > > * Boxed vector: 8k copied, 1% of time in GC, 0.157s > > So not bad. 3x speedup through a RTS flag. -A is very useful if you > are working with boxed, mutable arrays. > > For reference, there's a generic version below that specializes based > on the vector type parameter. > > --------------------------------- > > {-# LANGUAGE BangPatterns #-} > > import System.CPUTime > import Text.Printf > import Data.Int > import Control.DeepSeq > import System.Mem > > import qualified Data.Vector.Mutable as V > import qualified Data.Vector.Unboxed.Mutable as U > import qualified Data.Vector.Generic.Mutable as G > > main :: IO() > main = do > > -- (G.new n' :: IO (V.IOVector Int32)) >>= test' "boxed vector" > -- performGC > (G.new n' :: IO (U.IOVector Int32)) >>= test' "unboxed vector" > performGC > > test' s a = do > putStrLn s > begin <- getCPUTime > init'' a > partial_sum' a > end <- getCPUTime > let diff = (fromIntegral (end - begin)) / (10**12) > last <- G.read a (n'-1) > printf "last %d seconds %.3f\n" last (diff::Double) > > n' :: Int > n' = 1000 * 1000 > > init'' !a = init 0 (n'-1) > where > init :: Int -> Int -> IO () > init !k !n > | k > n = return () > | otherwise = do > let !x = fromIntegral $ k + k `div` 3 > G.write a k x > init (k+1) n > > > > partial_sum' !a = do > k <- G.read a 0 > ps 1 (n'-1) k > where > ps :: Int -> Int -> Int32 -> IO () > ps i n s > | i > n = return () > | otherwise = do > k <- G.read a i > let !l = fromIntegral $ s + k > G.write a i l > ps (i+1) n l > > > --------------------------------- > > $ time ./A +RTS -s > boxed vector > last 945735787 seconds 0.420 > 40,121,448 bytes allocated in the heap > 88,355,272 bytes copied during GC > 24,036,456 bytes maximum residency (6 sample(s)) > 380,632 bytes maximum slop > 54 MB total memory in use (0 MB lost due to fragmentation) > > %GC time 75.2% (75.9% elapsed) > > Alloc rate 359,655,602 bytes per MUT second > > ./A +RTS -s 0.40s user 0.07s system 98% cpu 0.475 total > > > $ time ./A +RTS -s > unboxed vector > last 945735787 seconds 0.080 > 4,113,568 bytes allocated in the heap > 11,288 bytes copied during GC > 4,003,256 bytes maximum residency (3 sample(s)) > 182,856 bytes maximum slop > 5 MB total memory in use (0 MB lost due to fragmentation) > > %GC time 1.3% (1.3% elapsed) > > Alloc rate 51,416,660 bytes per MUT second > > ./A +RTS -s 0.08s user 0.01s system 98% cpu 0.088 total > > > $ time ./A +RTS -A50M -s > boxed vector > last 945735787 seconds 0.127 > 40,121,504 bytes allocated in the heap > 8,032 bytes copied during GC > 44,704 bytes maximum residency (2 sample(s)) > 20,832 bytes maximum slop > 59 MB total memory in use (0 MB lost due to fragmentation) > > %GC time 1.0% (1.0% elapsed) > > Productivity 97.4% of total user, 99.6% of total elapsed > > ./A +RTS -A50M -s 0.10s user 0.05s system 97% cpu 0.157 total > > > > --------------------------------- > > > On Sat, Dec 1, 2012 at 11:09 AM, Branimir Maksimovic <[email protected]> > wrote: > > I have made benchmark test inspired by > > http://lemire.me/blog/archives/2012/07/23/is-cc-worth-it/ > > > > What surprised me is that unboxed array is much faster than boxed array. > > Actually boxed array performance is on par with standard Haskell list > > which is very slow. > > All in all even unboxed array is about 10 times slower than Java version. > > I don't understand why is even unboxed array so slow. > > But! unboxed array consumes least amount of RAM. > > (warning, program consumes more than 3gb of ram) > > > > bmaxa@maxa:~/examples$ time ./Cumul > > boxed array > > last 262486571 seconds 4.972 > > unboxed array > > last 262486571 seconds 0.776 > > list > > last 262486571 seconds 6.812 > > > > real 0m13.086s > > user 0m11.996s > > sys 0m1.080s > > > > ------------------------------------------------------------------------- > > {-# LANGUAGE CPP, BangPatterns #-} > > import System.CPUTime > > import Text.Printf > > import Data.Array.IO > > import Data.Array.Base > > import Data.Int > > import Control.DeepSeq > > import System.Mem > > > > main :: IO() > > main = do > > (newArray_ (0,n'-1) :: IO(A)) >>= test "boxed array" > > performGC > > (newArray_ (0,n'-1) :: IO(B)) >>= test "unboxed array" > > performGC > > begin <- getCPUTime > > printf "list\nlast %d" $ last $ force $ take n' $ sum' data' > > end <- getCPUTime > > let diff = (fromIntegral (end - begin)) / (10^12) > > printf " seconds %.3f\n" (diff::Double) > > > > test s a = do > > putStrLn s > > begin <- getCPUTime > > init' a > > partial_sum a > > end <- getCPUTime > > let diff = (fromIntegral (end - begin)) / (10^12) > > last <- readArray a (n'-1) > > printf "last %d seconds %.3f\n" last (diff::Double) > > > > n' :: Int > > n' = 50 * 1000 * 1000 > > > > type A = IOArray Int Int32 > > type B = IOUArray Int Int32 > > > > init' a = do > > (_,n) <- getBounds a > > init a 0 n > > where > > init a k n > > | k > n = return () > > | otherwise = do > > let !x = fromIntegral $ k + k `div` 3 > > unsafeWrite a k x > > init a (k+1) n > > > > partial_sum a = do > > (_,n) <- getBounds a > > k <- unsafeRead a 0 > > ps a 1 n k > > where > > ps a i n s > > | i > n = return () > > | otherwise = do > > k <- unsafeRead a i > > let !l = fromIntegral $ s + k > > unsafeWrite a i l > > ps a (i+1) n l > > > > data' :: [Int32] > > data' = [k + k `div` 3 | k <- [0..] ] > > > > sum' = scanl1 (+) > > > > > > _______________________________________________ > > Haskell-Cafe mailing list > > [email protected] > > http://www.haskell.org/mailman/listinfo/haskell-cafe > >
_______________________________________________ Haskell-Cafe mailing list [email protected] http://www.haskell.org/mailman/listinfo/haskell-cafe
