Re: [Haskell-cafe] Fast number parsing with strict bytestrings [Was: Re: Seemingly subtle change causes large performance variation]
Hello Donald, Friday, June 8, 2007, 5:42:41 AM, you wrote: Previous experience[1] indicates it is pretty hard to write a C line parsing program[2] that that run this fast. And the code, with comments: [2] uses gets() function while your haskell code read whole buffer each time. that is the obvious source of slowness -- Best regards, Bulatmailto:[EMAIL PROTECTED] ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
[Haskell-cafe] Fast number parsing with strict bytestrings [Was: Re: Seemingly subtle change causes large performance variation]
mdanish: Hello, I've been playing with the INTEST problem on SPOJ which demonstrates the ability to write a program which processes large quantities of input data. http://www.spoj.pl/problems/INTEST/ But when I make a slight modification, the program chews up a ton more memory and takes more time: import Control.Monad import Data.Maybe import qualified Data.ByteString.Char8 as B divisibleBy :: Int - Int - Bool a `divisibleBy` n = a `rem` n == 0 main :: IO () main = do [n,k] - (map int . B.split ' ') `fmap` B.getLine :: IO [Int] let doLine :: Int - Int - IO Int doLine r _ = B.getLine = return . testDiv r -- 'return' moved here ^^ What follows is a solution to the original question, and then a dramatic rewrite, showing the fastest way (that I know of) to parse \n separated lists of numbers in Haskell. The results should outperform C fairly well. *** Solution 1: don't be so lazy in the fold. First, look at that lazy fold. A simple fix, try being explict about forcing your accumulator: doLine :: Int - Int - IO Int doLine r _ = B.getLine = \s - return $! testDiv r s And some timing data: Original: $ time ./A in 29359 ./A in 1.52s user 0.06s system 93% cpu 1.679 total Too lazy: $ time ./B in 29359 ./B in 3.84s user 0.26s system 82% cpu 4.957 total Hand back some strictness hints: $ time ./D in 29359 ./D in 1.52s user 0.03s system 94% cpu 1.637 total *** Solution 2: use lazy bytestrings to avoid gunky IO Now, however, I'd give up on that explict getLine stuff, and use a lazy bytestring. Something like this: import Data.Maybe import Data.List import qualified Data.ByteString.Lazy.Char8 as L main :: IO () main = do (l:ls) - L.lines `fmap` L.getContents -- done with IO now. let [n,k] = map int (L.split ' ' l) print . foldl' (test k) 0 . map int . take n $ ls test :: Int - Int - Int - Int test k acc n | n `divisibleBy` k = acc+1 | otherwise = acc int :: L.ByteString - Int int = fst . fromJust . L.readInt divisibleBy :: Int - Int - Bool a `divisibleBy` n = a `rem` n == 0 The general rule for bytestring loops is to avoid IO, and to use lazy bytestrings if you need 'lines'. Also, program in a high level, using combinators, rather than your own loops, so that fusion will kick in (we get some list fusion here). And running it: $ time ./C in 29359 ./C in 1.22s user 0.04s system 94% cpu 1.335 total Ok, faster, and cleaner. Avoid mixing IO into your code! *** Solution 3: 4x faster by processing strict cache chunks Now the fun part. The following code is the fastest way I know to process lists of numbers (in any language). Its' based on similar code I wrote for the language shootout. The key trick is to use lazy bytestrings *only* as a method for filling the cache with newline-aligned chunks of numbers. Once you've got that perfectly-sized chunk, walk its lines, and process them. This is all done in Haskell, and relies on an understanding of the low level details of bytestring optimisations. The general framework could be reused for any code that needs to process a list of numbers in a file, where you care about speed. It performs as follows: $ time ./F in 29359 ./F in 0.24s user 0.01s system 76% cpu 0.327 total Pretty fast.. Previous experience[1] indicates it is pretty hard to write a C line parsing program[2] that that run this fast. And the code, with comments: 1. http://shootout.alioth.debian.org/gp4/benchmark.php?test=sumcollang=ghcid=0 2. http://shootout.alioth.debian.org/gp4/benchmark.php?test=sumcollang=gccid=2 {-# OPTIONS -fbang-patterns #-} import Data.Char import Data.Maybe import Data.ByteString.Base import qualified Data.ByteString.Char8 as S import qualified Data.ByteString.Lazy.Char8 as L main = do ss - L.getContents -- done with IO now. let (l,ls) = L.break (=='\n') ss -- don't need count, we're allocating lazily k = fst . fromJust . L.readInt . last . L.split ' ' $ l file = L.toChunks (L.tail ls) -- a lazy list of strict cache chunks print $ process k 0 file divisibleBy :: Int - Int - Bool a `divisibleBy` n = a `rem` n == 0 -- - -- -- Optimised parsing of strict bytestrings representing \n separated numbers -- -- -- we have the file as a list of cache chunks -- align them on \n boundaries, and process each chunk separately -- when the next chunk is demanded, it will be read in. -- process :: Int - Int - [S.ByteString] - Int process k i [] = i process k !i (s:t:ts) | S.last s /= '\n' = process k (add k i s') ts' where (s',r) = S.breakEnd (=='\n') s ts'=
Re: [Haskell-cafe] Fast number parsing with strict bytestrings [Was: Re: Seemingly subtle change causes large performance variation]
dons: mdanish: Hello, I've been playing with the INTEST problem on SPOJ which demonstrates the ability to write a program which processes large quantities of input data. http://www.spoj.pl/problems/INTEST/ But when I make a slight modification, the program chews up a ton more memory and takes more time: import Control.Monad import Data.Maybe import qualified Data.ByteString.Char8 as B divisibleBy :: Int - Int - Bool a `divisibleBy` n = a `rem` n == 0 main :: IO () main = do [n,k] - (map int . B.split ' ') `fmap` B.getLine :: IO [Int] let doLine :: Int - Int - IO Int doLine r _ = B.getLine = return . testDiv r -- 'return' moved here ^^ And just following up with some GC statistics: Original, 95% cpu 1.668 total ghc: 258766440 bytes, 452 GCs, 3036/3036 avg/max bytes residency (1 samples), 3M in use, 0.00 INIT (0.00 elapsed), 1.51 MUT (1.63 elapsed), 0.01 GC (0.03 elapsed) :ghc Too lazy: 96% cpu 4.219 total ghc: 278683532 bytes, 495 GCs, -- 14729345/52642396 avg/max bytes residency (7 samples), -- 85M in use, 0.00 INIT (0.00 elapsed), 1.68 MUT (1.81 elapsed), -- 2.07 GC (2.36 elapsed) :ghc (clear space leak) Fixing above program with $!: 94% cpu 1.656 total ghc: 257394052 bytes 451 GCs, --2288/2288 avg/max bytes residency (1 samples), --1M in use, 0.00 INIT (0.00 elapsed), 1.49 MUT (1.64 elapsed), 0.01 GC (0.01 elapsed) :ghc Using lazy bytestrings for pure processing: 90% cpu 1.424 total ghc: 219403252 bytes, 410 GCs, 70527/74236 avg/max bytes residency (10 samples), 2M in use, 0.00 INIT (0.00 elapsed), 1.25 MUT (1.40 elapsed), 0.01 GC (0.01 elapsed) :ghc And the killer strict chunk parser: 78% cpu 0.327 total ghc: 20685092 bytes, --38 GCs, --81348/81348 avg/max bytes residency (1 samples), 2M in use, 0.00 INIT (0.00 elapsed), --0.21 MUT (0.32 elapsed), 0.00 GC (0.00 elapsed) :ghc Very little data shuffled around in the last one. -- Don ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Fast number parsing with strict bytestrings [Was: Re: Seemingly subtle change causes large performance variation]
dons: dons: mdanish: Hello, I've been playing with the INTEST problem on SPOJ which demonstrates the ability to write a program which processes large quantities of input data. http://www.spoj.pl/problems/INTEST/ But when I make a slight modification, the program chews up a ton more memory and takes more time: import Control.Monad import Data.Maybe import qualified Data.ByteString.Char8 as B divisibleBy :: Int - Int - Bool a `divisibleBy` n = a `rem` n == 0 main :: IO () main = do [n,k] - (map int . B.split ' ') `fmap` B.getLine :: IO [Int] let doLine :: Int - Int - IO Int doLine r _ = B.getLine = return . testDiv r -- 'return' moved here ^^ Original, 95% cpu 1.668 total ghc: 258766440 bytes, 452 GCs, 3036/3036 avg/max bytes residency (1 samples), 3M in use, 0.00 INIT (0.00 elapsed), 1.51 MUT (1.63 elapsed), 0.01 GC (0.03 elapsed) :ghc And the killer strict chunk parser: 78% cpu 0.327 total ghc: 20685092 bytes, --38 GCs, --81348/81348 avg/max bytes residency (1 samples), 2M in use, 0.00 INIT (0.00 elapsed), --0.21 MUT (0.32 elapsed), 0.00 GC (0.00 elapsed) :ghc I note there was a missing constructor specialisation happening in the calls to 'add', in the good program. We can fix that with some well place inline pragma: add :: Int - Int - S.ByteString - Int add k i s = if S.null s then i else test k i (parse x) xs where (x,xs) = uncons s {-# INLINE add #-} Before, GHC -ddump-simpl-stats reported: 22 RuleFired 2 SC:$wprocess1 4 SC:$wprocess2 2 SC:comb1 After: 24 RuleFired 4 SC:$wprocess1 4 SC:$wprocess2 2 SC:comb1 And timing stats: $ time ./F in 29359 ./F in 0.20s user 0.04s system 81% cpu 0.288 total So some 10% better. It's often a good idea to inline non-recursive wrapper functions like this, in bytestring code. -- Don ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe