Re: [Haskell-cafe] Space leak in hexpat-0.20.3/List-0.5.1
Wren Thornton wrote: So I'm processing a large XML file which is a database of about 170k entries, each of which is a reasonable enough size on its own, and I only need streaming access to the database (basically printing out summary data for each entry). Excellent, sounds like a job for SAX. Indeed a good job for a SAX-like parser. XMLIter is exactly such parser, and it generates event stream quite like that of Expat. Also you application is somewhat similar to the following http://okmij.org/ftp/Haskell/Iteratee/XMLookup.hs So, it superficially seems XMLIter should be up for the task. Can you explain which elements your are counting? BTW, xml_enum already checks for the well-formedness of XML (including the start-end tag balance, and many more criteria). One can assume that the XMLStream corresponds to the well-formed document and only count the desired start tags (or end tags, for that matter). ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
[Haskell-cafe] Space leak in hexpat-0.20.3/List-0.5.1
Hello all, So I'm processing a large XML file which is a database of about 170k entries, each of which is a reasonable enough size on its own, and I only need streaming access to the database (basically printing out summary data for each entry). Excellent, sounds like a job for SAX. However, after whipping up a simplified version of the program using hexpat, there's a space leak. Near as I can tell, it's not a problem with my code, it's a problem with Data.List.Class (or hexpat's use thereof). The simplified code follows, just compile it for profiling and use hp2ps to see what I mean. The file I'm running it on can be found at: ftp://ftp.monash.edu.au/pub/nihongo/JMdict.gz Any ideas on what the problem really is, or how to fix it? module JMdict (main) where import Text.XML.Expat.SAX (SAXEvent(..)) import qualified Text.XML.Expat.SAX as SAX import Text.XML.Expat.Tree (NodeG(..)) import qualified Text.XML.Expat.Tree as DOM import qualified Text.XML.Expat.Proc as Proc import qualified Text.XML.Expat.Internal.NodeClass as Node import qualified Data.ByteString.Lazy as BL import Data.Char(isSpace) import Data.Text.IO as TIO import qualified Data.Textas T import Control.Applicative (($)) import Control.Monad(forM_) import qualified System.IOas IO import qualified System.Environment as Sys (getArgs) import qualified System.Exit as Sys (exitFailure) import qualified System.Directory as Sys (doesFileExist, getPermissions, readable) -- | A variant of 'Control.Monad.unless' for when the boolean is -- also monadic. unlessM :: Monad m = m Bool - m () - m () unlessM mb handle = do b - mb if b then return () else handle -- | If the file does not exist or is not readable, then crash the -- program. assertFileExistsReadable :: FilePath - IO () assertFileExistsReadable file = do unlessM (Sys.doesFileExist file) $ do IO.hPutStrLn IO.stderr $ No such file: ++file Sys.exitFailure unlessM (Sys.readable $ Sys.getPermissions file) $ do IO.hPutStrLn IO.stderr $ File not readable: ++file Sys.exitFailure main :: IO () main = do files - Sys.getArgs forM_ files $ \file - do assertFileExistsReadable file countElements 0 . filter (not . isWhitespace) . dropPreamble . SAX.parse SAX.defaultParseOptions = BL.readFile file where dropPreamble (StartElement t _ : xs) | t == T.pack JMdict = xs dropPreamble (_:xs) = dropPreamble xs dropPreamble [] = [] isWhitespace (CharacterData c) | T.all isSpace c = True isWhitespace _ = False countElements :: Int - [SAXEvent T.Text T.Text] - IO () countElements n [] = print n countElements n xs = case anyElement xs of (Left err, xs') - fail $ show err ++: ++ show (take 3 xs') (Right ell, xs') - do print (n+1) (countElements $! n+1) xs' data ElementError = EmptyStream | NoStartElement | EndOfStream | InvalidXML deriving (Read, Show, Eq) -- | Split an event stream into an initial element and the remainder -- of the stream. Use 'DOM.saxToTree' to convert the element to a -- tree. anyElement :: (Eq tag) = [SAXEvent tag text] - (Either ElementError [SAXEvent tag text], [SAXEvent tag text]) anyElement = start where start [] = (Left EmptyStream, []) start xxs@(x:xs) = case x of StartElement t _ - loop [t] (x:) xs _- (Left NoStartElement, xxs) loop _ _ [] = (Left EndOfStream, []) loop [] k xs = (Right (k []), xs) loop tts@(t:ts) k xxs@(x:xs) = step (k . (x:)) xs where step = case x of StartElement t' _ - loop (t':tts) EndElement t' | t' == t - loop ts | otherwise - \_ _ - (Left InvalidXML, xxs) _ - loop tts --- fin. -- Live well, ~wren ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] space leak when repeatedly calling Control.Monad.State.Strict.modify
Have you tried to compile your code with optimisations? I guess GHC's strictness analysis would find strict evaluation is better here. 2012/1/30 Joey Hess j...@kitenet.net Claude Heiland-Allen wrote: Control.Monad.State.Strict is strict in the actions, but the state itself is still lazy, so you end up building a huge thunk in the state containing all the updates that ever took place to the initial state. Using this should fix it: modify' :: MonadState s m = (s - s) - m () modify' f = do x - get put $! f x -- force the new state when storing it Thanks! So, why does Control.Monad.State.Strict.modify not do that? And, I still don't quite understand why this only happened when the updated value is obtained using IO. -- see shy jo ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] space leak when repeatedly calling Control.Monad.State.Strict.modify
Yves Parès wrote: Have you tried to compile your code with optimisations? I guess GHC's strictness analysis would find strict evaluation is better here. The original code I saw this happen to the wild was built with -O2. I didn't try building the test case with optimisations. -- see shy jo signature.asc Description: Digital signature ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
[Haskell-cafe] space leak when repeatedly calling Control.Monad.State.Strict.modify
The attached test case quickly chews up hundreds of MB of memory. If modified to call work' instead, it runs in constant space. Somehow the value repeatedly read in from the file and stored in the state is leaking. Can anyone help me understand why? (ghc 7.0.4) -- see shy jo {-# LANGUAGE GeneralizedNewtypeDeriving, BangPatterns #-} module Main where import Control.Monad.State.Strict data MyState = MyState { val :: String } newtype Foo a = Foo { run :: StateT MyState IO a } deriving ( Monad, MonadState MyState, MonadIO ) main :: IO () main = evalStateT (run test) (MyState ) test :: Foo () test = mapM_ work [1..10]-- massive memory leak --test = mapM_ work' [1..10] -- constant space readSomeFile :: Foo String readSomeFile = liftIO $ readFileStrict /etc/passwd work :: Integer - Foo () work _ = do v - readSomeFile modify $ \s - s { val = v } work' :: Integer - Foo () work' n = do _ - readSomeFile modify $ \s - s { val = show n } readFileStrict :: FilePath - IO String readFileStrict file = do s - readFile file length s `seq` return s signature.asc Description: Digital signature ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] space leak when repeatedly calling Control.Monad.State.Strict.modify
Hi, On 30/01/12 01:07, Joey Hess wrote: The attached test case quickly chews up hundreds of MB of memory. If modified to call work' instead, it runs in constant space. Somehow the value repeatedly read in from the file and stored in the state is leaking. Can anyone help me understand why? Control.Monad.State.Strict is strict in the actions, but the state itself is still lazy, so you end up building a huge thunk in the state containing all the updates that ever took place to the initial state. Using this should fix it: modify' :: MonadState s m = (s - s) - m () modify' f = do x - get put $! f x -- force the new state when storing it With the attached code, the first case (using modify) prints out a trace like: test work:1 modify work:2 modify work:3 modify work:4 modify work:5 modify work:6 modify work:7 modify work:8 modify work:9 modify work:10 modify update:vnbz update:dzgd update:hzla update:nudd update:bzfl update:muht update:hims update:jakj update:lvrt update:qdxo initial MyState {val = vnbz} Notice how the state updates are only evaluated right at the end, when the value is forced - note also that this means that all the data needs to hang around until then. The second case (using modify') forces the state as it goes along: test' work:1 modify' update:zwre initial work:2 modify' update:fefg work:3 modify' update:eoqa work:4 modify' update:xtak work:5 modify' update:tekd work:6 modify' update:qrsz work:7 modify' update:fdgj work:8 modify' update:alwj work:9 modify' update:kqsp work:10 modify' update:lazz MyState {val = lazz} Claude {-# LANGUAGE GeneralizedNewtypeDeriving #-} module Main where import Debug.Trace import System.Random import Control.Monad.State.Strict modify' :: MonadState s m = (s - s) - m () modify' f = do x - get put $! f x data MyState = MyState { val :: String } deriving Show newtype Foo a = Foo { run :: StateT MyState IO a } deriving ( Monad, MonadState MyState, MonadIO ) main :: IO () main = do print = execStateT (run test) (trace initial $ MyState ) print = execStateT (run test') (trace initial $ MyState ) test :: Foo () test = trace test $ mapM_ work [1..10]-- massive memory leak test' :: Foo () test' = trace test' $ mapM_ work' [1..10] work :: Integer - Foo () work n = trace (work:++show n) $ do v - readSomeFile trace modify modify $ trace (update:++v) (\s - s { val = v }) work' :: Integer - Foo () work' n = trace (work:++show n) $ do v - readSomeFile trace modify' modify' $ trace (update:++v) (\s - s { val = v }) readSomeFile :: Foo String readSomeFile = liftIO $ replicateM 4 (randomRIO ('a', 'z')) ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] space leak when repeatedly calling Control.Monad.State.Strict.modify
Claude Heiland-Allen wrote: Control.Monad.State.Strict is strict in the actions, but the state itself is still lazy, so you end up building a huge thunk in the state containing all the updates that ever took place to the initial state. Using this should fix it: modify' :: MonadState s m = (s - s) - m () modify' f = do x - get put $! f x -- force the new state when storing it Thanks! So, why does Control.Monad.State.Strict.modify not do that? And, I still don't quite understand why this only happened when the updated value is obtained using IO. -- see shy jo signature.asc Description: Digital signature ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space leak with unsafePerformIO
On Sun, 27 Jun 2010, Henning Thielemann wrote: Maybe I can combine splitAtLazy and (++) to a function like splitAtAndAppend :: [x] - ([a] - [b]) - ([a] - [b]) - [a] - [b] but I'm afraid I will need pairs temporarily and then I run into the same problems. I have now implemented a solution that is tailored to special function types in the first ([a] - [b]) argument. It works, but it is sad that the general, more declarative solution is so fragile. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space leak with unsafePerformIO
Henning Thielemann wrote: Attached is a program with a space leak that I do not understand. I have coded a simple 'map' function, once using unsafePerformIO and once without. UnsafePerformIO has a space leak in some circumstances. In the main program I demonstrate cases with and without space leak. Without space leak the program writes a file to the disk until it is full. Any idea? The program relies on the GC doing short-cut evaluation of record selectors to avoid a space leak. If the user of the function splitAtLazy | splitAtLazy :: [b] - [a] - ([a],[a]) | splitAtLazy nt xt = |(\ ~(ys,zs) - (ys,zs)) $ |case (nt,xt) of | (_:ns, x:xs) - | let (ys,zs) = splitAtLazy ns xs | in (x:ys,zs) | (_, xs) - ([],xs) somehow holds on to a reference of the returned pair while processing the first part of the list, there will be a space leak, because that means that the whole prefix remains reachable. splitAtLazy itself is not leaky, because the value returned by the recursive call is scrutinized as follows, Main.$wsplitAtLazy = ... (# case ds_sLb of wild_B1 { (ys_agC, zs_agE) - ys_agC }, case ds_sLb of wild_B1 { (ys_agC, zs_agE) - zs_agE } #) and ghc turns that into record selector thunks in the code generator. The precise rule can be found in compiler/codeGen/StgCmmBind.hs: | Note [Selectors] | ~~~ | We look at the body of the closure to see if it's a selector---turgid, | but nothing deep. We are looking for a closure of {\em exactly} the | form: | | ... = [the_fv] \ u [] - | case the_fv of |con a_1 ... a_n - a_i Now let's look at how the result of splitAtLazy is used. non-leaky version (case 0): Main.lvl1 = case Main.$wsplitAtLazy @ () @ GHC.Types.Char Main.xs Main.xs1 of ww_sLv { (# ww1_sLx, ww2_sLy #) - Main.go (GHC.Base.++ @ GHC.Types.Char ww1_sLx ww2_sLy) } The return values are passed on to (++) directly. The result pair is actually never built at all, so no reference to it can be kept. leaky version (case 3): Main.ds = case Main.$wsplitAtLazy @ () @ GHC.Types.Char Main.xs Main.xs1 of ww_sKy { (# ww1_sKA, ww2_sKB #) - (ww1_sKA, ww2_sKB) } This builds the pair returned by splitAtLazy. Main.lvl1 = case Main.ds of wild_Xw { (prefix_aCf, suffix_aCh) - prefix_aCf } Use of prefix: it's a record selector. This is fine. Main.lvl2 = Main.go Main.lvl1 The prefix is then passed to some worker function. Main.lvl3 = case Main.ds of wild_Xw { (prefix_aCf, suffix_aCh) - Main.go1 suffix_aCh } Use of suffix: Due to the call of Main.go1 this is *not* a record selector. It is compiled to an actual case expression, which to the garbage collector looks just like an ordinary thunk. A reference to Main.ds is kept around until the suffix is about to be processed and a memory leak ensues. If the compiler had produced Main.lvl3 = case Main.ds of wild_Xw { (prefix_aCf, suffix_aCh) - suffix_aCh } Main.lvl4 = Main.go1 Main.lvl3 instead, then there would not be a leak. This whole record selector thunk business is very fragile. The good news is that the problem is completely unrelated to unsafePerformIO (the presence of unsafePerformIO makes optimisations more difficult, but any pure function of sufficient complexity would have the same effect). There's a simple fix for the problem, too: Change let (prefix, suffix) = makeTwoLists 'a' to let !(prefix, suffix) = makeTwoLists 'a' in which case the compiler produces code similar to the non-leaky case for all alternatives. HTH, Bertram ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space leak with unsafePerformIO
On Sun, 27 Jun 2010, Bertram Felgenhauer wrote: If the compiler had produced Main.lvl3 = case Main.ds of wild_Xw { (prefix_aCf, suffix_aCh) - suffix_aCh } Main.lvl4 = Main.go1 Main.lvl3 instead, then there would not be a leak. This whole record selector thunk business is very fragile. Bertram, thank you a lot for the detailed analysis! It seems that I have stepped into a nasty implementation detail of GHC. The good news is that the problem is completely unrelated to unsafePerformIO (the presence of unsafePerformIO makes optimisations more difficult, but any pure function of sufficient complexity would have the same effect). There's a simple fix for the problem, too: Change let (prefix, suffix) = makeTwoLists 'a' to let !(prefix, suffix) = makeTwoLists 'a' in which case the compiler produces code similar to the non-leaky case for all alternatives. Actually this fix works for my example program and another one that is based on chunky StorableVector. But when I switch off profiling the StorableVector example runs into a space leak, again, while the one with plain lists that I have sent here still works. I'll investigate into this, but it seems indeed very fragile and I wonder whether there is a more reliable way. Maybe I can combine splitAtLazy and (++) to a function like splitAtAndAppend :: [x] - ([a] - [b]) - ([a] - [b]) - [a] - [b] but I'm afraid I will need pairs temporarily and then I run into the same problems. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
[Haskell-cafe] Space leak with unsafePerformIO
Attached is a program with a space leak that I do not understand. I have coded a simple 'map' function, once using unsafePerformIO and once without. UnsafePerformIO has a space leak in some circumstances. In the main program I demonstrate cases with and without space leak. Without space leak the program writes a file to the disk until it is full. Any idea? The original problem is a function that is compiled by LLVM and shall be applied to a list in a mapAccumL manner. {- $ ghc --make -Wall -O -prof -auto-all InterleaveIO.hs $ InterleaveIO +RTS -M1m -prof-all -RTS -} import System.IO.Unsafe (unsafePerformIO, unsafeInterleaveIO, ) makeSuccUnsafe1 :: IO ([Char] - [Char]) makeSuccUnsafe1 = return $ \ sig - unsafePerformIO $ do let go xt = unsafeInterleaveIO $ case xt of [] - return [] x:xs - fmap (succ x :) $ go xs go sig makeSuccUnsafe :: IO ([Char] - [Char]) makeSuccUnsafe = return $ \ sig - let go xt = unsafePerformIO $ case xt of [] - return [] x:xs - return (succ x : go xs) in go sig makeSuccPlain :: IO ([Char] - [Char]) makeSuccPlain = return $ \ sig - let go xt = case xt of [] - [] x:xs - succ x : go xs in go sig splitAtLazy :: [b] - [a] - ([a],[a]) splitAtLazy nt xt = (\ ~(ys,zs) - (ys,zs)) $ case (nt,xt) of (_:ns, x:xs) - let (ys,zs) = splitAtLazy ns xs in (x:ys,zs) (_, xs) - ([],xs) makeTwoLists :: Char - ([Char], [Char]) makeTwoLists c = splitAtLazy (repeat ()) $ repeat c main :: IO () main = do succUnsafe - makeSuccUnsafe succPlain - makeSuccPlain writeFile test.txt $ let (prefix, suffix) = makeTwoLists 'a' in case 3::Int of -- no leak 0 - succUnsafe $ prefix ++ suffix -- no leak 1 - succPlain $ prefix ++ suffix -- no leak 2 - succPlain prefix ++ succPlain suffix -- leak 3 - succUnsafe prefix ++ succPlain suffix -- no leak 4 - succPlain prefix ++ succUnsafe suffix -- leak 5 - succUnsafe prefix ++ succUnsafe suffix _ - error not implemented ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space leak
Arnoldo Muller arnoldomul...@gmail.com writes: I am trying to use haskell in the analysis of bio data. One of the main reasons I wanted to use haskell is because lazy I/O allows you to see a large bio-sequence as if it was a string in memory. Funny you should mention it. I've written a bioinformatics library¹ that (naturally) supports reading and writing various file formats for sequences and alignments and stuff. Some of these files can be substantial in size (i.e., larger than my laptop's memory), so most IO of potentially large files (Fasta, BLAST XMl output, 454 SFF files...) are read lazily, and large Fasta sequences are read as lazy bytestrings. This works nicely for a lot of use cases (well, my use cases, at any rate, wich quite often boils down to streaming through the data). One thing to look out for is O(n) indexed access to lazy bytestrings, so there's a defragment operation that converts a sequence to a single chunk (which gives O(1) access, but of course must fit into memory). I guess the most annoying thing about laziness is that small test cases always work, you need Real Data to stress test your programs for excessive memory use. Lazy IO always worked well for me, so althouhg I feel I should look more deeply into real solutions, like Iteratee, my half-hearted attemts to do so have only resulted in the conclusion that it was more complicated, and thus postponed for some rainy day... lazy IO for lazy programmers, I guess. -k ¹ Stuff's on Hackage in the bioinformatics section and also on http://blog.malde.org and http//malde.org/~ketil/bioinformatics. -- If I haven't seen further, it is by standing in the footprints of giants ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space leak
On Thu, Mar 11, 2010 at 3:44 PM, Arnoldo Muller arnoldomul...@gmail.comwrote: Daniel, Thank you so much for helping me out with this issue! Thanks to all the other answers from haskel-cafe members too! As a newbie, I am not able to understand why zip and map would make a problem... Is there any link I could read that could help me to understand why in this case zip and map created a leak? What are some function compositions that should be avoided when doing lazy I/O? Actually, it's lazy I/O itself that should be avoided. Jason ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space leak
On Wed, Mar 10, 2010 at 2:03 PM, Arnoldo Muller arnoldomul...@gmail.comwrote: Hello Justin, I tried and what I saw was a constant increase in memory usage. Any particular profiling option that you would use? A great place to get started with profiling is the chapter in Real-World Haskell: http://book.realworldhaskell.org/read/profiling-and-optimization.html For a problem like this I would look at general heap profiling (-hc), retainer profiling (-hr), and also type profiling (-hy) to see if any of them provide new insight. For example, -hc might tell you which functions are problematic, but -hr is more likely to help you there. Sometimes the specific type that is leaking is not what you think and that's why -hy is nice. I hope that helps, Jason ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space leak
Jason, I am trying to use haskell in the analysis of bio data. One of the main reasons I wanted to use haskell is because lazy I/O allows you to see a large bio-sequence as if it was a string in memory. In order to achieve the same result in an imperative language I would have to write lots of error-prone iterators. I saw lazy I/O as a very strong point in favor of Haskell. Besides the space leaks that can occur and that are a bit difficult to find for a newbie like me, are there any other reasons to avoid Lazy I/O? Arnoldo. On Sat, Mar 13, 2010 at 6:46 PM, Jason Dagit da...@codersbase.com wrote: On Thu, Mar 11, 2010 at 3:44 PM, Arnoldo Muller arnoldomul...@gmail.comwrote: Daniel, Thank you so much for helping me out with this issue! Thanks to all the other answers from haskel-cafe members too! As a newbie, I am not able to understand why zip and map would make a problem... Is there any link I could read that could help me to understand why in this case zip and map created a leak? What are some function compositions that should be avoided when doing lazy I/O? Actually, it's lazy I/O itself that should be avoided. Jason ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space leak
On Sat, Mar 13, 2010 at 3:58 PM, Arnoldo Muller arnoldomul...@gmail.comwrote: Jason, I am trying to use haskell in the analysis of bio data. One of the main reasons I wanted to use haskell is because lazy I/O allows you to see a large bio-sequence as if it was a string in memory. In order to achieve the same result in an imperative language I would have to write lots of error-prone iterators. I saw lazy I/O as a very strong point in favor of Haskell. There's a safer lazy IO lib in Hackage: http://hackage.haskell.org/package/safe-lazy-io It seems the safer approach, though somewhat more confusing to some people, is the Iteratee pattern. The reasons why have probably been explained best on a paper on Oleg's site. Besides the space leaks that can occur and that are a bit difficult to find for a newbie like me, are there any other reasons to avoid Lazy I/O? Perhaps these two links will enlighten you. They did for me, and I'm now working out how exactly to convert a really inefficient but explicit IO program (char by char right now... yuck) to an Iteratee based parsing situation on a work-related project. Hopefully I'll be doing this all this coming week, and I'll be able to publish some results on my blog. (things come up though a lot at work, so I'm keeping my fingers crossed on this one). http://okmij.org/ftp/Haskell/Iteratee/Lazy-vs-correct.txt http://okmij.org/ftp/Streams.html Dave Arnoldo. On Sat, Mar 13, 2010 at 6:46 PM, Jason Dagit da...@codersbase.com wrote: On Thu, Mar 11, 2010 at 3:44 PM, Arnoldo Muller arnoldomul...@gmail.comwrote: Daniel, Thank you so much for helping me out with this issue! Thanks to all the other answers from haskel-cafe members too! As a newbie, I am not able to understand why zip and map would make a problem... Is there any link I could read that could help me to understand why in this case zip and map created a leak? What are some function compositions that should be avoided when doing lazy I/O? Actually, it's lazy I/O itself that should be avoided. Jason ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space leak
On Sat, Mar 13, 2010 at 8:58 PM, Arnoldo Muller arnoldomul...@gmail.com wrote: Jason, I am trying to use haskell in the analysis of bio data. One of the main reasons I wanted to use haskell is because lazy I/O allows you to see a large bio-sequence as if it was a string in memory. In order to achieve the same result in an imperative language I would have to write lots of error-prone iterators. I saw lazy I/O as a very strong point in favor of Haskell. What about mmap function? It's available on Linux, you can use it on C and probably a lot of other imperative languages. Was the non-portability factor the issue with using it? ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space leak
On Mar 13, 2010, at 18:58 , Arnoldo Muller wrote: In order to achieve the same result in an imperative language I would have to write lots of error-prone iterators. I saw lazy I/O as a very strong point in favor of Haskell. Besides the space leaks that can occur and that are a bit difficult to find for a newbie like me, are there any other reasons to avoid Lazy I/O? The biggest problem is that it is completely impossible to detect, much less recover from, lazy I/O errors. (You could theoretically force the result under control of evaluate, thus putting it back in IO, but then you lose all the laziness you wanted. Exceptions, in particular I/O exceptions, are by definition impure; so pure code can neither recognize nor deal with them.) -- brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allb...@kf8nh.com system administrator [openafs,heimdal,too many hats] allb...@ece.cmu.edu electrical and computer engineering, carnegie mellon universityKF8NH PGP.sig Description: This is a digitally signed message part ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space leak
Am Sonntag 14 März 2010 00:58:09 schrieb Arnoldo Muller: Jason, I am trying to use haskell in the analysis of bio data. One of the main reasons I wanted to use haskell is because lazy I/O allows you to see a large bio-sequence as if it was a string in memory. In order to achieve the same result in an imperative language I would have to write lots of error-prone iterators. I saw lazy I/O as a very strong point in favor of Haskell. Besides the space leaks that can occur and that are a bit difficult to find for a newbie like me, are there any other reasons to avoid Lazy I/O? You may be happy to hear that the space leak you encountered had __nothing whatsoever__ to do with lazy IO. It's true that lazy IO offers some pitfalls for the unwary (and some, but much fewer, for the wary), but I think the dangers of lazy IO tend to be exaggerated. For your application, readFile and appendFile are absolutely fine, the space leak occurred in the pure code. Below is a variant of your programme that doesn't use file-IO, the one readFasta function has the space leak, the currently commented-out one not. Compile with -O2, run with e.g. ./leak +RTS -s -M400M -RTS 3 1000 10 one runs in constant space, the other not. Arnoldo. -- {-# LANGUAGE BangPatterns #-} module Main (main) where import Data.List import System.Environment (getArgs) data Chromosome = C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10 | C11 | C12 | C13 | C14 | C15 | C16 | C17 | C18 | C19 | CX | CY | CMT deriving (Show) type Sequence = [Char] data Window = Window { sequen :: Sequence, chrom :: Chromosome, pos :: Int } instance Show Window where show w = (sequen w) ++ \t ++ show (chrom w) ++ \t ++ show (pos w) main = do [ct, len, windowSize] - getArgs let wSize = (read windowSize)::Int ln = read len inData = [(cn,ln) | cn - [1 .. read ct]] mapM_ (uncurry $ genomeExecute filterWindow wSize) inData countLines :: String - Int countLines = go 0 where go !acc [] = acc go !acc ('\n':cs) = go (acc+1) cs go !acc (_:cs ) = go acc cs genomeExecute :: (Window - Bool) - Int - Int - Int - IO () genomeExecute flt wSize cn ln = print . countLines $ fastaExtractor (chromosome ++ show cn ++ ,\n ++ replicate (cn*ln) 'A') wSize flt fastaExtractor :: String - Int - (Window - Bool) - String fastaExtractor input wSize f = printWindowList $ filter f $ readFasta wSize input filterWindow :: Window - Bool filterWindow w = not (elem 'N' (sequen w)) printWindowList :: [Window] - String printWindowList l = unlines $ map show l {- readFasta :: Int - [Char] - [Window] readFasta windowSize sequence = let (header,rest) = span (/= '\n') sequence chr = parseChromosome header go i (w:ws) = Window w chr i : go (i+1) ws go _ [] = [] in go 0 $ slideWindow windowSize $ filter (/= '\n') rest -} readFasta :: Int - [Char] - [Window] readFasta windowSize sequence = let (header,rest) = span (/= '\n') sequence chr = parseChromosome header in map (\(i, w) - Window w chr i) $ zip [0..] $ slideWindow windowSize $ filter ( '\n' /= ) rest slideWindow :: Int - [Char] - [[Char]] slideWindow _ [] = [] slideWindow windowSize l@(_:xs) = take windowSize l : slideWindow windowSize xs parseChromosome :: [Char] - Chromosome parseChromosome line | isInfixOf chromosome 1, line = C1 | isInfixOf chromosome 2, line = C2 | isInfixOf chromosome 3, line = C3 | isInfixOf chromosome 4, line = C4 | isInfixOf chromosome 5, line = C5 | isInfixOf chromosome 6, line = C6 | isInfixOf chromosome 7, line = C7 | isInfixOf chromosome 8, line = C9 | isInfixOf chromosome 10, line = C10 | isInfixOf chromosome 11, line = C11 | isInfixOf chromosome 12, line = C12 | isInfixOf chromosome 13, line = C13 | isInfixOf chromosome 14, line = C14 | isInfixOf chromosome 15, line = C15 | isInfixOf chromosome 16, line = C16 | isInfixOf chromosome 17 line = C17 | isInfixOf chromosome 18 line = C18 | isInfixOf chromosome 19 line = C19 | isInfixOf chromosome X line = CX | isInfixOf chromosome Y line = CY | isInfixOf mitochondrion line = CMT | otherwise = error BAD header -- ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org
Re: [Haskell-cafe] Space leak
Hi Arnoldo This doesn't address the space leak, but your parseChromosome function looks very inefficient - isInfixOf is repeatedly checking the prefix chromosome for C1 to CY. If you have a lot of CY's in a file then it will do a lot of work parsing them. The cleanest way of handling this would be to use a parser combinator library with keywords for chromosome and mitochondrion - however that might add a performance penalty itself. Here is a version that should be fairly efficient although a little ugly due to how it has to match literal chars in prefix of the string: Add a import for Data.Char to the import list: import Data.Char Add Enum to the deriving clause of the Chromosome data type: | C19 | CX | CY | CMT deriving (Show,Enum) Replace parseChromosome with the one below. Note that the derived Enum functions for Chromosome are indexed from 0.. whereas when you read one from the file it is indexed from 1.. so you have to sub1 before using toEnum: sub1 :: Int - Int sub1 x = x-1 parseChromosome :: [Char] - Chromosome parseChromosome ('c':'h':'r':'o':'m':'o':'s':'o':'m':'e':' ':xs) = chro xs where chro ('X' :_) = CX chro ('Y' :_) = CY chro ( x : ',' :_)| isDigit x = toEnum (sub1 $ digitToInt x) chro ('1' : x : ',' :_ ) | isDigit x = toEnum (sub1 $ (10+) $ digitToInt x) chro ('1' : x :_ ) | isDigit x = toEnum (sub1 $ (10+) $ digitToInt x) chro _ = error BAD header parseChromosome ('m':'i':'t':'o':'c':'h':'o':'n':'d':'r':'i':'o':'n':_) = CMT parseChromosome _ = error BAD header Best wishes Stephen ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space leak
Am Donnerstag 11 März 2010 00:24:28 schrieb Daniel Fischer: Hmm, offhand, I don't see why that isn't strict enough. Turns out, mapM_ was a red herring. The villain was (zip and map). I must confess, I don't know why it sort-of worked without the mapM_, though. sort-of, because that also hung on to unnecessarily much memory, the space leak was just smaller than with the mapM_. A very small change that eliminates the space leak, is readFasta :: Int - [Char] - [Window] readFasta windowSize sequence = -- get the header let (header,rest) = span (/= '\n') sequence chr = parseChromosome header go i (w:ws) = Window w chr i : go (i+1) ws go _ [] = [] in go 0 $ slideWindow windowSize $ filter (/= '\n') rest You can improve performance by eliminating slideWindow and the intermediate Window list (merging fastaExtractor and readFasta), {-# LANGUAGE BangPatterns #-} readFasta2 :: (String - Bool) - Int - String readFasta2 test windowSize sequence = let (header,rest) = span (/= '\n') sequence chr = parseChromosome header schr = show chr go !i st@(_:tl) | test w= w ++ '\t' : schr ++ '\t' : show i ++ '\n' : go (i+1) tl | otherwise = go (i+1) tl where w = take windowSize st go _ [] = [] in go 0 (filter (/= '\n')) rest ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space leak
Daniel, Thank you so much for helping me out with this issue! Thanks to all the other answers from haskel-cafe members too! As a newbie, I am not able to understand why zip and map would make a problem... Is there any link I could read that could help me to understand why in this case zip and map created a leak? What are some function compositions that should be avoided when doing lazy I/O? Regards, Arnoldo On Thu, Mar 11, 2010 at 11:46 PM, Daniel Fischer daniel.is.fisc...@web.dewrote: Am Donnerstag 11 März 2010 00:24:28 schrieb Daniel Fischer: Hmm, offhand, I don't see why that isn't strict enough. Turns out, mapM_ was a red herring. The villain was (zip and map). I must confess, I don't know why it sort-of worked without the mapM_, though. sort-of, because that also hung on to unnecessarily much memory, the space leak was just smaller than with the mapM_. A very small change that eliminates the space leak, is readFasta :: Int - [Char] - [Window] readFasta windowSize sequence = -- get the header let (header,rest) = span (/= '\n') sequence chr = parseChromosome header go i (w:ws) = Window w chr i : go (i+1) ws go _ [] = [] in go 0 $ slideWindow windowSize $ filter (/= '\n') rest You can improve performance by eliminating slideWindow and the intermediate Window list (merging fastaExtractor and readFasta), {-# LANGUAGE BangPatterns #-} readFasta2 :: (String - Bool) - Int - String readFasta2 test windowSize sequence = let (header,rest) = span (/= '\n') sequence chr = parseChromosome header schr = show chr go !i st@(_:tl) | test w= w ++ '\t' : schr ++ '\t' : show i ++ '\n' : go (i+1) tl | otherwise = go (i+1) tl where w = take windowSize st go _ [] = [] in go 0 (filter (/= '\n')) rest ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
[Haskell-cafe] Space leak
Hello, I am learning haskell and I found a space leak that I find difficult to solve. I've been asking at #haskell but we could not solve the issue. I want to lazily read a set of 22 files of about 200MB each, filter them and then I want to output the result into a unique file. If I modify the main function to work only with one input file, the program runs without issues. I will call this version A. Version B uses a mapM_ to iterate over a list of filenames and uses appendFile to output the result of filtering each file. In this case the memory usage grows sharply and quickly (profiles show constant memory growth). In less than a minute, memory occupation will make my system hang with swapping. This is version B: --- Program B import Data.List import System.Environment import System.Directory import Control.Monad -- different types of chromosomes data Chromosome =C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10 | C11 | C12 | C13 | C14 | C15 | C16 | C17 | C18 | C19 | CX | CY | CMT deriving (Show) -- define a window type Sequence = [Char] -- Window data data Window = Window { sequen :: Sequence, chrom :: Chromosome, pos :: Int } -- print a window instance Show Window where show w = (sequen w) ++ \t ++ show (chrom w) ++ \t ++ show (pos w) -- Reading fasta files with haskell -- Initialize the main = do -- get the arguments (intput is [input, output, windowSize] - getArgs -- get directory contents (only names) names - getDirectoryContents input -- prepend directory let fullNames = filter isFastaFile $ map (\x - input ++ / ++ x) names let wSize = (read windowSize)::Int -- process the directories mapM (genomeExecute output wSize filterWindow) fullNames -- read the files one by one and write them to the output file genomeExecute :: String - Int - (Window - Bool) - String - IO () genomeExecute outputFile windowSize f inputFile = do fileData - readFile inputFile appendFile outputFile $ fastaExtractor fileData windowSize f -- isFastaFile :: String - Bool isFastaFile fileName = isSuffixOf .fa fileName -- fasta extractor (receives a Fasta String and returns a windowed string ready to be sorted) -- an example on how to compose several functions to parse a fasta file fastaExtractor :: String - Int - (Window - Bool) - String fastaExtractor input wSize f = printWindowList $ filter f $ readFasta wSize input -- MAIN FILTER that removes N elements from the strings! filterWindow :: Window - Bool filterWindow w = not (elem 'N' (sequen w)) -- print a window list (the printing makes it ready for output as raw data) printWindowList :: [Window] - String printWindowList l = unlines $ map show l -- read fasta, remove stuff that is not useful from it -- removes the readFasta :: Int - [Char] - [Window] readFasta windowSize sequence = -- get the header let (header:rest) = lines sequence chr = parseChromosome header in -- We now do the following: -- take window create counter remove newlines map (\(i, w) - Window w chr i) $ zip [0..] $ slideWindow windowSize $ filter ( '\n' /= ) $ unlines rest slideWindow :: Int - [Char] - [[Char]] slideWindow _ [] = [] slideWindow windowSize l@(_:xs) = take windowSize l : slideWindow windowSize xs -- Parse the chromosome from a fasta comment -- produce a more compact chromosome representation parseChromosome :: [Char] - Chromosome parseChromosome line | isInfixOf chromosome 1, line = C1 | isInfixOf chromosome 2, line = C2 | isInfixOf chromosome 3, line = C3 | isInfixOf chromosome 4, line = C4 | isInfixOf chromosome 5, line = C5 | isInfixOf chromosome 6, line = C6 | isInfixOf chromosome 7, line = C7 | isInfixOf chromosome 8, line = C9 | isInfixOf chromosome 10, line = C10 | isInfixOf chromosome 11, line = C11 | isInfixOf chromosome 12, line = C12 | isInfixOf chromosome 13, line = C13 | isInfixOf chromosome 14, line = C14 | isInfixOf chromosome 15, line = C15 | isInfixOf chromosome 16, line = C16 | isInfixOf chromosome 17 line = C17 | isInfixOf chromosome 18 line = C18 | isInfixOf chromosome 19 line = C19 | isInfixOf chromosome X line = CX | isInfixOf chromosome Y line = CY | isInfixOf mitochondrion line = CMT | otherwise = error BAD header End of program B
Re: [Haskell-cafe] Space leak
Hello Arnoldo, Wednesday, March 10, 2010, 11:45:56 PM, you wrote: I am learning haskell and I found a space leak that I find difficult to solve. I've been asking at #haskell but we could not solve the issue. make some experiments - leave only one file and use version A, then replace appendFile with writeFile -- Best regards, Bulatmailto:bulat.zigans...@gmail.com ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space leak
Am Mittwoch 10 März 2010 21:45:56 schrieb Arnoldo Muller: Hello, I am learning haskell and I found a space leak that I find difficult to solve. I've been asking at #haskell but we could not solve the issue. I want to lazily read a set of 22 files of about 200MB each, filter them and then I want to output the result into a unique file. If I modify the main function to work only with one input file, the program runs without issues. I will call this version A. Version B uses a mapM_ to iterate over a list of filenames and uses appendFile to output the result of filtering each file. In this case the memory usage grows sharply and quickly (profiles show constant memory growth). In less than a minute, memory occupation will make my system hang with swapping. No work is been done until the end, when all is tried to be done simultaneously. Make sure genomeExecute ... input1 has actually finished its work before genomeExecute ... input2 starts etc. One way is to use a stricter version of sequence_, sequence'_ :: Monad m = [m a] - m () sequence'_ (x:xs) = do a - x a `seq` sequence'_ xs sequence'_ [] = return () (nicer with BangPatterns, but not portable), and mapM'_ f = sequence'_ . map f Another option is making genomeExecute itself stricter. This is version B: --- Program B import Data.List import System.Environment import System.Directory import Control.Monad -- different types of chromosomes data Chromosome =C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10 | C11 | C12 | C13 | C14 | C15 | C16 | C17 | C18 | C19 | CX | CY | CMT deriving (Show) -- define a window type Sequence = [Char] -- Window data data Window = Window { sequen :: Sequence, chrom :: Chromosome, pos :: Int } -- print a window instance Show Window where show w = (sequen w) ++ \t ++ show (chrom w) ++ \t ++ show (pos w) -- Reading fasta files with haskell -- Initialize the main = do -- get the arguments (intput is [input, output, windowSize] - getArgs -- get directory contents (only names) names - getDirectoryContents input -- prepend directory let fullNames = filter isFastaFile $ map (\x - input ++ / ++ x) names let wSize = (read windowSize)::Int -- process the directories mapM (genomeExecute output wSize filterWindow) fullNames -- read the files one by one and write them to the output file genomeExecute :: String - Int - (Window - Bool) - String - IO () genomeExecute outputFile windowSize f inputFile = do fileData - readFile inputFile appendFile outputFile $ fastaExtractor fileData windowSize f ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space leak
Hello Daniel: Thanks! I employed mapM'_ but I am still getting the space leak. Any other hint? Arnoldo On Wed, Mar 10, 2010 at 10:40 PM, Daniel Fischer daniel.is.fisc...@web.dewrote: Am Mittwoch 10 März 2010 21:45:56 schrieb Arnoldo Muller: Hello, I am learning haskell and I found a space leak that I find difficult to solve. I've been asking at #haskell but we could not solve the issue. I want to lazily read a set of 22 files of about 200MB each, filter them and then I want to output the result into a unique file. If I modify the main function to work only with one input file, the program runs without issues. I will call this version A. Version B uses a mapM_ to iterate over a list of filenames and uses appendFile to output the result of filtering each file. In this case the memory usage grows sharply and quickly (profiles show constant memory growth). In less than a minute, memory occupation will make my system hang with swapping. No work is been done until the end, when all is tried to be done simultaneously. Make sure genomeExecute ... input1 has actually finished its work before genomeExecute ... input2 starts etc. One way is to use a stricter version of sequence_, sequence'_ :: Monad m = [m a] - m () sequence'_ (x:xs) = do a - x a `seq` sequence'_ xs sequence'_ [] = return () (nicer with BangPatterns, but not portable), and mapM'_ f = sequence'_ . map f Another option is making genomeExecute itself stricter. This is version B: --- Program B import Data.List import System.Environment import System.Directory import Control.Monad -- different types of chromosomes data Chromosome =C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10 | C11 | C12 | C13 | C14 | C15 | C16 | C17 | C18 | C19 | CX | CY | CMT deriving (Show) -- define a window type Sequence = [Char] -- Window data data Window = Window { sequen :: Sequence, chrom :: Chromosome, pos :: Int } -- print a window instance Show Window where show w = (sequen w) ++ \t ++ show (chrom w) ++ \t ++ show (pos w) -- Reading fasta files with haskell -- Initialize the main = do -- get the arguments (intput is [input, output, windowSize] - getArgs -- get directory contents (only names) names - getDirectoryContents input -- prepend directory let fullNames = filter isFastaFile $ map (\x - input ++ / ++ x) names let wSize = (read windowSize)::Int -- process the directories mapM (genomeExecute output wSize filterWindow) fullNames -- read the files one by one and write them to the output file genomeExecute :: String - Int - (Window - Bool) - String - IO () genomeExecute outputFile windowSize f inputFile = do fileData - readFile inputFile appendFile outputFile $ fastaExtractor fileData windowSize f ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space leak
Hello Bulat, I ran program A with writeFile instead of appendFile and it still works without problems. Regarding program B, if I use writeFile the leaking still occurs. Any other hints? :) Arnoldo On Wed, Mar 10, 2010 at 10:32 PM, Bulat Ziganshin bulat.zigans...@gmail.com wrote: Hello Arnoldo, Wednesday, March 10, 2010, 11:45:56 PM, you wrote: I am learning haskell and I found a space leak that I find difficult to solve. I've been asking at #haskell but we could not solve the issue. make some experiments - leave only one file and use version A, then replace appendFile with writeFile -- Best regards, Bulatmailto:bulat.zigans...@gmail.com ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space leak
Hello Justin, I tried and what I saw was a constant increase in memory usage. Any particular profiling option that you would use? I do remember that there was a particular option in which the leak would dissapear (for the same amount of work) and that is why I stopped using the profiler. Thanks, Arnoldo On Wed, Mar 10, 2010 at 10:20 PM, Justin Bailey jgbai...@gmail.com wrote: Have you use the profiling tools available with GHC? http://haskell.org/ghc/docs/latest/html/users_guide/profiling.html On Wed, Mar 10, 2010 at 12:45 PM, Arnoldo Muller arnoldomul...@gmail.com wrote: Hello, I am learning haskell and I found a space leak that I find difficult to solve. I've been asking at #haskell but we could not solve the issue. I want to lazily read a set of 22 files of about 200MB each, filter them and then I want to output the result into a unique file. If I modify the main function to work only with one input file, the program runs without issues. I will call this version A. Version B uses a mapM_ to iterate over a list of filenames and uses appendFile to output the result of filtering each file. In this case the memory usage grows sharply and quickly (profiles show constant memory growth). In less than a minute, memory occupation will make my system hang with swapping. This is version B: --- Program B import Data.List import System.Environment import System.Directory import Control.Monad -- different types of chromosomes data Chromosome =C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10 | C11 | C12 | C13 | C14 | C15 | C16 | C17 | C18 | C19 | CX | CY | CMT deriving (Show) -- define a window type Sequence = [Char] -- Window data data Window = Window { sequen :: Sequence, chrom :: Chromosome, pos :: Int } -- print a window instance Show Window where show w = (sequen w) ++ \t ++ show (chrom w) ++ \t ++ show (pos w) -- Reading fasta files with haskell -- Initialize the main = do -- get the arguments (intput is [input, output, windowSize] - getArgs -- get directory contents (only names) names - getDirectoryContents input -- prepend directory let fullNames = filter isFastaFile $ map (\x - input ++ / ++ x) names let wSize = (read windowSize)::Int -- process the directories mapM (genomeExecute output wSize filterWindow) fullNames -- read the files one by one and write them to the output file genomeExecute :: String - Int - (Window - Bool) - String - IO () genomeExecute outputFile windowSize f inputFile = do fileData - readFile inputFile appendFile outputFile $ fastaExtractor fileData windowSize f -- isFastaFile :: String - Bool isFastaFile fileName = isSuffixOf .fa fileName -- fasta extractor (receives a Fasta String and returns a windowed string ready to be sorted) -- an example on how to compose several functions to parse a fasta file fastaExtractor :: String - Int - (Window - Bool) - String fastaExtractor input wSize f = printWindowList $ filter f $ readFasta wSize input -- MAIN FILTER that removes N elements from the strings! filterWindow :: Window - Bool filterWindow w = not (elem 'N' (sequen w)) -- print a window list (the printing makes it ready for output as raw data) printWindowList :: [Window] - String printWindowList l = unlines $ map show l -- read fasta, remove stuff that is not useful from it -- removes the readFasta :: Int - [Char] - [Window] readFasta windowSize sequence = -- get the header let (header:rest) = lines sequence chr = parseChromosome header in -- We now do the following: -- take window create counter remove newlines map (\(i, w) - Window w chr i) $ zip [0..] $ slideWindow windowSize $ filter ( '\n' /= ) $ unlines rest slideWindow :: Int - [Char] - [[Char]] slideWindow _ [] = [] slideWindow windowSize l@(_:xs) = take windowSize l : slideWindow windowSize xs -- Parse the chromosome from a fasta comment -- produce a more compact chromosome representation parseChromosome :: [Char] - Chromosome parseChromosome line | isInfixOf chromosome 1, line = C1 | isInfixOf chromosome 2, line = C2 | isInfixOf chromosome 3, line
Re: [Haskell-cafe] Space leak
Hello Arnoldo, Wednesday, March 10, 2010, 11:45:56 PM, you wrote: I am learning haskell and I found a space leak that I find difficult to solve. I've been asking at #haskell but we could not solve the issue. what if you use program B on single file? -- Best regards, Bulatmailto:bulat.zigans...@gmail.com ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space leak
Bulat, The same happens, the memory starts to quickly fill up... Arnoldo On Wed, Mar 10, 2010 at 11:16 PM, Bulat Ziganshin bulat.zigans...@gmail.com wrote: Hello Arnoldo, Wednesday, March 10, 2010, 11:45:56 PM, you wrote: I am learning haskell and I found a space leak that I find difficult to solve. I've been asking at #haskell but we could not solve the issue. what if you use program B on single file? -- Best regards, Bulatmailto:bulat.zigans...@gmail.com ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space leak
Am Mittwoch 10 März 2010 23:01:28 schrieb Arnoldo Muller: Hello Daniel: Thanks! I employed mapM'_ but I am still getting the space leak. Any other hint? Hmm, offhand, I don't see why that isn't strict enough. With some datafiles, I could try to investigate. One question, how does programme C with main = do [input, output, windowSize] - getArgs let wSize = (read windowSize)::Int genomeExecute output wSize filterWindow input behave? Space leak or not? But yes, a few other hints I have (though they're not likely to squash the space leak). Generally, ByteString IO is often orders of magnitude faster than String IO and uses much less memory, so using (lazy) ByteStrings is worthy of consideration. Arnoldo -- define a window type Sequence = [Char] -- Window data data Window = Window { sequen :: Sequence, chrom :: Chromosome, pos :: Int } -- print a window instance Show Window where show w = (sequen w) ++ \t ++ show (chrom w) ++ \t ++ show (pos w) -- Reading fasta files with haskell -- Initialize the main = do -- get the arguments (intput is [input, output, windowSize] - getArgs -- get directory contents (only names) names - getDirectoryContents input -- prepend directory let fullNames = filter isFastaFile $ map (\x - input ++ / ++ x) names * let fullNames = map ((input ++) . (/ ++)) $ filter isFastaFile names saves a little work * let wSize = (read windowSize)::Int -- process the directories mapM (genomeExecute output wSize filterWindow) fullNames -- read the files one by one and write them to the output file genomeExecute :: String - Int - (Window - Bool) - String - IO () genomeExecute outputFile windowSize f inputFile = do fileData - readFile inputFile appendFile outputFile $ fastaExtractor fileData windowSize f * The arguments of fastaExtractor should be reversed, then genomeExecute outputFile windowSize f inputFile = appendFile outputFile . fastaExtractor' f windowSize = readFile inputFile * -- isFastaFile :: String - Bool isFastaFile fileName = isSuffixOf .fa fileName -- fasta extractor (receives a Fasta String and returns a windowed string ready to be sorted) -- an example on how to compose several functions to parse a fasta file fastaExtractor :: String - Int - (Window - Bool) - String fastaExtractor input wSize f = printWindowList $ filter f $ readFasta wSize input * fastaExtractor' f wSize = printWindowList . filter f . readFasta wSize * -- MAIN FILTER that removes N elements from the strings! filterWindow :: Window - Bool filterWindow w = not (elem 'N' (sequen w)) * filterWindow w = 'N' `notElem` sequen w * -- print a window list (the printing makes it ready for output as raw data) printWindowList :: [Window] - String printWindowList l = unlines $ map show l -- read fasta, remove stuff that is not useful from it -- removes the readFasta :: Int - [Char] - [Window] readFasta windowSize sequence = -- get the header let (header:rest) = lines sequence chr = parseChromosome header in -- We now do the following: -- take window create counter remove newlines map (\(i, w) - Window w chr i) $ zip [0..] $ slideWindow windowSize $ filter ( '\n' /= ) $ unlines rest * filter ('\n' /=) . unlines is odd. What about concat? Or readFasta wSize chrseq = case span (/= '\n') chrseq of (header, _:rest) - let chr = parseChromosome header in map (\(i,w) - Window w chr i) . zip [0 .. ] . slideWindow wSize $ filter (/= '\n') rest _ - [] if your input file format had no other newline than the one between header and body, that'd be nice. * slideWindow :: Int - [Char] - [[Char]] slideWindow _ [] = [] slideWindow windowSize l@(_:xs) = take windowSize l : slideWindow windowSize xs * slideWindow wSize = map (take wSize) . tails * ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space Leak with semi-implicit parallelization and the nasty Garbage collector
Hello Daniel, thanks for your fast response. That's strange: On your system total time elapsed according to GHC is ~190%, on mine (reproducible!) ~140% (see below). I once had a problem with a particular linux kernel[1], unfortunately I currently (over the holidays) have no other computers available. Does anyone also have an up-to-date ubuntu karmic 9.10 with my kernel version to confirm or refute the problem's dependency on the linux kernel? A bit irritated, but still in an Haskelllish christmas mood, Michael [1] http://www.mail-archive.com/haskell-cafe@haskell.org/msg67148.html time exa +RTS -N2 -ls -sstderr exa +RTS -N2 -ls -sstderr 1 3 0 0 9 3 6 9 1 8 8 9 2 5 5 8 6 5 7 8 4 72,748,227,888 bytes allocated in the heap 65,331,056 bytes copied during GC 183,032 bytes maximum residency (22 sample(s)) 209,352 bytes maximum slop 4 MB total memory in use (1 MB lost due to fragmentation) Generation 0: 131339 collections, 131338 parallel, 7.29s, 6.25s elapsed Generation 1:22 collections,22 parallel, 0.03s, 0.03s elapsed Parallel GC work balance: 1.10 (7778368 / 7059369, ideal 2) MUT time (elapsed) GC time (elapsed) Task 0 (worker) : 66.38s( 36.22s) 1.75s( 1.37s) Task 1 (worker) : 66.40s( 36.25s) 0.00s( 0.00s) Task 2 (worker) : 68.13s( 36.25s) 0.00s( 0.00s) Task 3 (worker) : 62.56s( 36.25s) 5.57s( 4.91s) SPARKS: 21 (21 converted, 0 pruned) INIT time0.00s ( 0.01s elapsed) MUT time 60.81s ( 36.25s elapsed) GCtime7.32s ( 6.28s elapsed) EXIT time0.00s ( 0.00s elapsed) Total time 66.40s ( 42.54s elapsed) %GC time 11.0% (14.8% elapsed) Alloc rate1,231,351,182 bytes per MUT second Productivity 89.0% of total user, 138.9% of total elapsed gc_alloc_block_sync: 61137 whitehole_spin: 0 gen[0].steps[0].sync_large_objects: 17681 gen[0].steps[1].sync_large_objects: 20647 gen[1].steps[0].sync_large_objects: 0 real0m42.626s user1m6.400s sys 0m0.510s 2009/12/24 Daniel Fischer daniel.is.fisc...@web.de: Am Donnerstag 24 Dezember 2009 02:14:51 schrieb Michael Lesniak: Hello haskell-cafe (and merry christmas!), I have a strange problem with the garbage collector / memory which I'm unable to find a solution for. I think the source of my problems has to do with lazy evaluation, but currently I'm unable to find it. Using the attached program and threadscope I see that the GC is using a lot of time and the program comes (more or less) to a halt (see exa-1.png). When I increase the heap the program takes much longer and the GC is running more or less all the time (see exa-2.png). Some more detailled information: * I can see the described behaviour under both GHC 10.4 and GHC 12.1 * Linux kernel 2.6.31-16 on a dualcore * Program compiled with ghc --make -O2 -threaded -eventlog Example.hs -o exa * Started with exa +RTS -ls and one of { -N, -N1, -N2 } Any help (pointing into the right direction, mention possibly helpful blog articles or paper, constructive critic in general) is appreciated! I can't reproduce that (ghc-6.12.1, Linux linux-mkk1 2.6.27.39-0.2-pae #1 SMP 2009-11-23 12:57:38 +0100 i686 i686 i386 GNU/Linux, dual core): $ time ./exa +RTS -ls -N2 -sstderr ./exa +RTS -ls -N2 -sstderr 1 3 0 0 9 3 6 9 1 8 8 9 2 5 5 8 6 5 7 8 4 72,499,126,908 bytes allocated in the heap 45,280,708 bytes copied during GC 177,504 bytes maximum residency (10 sample(s)) 183,844 bytes maximum slop 4 MB total memory in use (1 MB lost due to fragmentation) Generation 0: 131527 collections, 131526 parallel, 7.18s, 3.88s elapsed Generation 1: 10 collections, 10 parallel, 0.00s, 0.00s elapsed Parallel GC work balance: 1.16 (10931266 / 9433437, ideal 2) MUT time (elapsed) GC time (elapsed) Task 0 (worker) : 115.10s (126.56s) 3.34s ( 1.84s) Task 1 (worker) : 124.21s (126.56s) 3.84s ( 2.04s) Task 2 (worker) : 0.09s (126.56s) 0.00s ( 0.00s) Task 3 (worker) : 0.00s (126.56s) 0.00s ( 0.00s) SPARKS: 21 (21 converted, 0 pruned) INIT time 0.00s ( 0.13s elapsed) MUT time 238.05s (126.56s elapsed) GC time 7.19s ( 3.88s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 244.46s (130.57s elapsed) %GC time 2.9% (3.0% elapsed) Alloc rate 305,559,453 bytes per MUT second Productivity 97.1% of total user, 181.7% of total elapsed gc_alloc_block_sync: 151252 whitehole_spin: 0 gen[0].steps[0].sync_large_objects: 75620 gen[0].steps[1].sync_large_objects: 9662 gen[1].steps[0].sync_large_objects: 0 244.45user 2.06system 2:10.58elapsed 188%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+35736outputs (0major+2426minor)pagefaults 0swaps Garbage collection isn't even visible in the threadscope profile. With -N1: 71,999,280,108 bytes allocated in the heap 20,729,380 bytes
Re: [Haskell-cafe] Space Leak with semi-implicit parallelization and the nasty Garbage collector
Am Donnerstag 24 Dezember 2009 02:14:51 schrieb Michael Lesniak: Hello haskell-cafe (and merry christmas!), I have a strange problem with the garbage collector / memory which I'm unable to find a solution for. I think the source of my problems has to do with lazy evaluation, but currently I'm unable to find it. Using the attached program and threadscope I see that the GC is using a lot of time and the program comes (more or less) to a halt (see exa-1.png). When I increase the heap the program takes much longer and the GC is running more or less all the time (see exa-2.png). Some more detailled information: * I can see the described behaviour under both GHC 10.4 and GHC 12.1 * Linux kernel 2.6.31-16 on a dualcore * Program compiled with ghc --make -O2 -threaded -eventlog Example.hs -o exa * Started with exa +RTS -ls and one of { -N, -N1, -N2 } Any help (pointing into the right direction, mention possibly helpful blog articles or paper, constructive critic in general) is appreciated! I can't reproduce that (ghc-6.12.1, Linux linux-mkk1 2.6.27.39-0.2-pae #1 SMP 2009-11-23 12:57:38 +0100 i686 i686 i386 GNU/Linux, dual core): $ time ./exa +RTS -ls -N2 -sstderr ./exa +RTS -ls -N2 -sstderr 1 3 0 0 9 3 6 9 1 8 8 9 2 5 5 8 6 5 7 8 4 72,499,126,908 bytes allocated in the heap 45,280,708 bytes copied during GC 177,504 bytes maximum residency (10 sample(s)) 183,844 bytes maximum slop 4 MB total memory in use (1 MB lost due to fragmentation) Generation 0: 131527 collections, 131526 parallel, 7.18s, 3.88s elapsed Generation 1:10 collections,10 parallel, 0.00s, 0.00s elapsed Parallel GC work balance: 1.16 (10931266 / 9433437, ideal 2) MUT time (elapsed) GC time (elapsed) Task 0 (worker) : 115.10s(126.56s) 3.34s( 1.84s) Task 1 (worker) : 124.21s(126.56s) 3.84s( 2.04s) Task 2 (worker) :0.09s(126.56s) 0.00s( 0.00s) Task 3 (worker) :0.00s(126.56s) 0.00s( 0.00s) SPARKS: 21 (21 converted, 0 pruned) INIT time0.00s ( 0.13s elapsed) MUT time 238.05s (126.56s elapsed) GCtime7.19s ( 3.88s elapsed) EXIT time0.00s ( 0.00s elapsed) Total time 244.46s (130.57s elapsed) %GC time 2.9% (3.0% elapsed) Alloc rate305,559,453 bytes per MUT second Productivity 97.1% of total user, 181.7% of total elapsed gc_alloc_block_sync: 151252 whitehole_spin: 0 gen[0].steps[0].sync_large_objects: 75620 gen[0].steps[1].sync_large_objects: 9662 gen[1].steps[0].sync_large_objects: 0 244.45user 2.06system 2:10.58elapsed 188%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+35736outputs (0major+2426minor)pagefaults 0swaps Garbage collection isn't even visible in the threadscope profile. With -N1: 71,999,280,108 bytes allocated in the heap 20,729,380 bytes copied during GC 92,492 bytes maximum residency (2 sample(s)) 190,872 bytes maximum slop 3 MB total memory in use (0 MB lost due to fragmentation) Generation 0: 130901 collections, 0 parallel, 2.64s, 2.68s elapsed Generation 1: 2 collections, 0 parallel, 0.00s, 0.00s
[Haskell-cafe] space leak hints?
Good evening, all, I wonder if I could tap your collective wisdom regarding space leaks? I've been messing about with haskeem, my little scheme interpreter, and I decided to see if I could make it run reasonably space-efficiently. So far... no. Here's what I tried: I wrote a tiny scheme program to compute Collatz sequences for successive numbers, starting from 1 and incrementing forever (well, in principle). Because real scheme implementations are fully tail-call-optimized, this'll run in constant memory; I checked that with mzscheme, and it does indeed work. With my little interpreter, that's not the case: memory usage grows continually, although apparently less-than-linearly. I've built the interpreter with the profiling stuff described on the wiki and in RWH Ch 25 turned on and have made a few runs with that; I stuck the postscript plot that's the result of one of those runs onto my web site at http://www.korgwal.com/haskeem/run2.ps. The full source to the interpreter is a little large to paste into this message; it's available on my web site, and also on hackage. But according to the plot, there appear to be three main memory allocation issues, and they seem to all be related, if I'm reading stuff correctly. The core of the interpreter is a function, evalLisp, which evaluates scheme forms. There are of course a fair number of different forms, but the largest generic usage is evaluate each of a list of forms, returning the value of the last of them as the overall result. In order to express that in a couple of different places, and to accomodate the possibility of an empty list, I have a really tiny function lastOrNil which just calls last on a (haskell) list, checking for the possibility of an empty list, and returning a haskeem LispVal object: lastOrNil = liftM lON where lON [] = List [] lON l = last l (sorry, proportional fonting may be throwing this off). It's this function which is directly implicated in two of the top three memory pigs, and nearly directly in the third one as well. If I could eliminate the memory growth in these three cost centers, I would already capture over 90% of the growth in this benchmark, which would make me very happy indeed. But I don't really understand what is going on here. It seems entirely plausible, indeed likely, that the list which I'm traversing there is not fully evaluated. So I've tried adding 'seq' to this function. Uhhh... from memory, I dumped it after it didn't work, I had lastOrNil = liftM lON where lON [] = List [] lON (l:[]) = l lON (l:ls) = seq l (lON ls) (again, proportional fonting might mess me up here.) As I said, I dumped this after it made no difference whatsoever. I also tried bang-patterns in a couple of places, strictness annotation of my basic LispVal types... nothing. It all made exactly no difference, as far as I could tell. I've tried a couple of google searches, but haven't come up with anything better than what I've already described. So I'm a stumped chump! I'd be grateful for any suggestions... thanks in advance! Uwe ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
[Haskell-cafe] space leak with 'concat' ?
$ ghc +RTS -M16m -c30 -RTS -e 'concat $ repeat bla' This breaks down after a while, also if I increase the memory restriction: ... ablablablablablablablablablablablablablablablablablablablablablaHeap exhausted; Current maximum heap size is 15998976 bytes (15 Mb); use `+RTS -Msize' to increase it. Whereas this one works: $ ghc +RTS -M16m -c30 -RTS -e 'cycle bla' 'concat' seems to be the culprit. What's so bad about it? ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] space leak with 'concat' ?
On Tue, 2009-01-27 at 22:12 +0100, Henning Thielemann wrote: $ ghc +RTS -M16m -c30 -RTS -e 'concat $ repeat bla' This breaks down after a while, also if I increase the memory restriction: ... ablablablablablablablablablablablablablablablablablablablablablaHeap exhausted; Current maximum heap size is 15998976 bytes (15 Mb); use `+RTS -Msize' to increase it. Whereas this one works: $ ghc +RTS -M16m -c30 -RTS -e 'cycle bla' 'concat' seems to be the culprit. What's so bad about it? cycle builds a circular list: cycle xn = let ys = xn ++ ys in ys which concat isn't smart enough to do (obviously). That's not an issue normally, since the list is being consumed in a properly lazy fashion. But it looks like, for some reason, ghc -e is holding a pointer to the entire expression-to-evaluate in this case, so cons cells that have already been printed don't get garbage collected. To show that there's nothing wrong with concat per se, try this version instead: ghc +RTS -M16m -c30 -RTS -e 'print $ concat $ repeat bla' This should print forever without any problems. jcc ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] space leak with 'concat' ?
On Tue, 27 Jan 2009, Jonathan Cast wrote: To show that there's nothing wrong with concat per se, try this version instead: ghc +RTS -M16m -c30 -RTS -e 'print $ concat $ repeat bla' This should print forever without any problems. You are right, this works. My example was extracted from a larger module. But in that module I defined the text to be printed as top-level variable which might have been the problem. But this can't be the problem of the compiled version of the program, where I encountered the leak. So I have to keep on searching that leak. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] space leak with 'concat' ?
On Tue, 27 Jan 2009, Henning Thielemann wrote: On Tue, 27 Jan 2009, Jonathan Cast wrote: To show that there's nothing wrong with concat per se, try this version instead: ghc +RTS -M16m -c30 -RTS -e 'print $ concat $ repeat bla' This should print forever without any problems. You are right, this works. My example was extracted from a larger module. But in that module I defined the text to be printed as top-level variable which might have been the problem. But this can't be the problem of the compiled version of the program, where I encountered the leak. So I have to keep on searching that leak. Yes, the top-level variables seem to be the leak. I had to turn them into functions with dummy arguments _and_ attach INLINE pragmas to them in order to keep GHC away from buffering their content. Is there a less ugly way to achieve this? ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] space leak with 'concat' ?
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Henning Thielemann wrote: | in that module I defined the text to be printed as top-level | variable which might have been the problem. But this can't be the | problem of the compiled version of the program, where I encountered the | leak. So I have to keep on searching that leak. You have created a constant applicative form (commonly abbreviated CAF). GHC assumes that all top level declarations are constants, and simply does not garbage collect them. In the case of infinite structures, this can be a bad thing. This *does* affect even compiled code. The best way to avoid the problem, of course, is to avoid having infinite constants at the top level. Assuming that is impossible, your solution seems acceptable to me. Somebody more knowledgeable or creative than I may be able to come up with something nicer, though. - - Jake -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iEYEARECAAYFAkl/lz0ACgkQye5hVyvIUKl0JgCgx5ddBc0Y44+ghFakr7Mex1RP zfUAnjh9BDI5+A9tEnaox20DbXbipX33 =2MCw -END PGP SIGNATURE- ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] space leak with 'concat' ?
Note that only monomorphic declarations are CAFed. If you have an explicit polymorphic signature, it will be treated as a function and garbage-collected as usual. So if you have, e.g., a list of Doubles, declaring it as foo :: Num a = [a] would do the trick. Cheers, S. On Tue, Jan 27, 2009 at 6:22 PM, Jake McArthur j...@pikewerks.com wrote: -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Henning Thielemann wrote: | in that module I defined the text to be printed as top-level | variable which might have been the problem. But this can't be the | problem of the compiled version of the program, where I encountered the | leak. So I have to keep on searching that leak. You have created a constant applicative form (commonly abbreviated CAF). GHC assumes that all top level declarations are constants, and simply does not garbage collect them. In the case of infinite structures, this can be a bad thing. This *does* affect even compiled code. The best way to avoid the problem, of course, is to avoid having infinite constants at the top level. Assuming that is impossible, your solution seems acceptable to me. Somebody more knowledgeable or creative than I may be able to come up with something nicer, though. - - Jake -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iEYEARECAAYFAkl/lz0ACgkQye5hVyvIUKl0JgCgx5ddBc0Y44+ghFakr7Mex1RP zfUAnjh9BDI5+A9tEnaox20DbXbipX33 =2MCw -END PGP SIGNATURE- ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
[Haskell-cafe] Space leak with Data.Binary and decodeFile
Hello all, I've been observing a huge space leak with some code using Data.Binary that I cannot make sense of and I hope someone here can shed some light on this, so I'll try to explain my problem as clearly as possible. I qualify the space leak as huge because if I let the program run, it will soon consume the whole memory available (~3G) and finally will get killed by the system. The code I'm writing implements a search algorithm using an inverted index. This index is built from a Trie [Int] (from the bytestring-trie package) and an Array Int ByteString. The trie maps each referenced word to an integer list that is a list of indices into the array. Here is the code for the Index datatype and the obvious Binary instance: data Index = Index { entries :: Array Int ByteString , invidx :: Trie [Int } instance Binary Index where put (Index dirs idx) = put dirs put idx get = liftM2 get get I have no problems creating and seralizing this data structure to a file. The huge leak appears instead when I'm reading this data structure from a file and try to do something with it. This is the smallest test case I came up with that can reproduce the problem : main = do idx - decodeFile list.idx; mapM_ (B.putStrLn . snd) (assocs (entries idx)) The space leak also appears when I try to touch the trie instead of the array. I've been trying tons of combinations involving adding or removing strictness annotations and seq calls in various places with no luck. I have also been adding SCC annotations and tried to profile the code. This seemed to suggest the space leak happens in the get method of the Array instance of Binary : instance (Binary i, Ix i, Binary e) = Binary (Array i e) where [...] get = do bs - get n - get -- read the length xs - replicateM n get -- now the elems. return (listArray bs xs) The output of the profiler tells me that all the space gets allocated from the replicateM n get expression. Now for the really weird part: if I load my code in GHCi and type main, I can observe the space leak. However, if I copy paste the definition of main instead, the code runs fine! This is the only circumstance I've seen this code work instead of eating all the RAM... I have been using GHC 6.10.1, binary 0.4.4 and bytestring-trie 0.1.2. If there's anything else that I can do to understand what's going on, I would gladfully hear about it. Please also tell me if I should provide more information. Thanks, -- Maxime Henrion ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space leak - help needed
On Thu, Mar 13, 2008 at 4:50 PM, Krzysztof Kościuszkiewicz [EMAIL PROTECTED] wrote: Retainers are thunks or objects on stack that keep references to live objects. All retainers of an object are called the object's retainer set. Now when one makes a profiling run, say with ./jobname +RTS -p -hr, the graph refernces retainer sets from jobname.prof. My understanding is that it is the total size of all objects retained by retainer sets being plotted, correct? Yes, all retainer sets are being profiled. However, you can FILTER the retainer sets profiled to those containing certain cost-centres. This is a key point because it allows you to divide-and-conquer when tracking down a retainer leak. That is, if you filter to a certain cost-centre and the retainer graph is flat, you know that cost-centre is not involved. For example, if you have a cost-centre annotation like {-# SCC leaky #-} in your code, you can filter the retainer set like this: Leaky.exe +RTS -hr -hCleaky -RTS Review the documentation for other options. About decoding the sets from jobname.prof - for example in SET 2 = {MAIN.SYSTEM} SET 16 = {Main.CAF, MAIN.SYSTEM} SET 18 = {MAIN.SYSTEM, Main.many1,Main.list,Main.expr,Main.CAF} {...} means it's a set, and ccN,...,cc0 is the retainer cost centre (ccN) and hierarchy of parent cost centres up to the top level (cc0)? My understanding is that SET 18 above refers to objects that are retained by exactly two specified cost centres, right? The docs say An object B retains object A if (i) B is a retainer object and (ii) object A can be reached by recursively following pointers starting from object B, but not meeting any other retainer objects on the way. Each live object is retained by one or more retainer objects, collectively called its retainer set ... That says to me that SET18 above is the set of all objects which are retained by those two call stacks, and only those call stacks. The individual .. items aren't call stacks but I think they refer to where the retaining object (B in the paragraph) was itself retained, so they are like call stacks. My intuition is very fuzzy here. Finally, what is the MAIN.SYSTEM retainer? I think that is everything else - any object created in the runtime system that is not directly attributable to something being profiled. Maybe it is objects from libraries that were not compiled with profiling? I imagine objects created by the GHC primitives would fall in this category too. Since someone else found your space leak, does the retainer profiling advice point to it? I'd like to know if it is actually accurate or not! I've only applied it in some very limited situations. Justin ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space leak - help needed
Krzysztof Kościuszkiewicz wrote: I have tried both Poly.StateLazy and Poly.State and they work quite well - at least the space leak is eliminated. Now evaluation of the parser state blows the stack... The code is at http://hpaste.org/6310 Apparently, stUpdate is too lazy. I'd define stUpdate' :: (s - s) - Parser s t () stUpdate' f = stUpdate f stGet = (`seq` return ()) and try using stUpdate' instead of stUpdate in incCount. HTH, Bertram ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space leak - help needed
On Thu, Mar 13, 2008 at 05:52:05PM +0100, Bertram Felgenhauer wrote: ... Now evaluation of the parser state blows the stack... The code is at http://hpaste.org/6310 Apparently, stUpdate is too lazy. I'd define stUpdate' :: (s - s) - Parser s t () stUpdate' f = stUpdate f stGet = (`seq` return ()) and try using stUpdate' instead of stUpdate in incCount. Yes, that solves the stack issue. Thanks! -- Krzysztof Kościuszkiewicz Skype: dr.vee, Gadu: 111851, Jabber: [EMAIL PROTECTED] Simplicity is the ultimate sophistication -- Leonardo da Vinci ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space leak - help needed
On Wed, Mar 12, 2008 at 12:34:38PM -0700, Justin Bailey wrote: The stack blows up when a bunch of unevaluated thunks build up, and you try to evaluate them. One way to determine where those thunks are getting built is to use GHCs retainer profiling. Retainer sets will show you the call stack that is holding on to memory. That can give you a clue where these thunks are being created. To get finer-grained results, annotate your code with {#- SCC ... #-} pragmas. Then you can filter the retainer profile by those annotations. That will help you determine where in a given function the thunks are being created. If you need help with profiling basics, feel free to ask. I'm not entirely sure if I understand retainer profiling correctly... So please clarify if you spot any obvious blunders. Retainers are thunks or objects on stack that keep references to live objects. All retainers of an object are called the object's retainer set. Now when one makes a profiling run, say with ./jobname +RTS -p -hr, the graph refernces retainer sets from jobname.prof. My understanding is that it is the total size of all objects retained by retainer sets being plotted, correct? About decoding the sets from jobname.prof - for example in SET 2 = {MAIN.SYSTEM} SET 16 = {Main.CAF, MAIN.SYSTEM} SET 18 = {MAIN.SYSTEM, Main.many1,Main.list,Main.expr,Main.CAF} {...} means it's a set, and ccN,...,cc0 is the retainer cost centre (ccN) and hierarchy of parent cost centres up to the top level (cc0)? My understanding is that SET 18 above refers to objects that are retained by exactly two specified cost centres, right? Finally, what is the MAIN.SYSTEM retainer? Thanks in advance, -- Krzysztof Kościuszkiewicz Skype: dr.vee, Gadu: 111851, Jabber: [EMAIL PROTECTED] Simplicity is the ultimate sophistication -- Leonardo da Vinci ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space leak - help needed
On Mon, Mar 03, 2008 at 05:20:09AM +0100, Bertram Felgenhauer wrote: Another story from an (almost) happy Haskell user that finds himself overwhelmed by laziness/space leaks. I'm trying to parse a large file (600MB) with a single S-expression like structure. With the help of ByteStrings I'm down to 4min processing time in constant space. However, when I try to wrap the parse results in a data structure, the heap blows up - even though I never actually inspect the structure being built! This bugs me, so I come here looking for answers. The polyparse library (http://www.cs.york.ac.uk/fp/polyparse/) offers some lazy parsers, maybe one of those fits your needs. Text.ParserCombinators.Poly.StateLazy is the obvious candidate. I have tried both Poly.StateLazy and Poly.State and they work quite well - at least the space leak is eliminated. Now evaluation of the parser state blows the stack... The code is at http://hpaste.org/6310 Thanks in advance, -- Krzysztof Kościuszkiewicz Skype: dr.vee, Gadu: 111851, Jabber: [EMAIL PROTECTED] Simplicity is the ultimate sophistication -- Leonardo da Vinci ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space leak - help needed
On Wed, Mar 12, 2008 at 12:12 PM, Krzysztof Kościuszkiewicz [EMAIL PROTECTED] wrote: I have tried both Poly.StateLazy and Poly.State and they work quite well - at least the space leak is eliminated. Now evaluation of the parser state blows the stack... The code is at http://hpaste.org/6310 Thanks in advance, The stack blows up when a bunch of unevaluated thunks build up, and you try to evaluate them. One way to determine where those thunks are getting built is to use GHCs retainer profiling. Retainer sets will show you the call stack that is holding on to memory. That can give you a clue where these thunks are being created. To get finer-grained results, annotate your code with {#- SCC ... #-} pragmas. Then you can filter the retainer profile by those annotations. That will help you determine where in a given function the thunks are being created. If you need help with profiling basics, feel free to ask. Justin ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
[Haskell-cafe] Space leak - help needed
Dear Haskellers, Another story from an (almost) happy Haskell user that finds himself overwhelmed by laziness/space leaks. I'm trying to parse a large file (600MB) with a single S-expression like structure. With the help of ByteStrings I'm down to 4min processing time in constant space. However, when I try to wrap the parse results in a data structure, the heap blows up - even though I never actually inspect the structure being built! This bugs me, so I come here looking for answers. Parser follows: module Main where import qualified Data.ByteString.Lazy.Char8 as B import Text.ParserCombinators.Parsec import Text.ParserCombinators.Parsec.Pos import System.Environment import System.Exit import qualified Data.Map as M import Lexer type XdlParser a = GenParser Token XdlState a -- Parser state type XdlState = Counts type Counts = M.Map Count Integer data Count = ListCount | SymbolCount deriving (Eq, Ord, Show) emptyXdlState = M.empty incCount :: Count - (Counts - Counts) incCount c = M.insertWith' (+) c 1 -- handling tokens myToken :: (Token - Maybe a) - XdlParser a myToken test = token showTok posTok testTok where showTok = show posTok = const (initialPos ) testTok = test -- Syntax of expressions data Exp = Sym !B.ByteString | List ![Exp] deriving (Eq, Show) expr = list | symbol rparen = myToken $ \t - case t of RParen - Just () other - Nothing lparen = myToken $ \t - case t of LParen - Just () other - Nothing name = myToken $ \t - case t of Name n - Just n other - Nothing list = do updateState $ incCount ListCount lparen xs - many1 expr rparen return () -- return $! (List xs) symbol = do updateState $ incCount SymbolCount name return () -- Sym `fmap` name -- Top level parser top :: XdlParser XdlState top = do l - many1 list eof getState main = do args - getArgs case args of [fname] - do text - B.readFile fname let result = runParser top emptyXdlState fname (tokenize text) putStrLn $ either show show result _ - putStrLn usage: parse filename exitFailure And the Lexer: module Lexer (Token(..), tokenize) where import qualified Data.ByteString.Lazy.Char8 as B import Control.Monad import Data.Char import Data.List import System.Environment import System.Exit data Token = LParen | RParen | Name B.ByteString deriving (Ord, Eq, Show) type Input = B.ByteString -- Processor returns Nothing if it can't process the Input type Processor = Input - Maybe ([Token], Input) -- Tokenize ends the list when all processors return Nothing tokenize :: Input - [Token] tokenize = concat . unfoldr processAll where processors= [doSpaces, doComment, doParens, doName] processAll :: Processor processAll bs = if B.null bs then Nothing else foldr mminus Nothing $ map ($ bs) processors mminus a@(Just _) _ = a mminus Nothingb = b doSpaces:: Processor doSpaces bs = if B.null sp then Nothing else Just ([], nsp) where (sp, nsp) = B.span isSpace bs doComment:: Processor doComment bs = if B.pack # `B.isPrefixOf` bs then Just ([], B.dropWhile (/= '\n') bs) else Nothing doParens :: Processor doParens bs = case B.head bs of '(' - Just ([LParen], B.tail bs) ')' - Just ([RParen], B.tail bs) _ - Nothing doName :: Processor doName bs = if B.null nsp then Nothing else Just ([Name nsp], sp) where (nsp, sp) = B.span (not . isRest) bs isRest c = isSpace c || c == ')' || c == '(' Regards, -- Krzysztof Kościuszkiewicz Skype: dr.vee, Gadu: 111851, Jabber: [EMAIL PROTECTED] Simplicity is the ultimate sophistication -- Leonardo da Vinci ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space leak - help needed
On Mon, Mar 3, 2008 at 2:23 AM, Krzysztof Kościuszkiewicz [EMAIL PROTECTED] wrote: Dear Haskellers, Another story from an (almost) happy Haskell user that finds himself overwhelmed by laziness/space leaks. I'm trying to parse a large file (600MB) with a single S-expression like structure. With the help of ByteStrings I'm down to 4min processing time in constant space. However, when I try to wrap the parse results in a data structure, the heap blows up - even though I never actually inspect the structure being built! This bugs me, so I come here looking for answers. Well, I haven't read this through, but superficially, it looks like you're expecting the data structure to be constructed lazily. But... -- Syntax of expressions data Exp = Sym !B.ByteString | List ![Exp] deriving (Eq, Show) It is declared as strict, so it's not going to be constructed lazily... Luke ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space leak - help needed
Krzysztof Kościuszkiewicz wrote: Another story from an (almost) happy Haskell user that finds himself overwhelmed by laziness/space leaks. I'm trying to parse a large file (600MB) with a single S-expression like structure. With the help of ByteStrings I'm down to 4min processing time in constant space. However, when I try to wrap the parse results in a data structure, the heap blows up - even though I never actually inspect the structure being built! This bugs me, so I come here looking for answers. Note that Parsec has to parse the whole file before it can decide whether to return a result (Left _) or an error (Right _). ghc would have to be quite smart to eliminate the creation of the expression tree entirely. The polyparse library (http://www.cs.york.ac.uk/fp/polyparse/) offers some lazy parsers, maybe one of those fits your needs. Text.ParserCombinators.Poly.StateLazy is the obvious candidate. HTH, Bertram ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] space leak?
Massimiliano, I had to update your code for it to compile (removed sequence from testpdf'. However, I don't see any significant difference in the memory profile of either testpdf or testpdf'. Not sure how you are watching the memory usage, but if you didn't know the option +RTS -sstderr will print out useful memory statistics when you run your program. E.g.: pdf_test.exe +RTS -sstderr gives: 2,157,524,764 bytes allocated in the heap 246,516,688 bytes copied during GC (scavenged) 6,086,688 bytes copied during GC (not scavenged) 45,107,704 bytes maximum residency (8 sample(s)) 4086 collections in generation 0 ( 0.61s) 8 collections in generation 1 ( 0.67s) 129 Mb total memory in use INIT time0.02s ( 0.00s elapsed) MUT time5.83s ( 7.48s elapsed) GCtime1.28s ( 1.45s elapsed) RPtime0.00s ( 0.00s elapsed) PROF time0.00s ( 0.00s elapsed) EXIT time0.00s ( 0.00s elapsed) Total time7.13s ( 8.94s elapsed) %GC time 18.0% (16.3% elapsed) Alloc rate369,202,098 bytes per MUT second Productivity 81.8% of total user, 65.2% of total elapsed Above you can see 45 MB was the max amount of memory ever in use - and according to the heap profiling I did it's about constant. I saw the same results when using testpdf'. A few tricks I've learned to reduce space usage: * Use strict returns ( return $! ...) * foldl' over foldr unless you have to use foldr. * Profile, profile, profile - understand who is hanging on to the memory (+RTS -hc) and how it's being used (+RTS -hb). * Use +RTS -p to understand who's doing all the allocations and where your time is being spent. * Approach profiling like a science experiment - make one change, observe if anything is different, rollback and make another change - observer the change. Keep notes! Good luck! Justin ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
[Haskell-cafe] space leak?
( these two lines are just to fool the gmane post algorithm which complains for top-posting) Hi, i'm learning Haskell and trying to use the HPDF 1.2 library I've come across some large memory consumption for which I do not understand the origin. I've tried heap profiling but without much success. This is my code module Main where import Control.Monad.State import Graphics.PDF data Opcodes = Rect | Ship deriving (Show) doPage (Rect:ops) = do stroke $! Rectangle 10.0 10.0 10.0 10.0 doPage ops doPage l = return l doOps [] = return () doOps (Ship:ops) = {-# SCC OPSHIP #-} do p - addPage Nothing ops' - drawWithPage p $! do strokeColor red applyMatrix $ (translate 72.0 72.0) doPage ops doOps ops' doOps (op:_) = error (unexpected ++ show op) testpdf = do let ops = concat $ replicate 100 (Ship : (replicate 1000 Rect )) pageRect = PDFRect 0 0 (floor $ 21.0/2.54*72.0) (floor $ 29.7/2.54*72.0) runPdf test1.pdf (standardDocInfo { author=toPDFString mgubi, compressed = False}) pageRect $ doOps ops testpdf' = do let pageRect = PDFRect 0 0 (floor $ 21.0/2.54*72.0) (floor $ 29.7/2.54*72.0) runPdf full.pdf (standardDocInfo { author=toPDFString mgubi, compressed = False}) pageRect $ sequence_ $ foldM f [] $ replicate 100 $ (\p - sequence_ $ replicate 1000 $ drawWithPage p $ stroke $ Rectangle 0.0 0.0 10.0 10.0) where f ps acts = do p - addPage Nothing acts p return $ p:ps main = testpdf now, if I run testpdf' then memory profile is very low and everything is as expected while if I run testpdf then the profile grows up to 80MB and more. This is the stripped down version of the original program (which is a DVI interpreter) so there I will have also some StateT and more complicated opcodes. I would like to know what is wrong with the above code. Could someone help me? thanks, Massimiliano Gubinelli ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space Leak Help
On Saturday 03 February 2007 19:56, Pepe Iborra wrote: pad :: [Word8] - [Word8] pad xs = pad' xs 0 pad' (x:xs) l = x : pad' xs (succ l) pad' [] l = [0x80] ++ ps ++ lb where pl = (64-(l+9)) `mod` 64 ps = replicate pl 0x00 lb = i2osp 8 (8*l) Pepe, Thanks but this gives me a different problem [EMAIL PROTECTED]:~/sha1 ./allInOne 101 +RTS -hc -RTS [2845392438,1191608682,3124634993,2018558572,2630932637] [2224569924,473682542,3131984545,4182845925,3846598897] Stack space overflow: current size 8388608 bytes. Use `+RTS -Ksize' to increase it. Someone suggested pad :: Num a = [a] - [a] pad = pad' 0 where pad' !l [] = [0x80] ++ ps ++ lb where pl = (64-(l+9)) `mod` 64 ps = replicate pl 0x00 lb = i2osp 8 (8*l) pad' !l (x:xs) = x : pad' (l+1) xs but that didn't compile *Main :r [2 of 2] Compiling Main ( allInOne.hs, interpreted ) allInOne.hs:83:14: Parse error in pattern Failed, modules loaded: Codec.Utils. Dominic. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space Leak Help
On Sun, Feb 04, 2007 at 08:20:23AM +, Dominic Steinitz wrote: Someone suggested pad :: Num a = [a] - [a] pad = pad' 0 where pad' !l [] = [0x80] ++ ps ++ lb where pl = (64-(l+9)) `mod` 64 ps = replicate pl 0x00 lb = i2osp 8 (8*l) pad' !l (x:xs) = x : pad' (l+1) xs but that didn't compile *Main :r [2 of 2] Compiling Main ( allInOne.hs, interpreted ) allInOne.hs:83:14: Parse error in pattern Failed, modules loaded: Codec.Utils. Dominic. The '!' is a GHC extension, enabled using the flag '-fbang-patterns'. Equivalently, you can use Haskell 98's seq : pad :: Num a = [a] - [a] pad = pad' 0 where pad' l [] | l `seq` False = undefined pad' l [] = [0x80] ++ ps ++ lb where pl = (64-(l+9)) `mod` 64 ps = replicate pl 0x00 lb = i2osp 8 (8*l) pad' l (x:xs) = x : pad' (l+1) xs The first alternative never succeeds, but to see that the compiler must force the evaluation of 'l'. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space Leak Help
On Saturday 03 February 2007 19:42, [EMAIL PROTECTED] wrote: I have re-written SHA1 so that is more idiomatically haskell and it is easy to see how it implements the specification. The only problem is I now have a space leak. I can see where the leak is but I'm less sure what to do about getting rid of it. Here's the offending function: pad :: [Word8] - [Word8] pad xs = xs ++ [0x80] ++ ps ++ lb where l = length xs pl = (64-(l+9)) `mod` 64 ps = replicate pl 0x00 lb = i2osp 8 (8*l) I would try something along the following lines (untested): \begin{spec} catWithLen xs f = xs ++ f (length xs) \end{spec} \begin{code} catWithLen :: [a] - (Int - [a]) - [a] catWithLen xs f = h 0 xs where h k [] = f k h k (x : xs) = case succ k of-- forcing evaluation k' - x : h k' xs \end{code} \begin{code} pad :: [Word8] - [Word8] pad xs = catWithLen xs f where f l = 0x80 : ps lb where -- we know that |l = length xs| pl = (64-(l+9)) `mod` 64 ps = funPow pl (0x00 :) lb = i2osp 8 (8*l) \end{code} If you are lucky, then the replicate and the (++lb) in the original code might be fused by the compiler as an instance of foldr-build or something related --- check the optimised core code. Wolfram, Thanks but this gives a different problem: [EMAIL PROTECTED]:~/sha1 ./allInOne 101 +RTS -hc -RTS [2845392438,1191608682,3124634993,2018558572,2630932637] [2224569924,473682542,3131984545,4182845925,3846598897] Stack space overflow: current size 8388608 bytes. Use `+RTS -Ksize' to increase it. Dominic. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space Leak Help
On Sun, Feb 04, 2007 at 08:30:44AM +, Dominic Steinitz wrote: On Saturday 03 February 2007 19:42, [EMAIL PROTECTED] wrote: I would try something along the following lines (untested): \begin{spec} catWithLen xs f = xs ++ f (length xs) \end{spec} \begin{code} catWithLen :: [a] - (Int - [a]) - [a] catWithLen xs f = h 0 xs where h k [] = f k h k (x : xs) = case succ k of-- forcing evaluation k' - x : h k' xs Nice try. k', as a variable binding, is irrefutable. \end{code} \begin{code} pad :: [Word8] - [Word8] pad xs = catWithLen xs f where f l = 0x80 : ps lb where -- we know that |l = length xs| pl = (64-(l+9)) `mod` 64 ps = funPow pl (0x00 :) lb = i2osp 8 (8*l) \end{code} Thanks but this gives a different problem: [EMAIL PROTECTED]:~/sha1 ./allInOne 101 +RTS -hc -RTS [2845392438,1191608682,3124634993,2018558572,2630932637] [2224569924,473682542,3131984545,4182845925,3846598897] Stack space overflow: current size 8388608 bytes. Use `+RTS -Ksize' to increase it. expected result of the excessive laziness above. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space Leak Help
On Sunday 04 February 2007 08:28, Stefan O'Rear wrote: On Sun, Feb 04, 2007 at 08:20:23AM +, Dominic Steinitz wrote: Someone suggested pad :: Num a = [a] - [a] pad = pad' 0 where pad' !l [] = [0x80] ++ ps ++ lb where pl = (64-(l+9)) `mod` 64 ps = replicate pl 0x00 lb = i2osp 8 (8*l) pad' !l (x:xs) = x : pad' (l+1) xs but that didn't compile *Main :r [2 of 2] Compiling Main ( allInOne.hs, interpreted ) allInOne.hs:83:14: Parse error in pattern Failed, modules loaded: Codec.Utils. Dominic. The '!' is a GHC extension, enabled using the flag '-fbang-patterns'. The test program runs to completion but still has a space leak consuming over 25m. Equivalently, you can use Haskell 98's seq : pad :: Num a = [a] - [a] pad = pad' 0 where pad' l [] | l `seq` False = undefined pad' l [] = [0x80] ++ ps ++ lb where pl = (64-(l+9)) `mod` 64 ps = replicate pl 0x00 lb = i2osp 8 (8*l) pad' l (x:xs) = x : pad' (l+1) xs The first alternative never succeeds, but to see that the compiler must force the evaluation of 'l'. [EMAIL PROTECTED]:~/sha1 ./allInOne 101 +RTS -hc -RTS [2845392438,1191608682,3124634993,2018558572,2630932637] [2224569924,473682542,3131984545,4182845925,3846598897] Stack space overflow: current size 8388608 bytes. Use `+RTS -Ksize' to increase it. Dominic. PS I appreciate all the help I'm getting. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space Leak Help
On Sun, Feb 04, 2007 at 09:45:12AM +, Dominic Steinitz wrote: pad :: Num a = [a] - [a] pad = pad' 0 where pad' l [] | l `seq` False = undefined Stupid typo, that should be: where pad' l _ | l `seq` False = undefined pad' l [] = [0x80] ++ ps ++ lb where pl = (64-(l+9)) `mod` 64 ps = replicate pl 0x00 lb = i2osp 8 (8*l) pad' l (x:xs) = x : pad' (l+1) xs The first alternative never succeeds, but to see that the compiler must force the evaluation of 'l'. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space Leak Help
\begin{code} catWithLen :: [a] - (Int - [a]) - [a] catWithLen xs f = h 0 xs where h k [] = f k h k (x : xs) = case succ k of-- forcing evaluation k' - x : h k' xs \end{code} Thanks but this gives a different problem: [EMAIL PROTECTED]:~/sha1 ./allInOne 101 +RTS -hc -RTS [2845392438,1191608682,3124634993,2018558572,2630932637] [2224569924,473682542,3131984545,4182845925,3846598897] Stack space overflow: current size 8388608 bytes. Use `+RTS -Ksize' to increase it. Does it still do that if you youse seq instead of case? \begin{code} catWithLen :: [a] - (Int - [a]) - [a] catWithLen xs f = h 0 xs where h k [] = f k h k (x : xs) = let k' = succ k in k' `seq` x : h k' xs \end{code} Wolfram ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
[Haskell-cafe] Space Leak Help
I have re-written SHA1 so that is more idiomatically haskell and it is easy to see how it implements the specification. The only problem is I now have a space leak. I can see where the leak is but I'm less sure what to do about getting rid of it. Here's the offending function: pad :: [Word8] - [Word8] pad xs = xs ++ [0x80] ++ ps ++ lb where l = length xs pl = (64-(l+9)) `mod` 64 ps = replicate pl 0x00 lb = i2osp 8 (8*l) I've thought about zipping the xs with [1..] which will give me a length as I go. Is this the right way to go are there better techniques for dealing with this? I've attached the full source below. Dominic. module Main(main) where import Data.Char import Data.Bits import Data.List import Data.Word import System import Codec.Utils type Rotation = Int rotL :: Rotation - Word32 - Word32 rotL s a = shiftL a s .|. shiftL a (s-32) instance Num [Word32] where a + b = zipWith (+) a b f n x y z | (0 = n) (n = 19) = (x .. y) .|. ((complement x) .. z) | (20 = n) (n = 39) = x `xor` y `xor` z | (40 = n) (n = 59) = (x .. y) .|. (x .. z) .|. (y .. z) | (60 = n) (n = 79) = x `xor` y `xor` z | otherwise = error invalid index for f k n | (0 = n) (n = 19) = 0x5a827999 | (20 = n) (n = 39) = 0x6ed9eba1 | (40 = n) (n = 59) = 0x8f1bbcdc | (60 = n) (n = 79) = 0xca62c1d6 | otherwise = error invalid index for k -- Word120 - Word512 - Word120 oneBlock ss xs = (as!!80):(bs!!80):(cs!!80):(ds!!80):(es!!80):[] where ws = xs ++ map (rotL 1) (zipWith4 xxxor wm3s wm8s wm14s wm16s) where xxxor a b c d = a `xor` b `xor` c `xor` d wm3s = drop (16-3) ws wm8s = drop (16-8) ws wm14s = drop (16-14) ws wm16s = drop (16-16) ws as = (ss!!0):ts bs = (ss!!1):as cs = (ss!!2):(map (rotL 30) bs) ds = (ss!!3):cs es = (ss!!4):ds ts = (map (rotL 5) as) + (zipWith4 f [0..] bs cs ds) + es + (map k [0..]) + ws ss :: [Word32] ss = [0x67452301, 0xefcdab89, 0x98badcfe, 0x10325476, 0xc3d2e1f0] pad :: [Word8] - [Word8] pad xs = xs ++ [0x80] ++ ps ++ lb where l = length xs pl = (64-(l+9)) `mod` 64 ps = replicate pl 0x00 lb = i2osp 8 (8*l) blockWord8sIn512 :: [Word8] - [[Word8]] blockWord8sIn512 = unfoldr g where g [] = Nothing g xs = Just (splitAt 64 xs) getWord32s :: [Word8] - [Word32] getWord32s s = map f [0..15] where f i = foldl (+) 0 $ map (\n - toEnum (fromEnum (s!!(i*4+n))) `shiftL` (fromIntegral (8 * (3-n [0..3] blockWord32sIn512 :: [Word8] - [[Word32]] blockWord32sIn512 = (map getWord32s) . blockWord8sIn512 . pad -- Word120 - Word512 - Word120 hashOnce ss a = ss + oneBlock ss a hash = foldl' hashOnce ss . blockWord32sIn512 convert :: String - [Word8] convert = map (fromIntegral . ord) short :: [Word8] short = convert abc message :: [Word8] message = convert abcdbcdecdefdefgefghfghighijhijkijkljklmklmnlmnomnopnopq performance n = (convert . take n . repeat) 'a' test n = mapM_ (putStrLn . show . hash) [short, message, performance n] main = do progName - getProgName args - getArgs if length args /= 1 then putStrLn (Usage: ++ progName ++ testSize) else test (read (args!!0)) ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space Leak Help
I have re-written SHA1 so that is more idiomatically haskell and it is easy to see how it implements the specification. The only problem is I now have a space leak. I can see where the leak is but I'm less sure what to do about getting rid of it. Here's the offending function: pad :: [Word8] - [Word8] pad xs = xs ++ [0x80] ++ ps ++ lb where l = length xs pl = (64-(l+9)) `mod` 64 ps = replicate pl 0x00 lb = i2osp 8 (8*l) I would try something along the following lines (untested): \begin{spec} catWithLen xs f = xs ++ f (length xs) \end{spec} \begin{code} catWithLen :: [a] - (Int - [a]) - [a] catWithLen xs f = h 0 xs where h k [] = f k h k (x : xs) = case succ k of-- forcing evaluation k' - x : h k' xs \end{code} \begin{code} pad :: [Word8] - [Word8] pad xs = catWithLen xs f where f l = 0x80 : ps lb where -- we know that |l = length xs| pl = (64-(l+9)) `mod` 64 ps = funPow pl (0x00 :) lb = i2osp 8 (8*l) \end{code} If you are lucky, then the replicate and the (++lb) in the original code might be fused by the compiler as an instance of foldr-build or something related --- check the optimised core code. In my variant I changed this to rely on efficient function powering instead: \begin{spec} funPow k f = foldr (.) id $ replicate k f \end{spec} \begin{code} funPow :: Int - (a - a) - (a - a) funPow n f = case compare n 0 of LT - error (funPow: negative argument: ++ show n) EQ - id GT - pow n f where pow m g = if m 1 then let (m',r) = divMod m 2 g' = g . g in if r == 0 then pow m' g' else pow m' g' . g else g \end{code} (You will probably also consider using Data.Bits for (`mod` 64), (8*), and (`divMod` 2). ) Wolfram ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space Leak Help
hi Dominic Explicit recursion works just fine for me and keeps things simple: pad :: [Word8] - [Word8] pad xs = pad' xs 0 pad' (x:xs) l = x : pad' xs (succ l) pad' [] l = [0x80] ++ ps ++ lb where pl = (64-(l+9)) `mod` 64 ps = replicate pl 0x00 lb = i2osp 8 (8*l) at the cost of (very slightly) hiding data flow. Seems exactly what you were trying to avoid? Cheers pepe ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
[Haskell-cafe] Space leak whilst implementing streams
Hello, I have been using arrows to implement stream processors. At first, I tried using the implementation presented in John Hughes' AFP arrows lectures. However, this appeared to have a space leak in its implementation of the left operator for ArrowChoice. I found a way to remove this space leak, however, I do not really understand why there was a space leak in the first place. I would really appreciate any light that could be shed on this. Below I include the AFP implementation (SF) and a modified implementation that no longer has the space leak (SF'). I am using ghc 6.4.2. Many thanks. import Control.Arrow import Data.Maybe main = test test = print (runSF p (repeat 1)) test' = print (runSF' p (repeat (Just 1))) -- heap profile appears to grow linearly for test, but not test' p :: ArrowChoice a = a Int (Either Int Int) p = arr Right left (arr id) newtype SF a b = SF {runSF :: [a] - [b]} instance Arrow SF where arr f = SF (map f) SF f SF g = SF (f g) first (SF f) = SF (unzip first f uncurry zip) instance ArrowChoice SF where left (SF f) = SF (\xs - combine xs (f [y | Left y - xs])) where combine (Left _:xs) (z:zs) = Left z :combine xs zs combine (Right r:xs) zs = Right r:combine xs zs combine [] _ = [] -- SF' does not exhibit the space leak newtype SF' a b = SF' {runSF' :: [Maybe a] - [Maybe b]} instance Arrow SF' where arr f = SF' (map (maybe Nothing (Just . f))) SF' f SF' g = SF' (f g) first (SF' f) = SF' (maybe_unzip first f uncurry maybe_zip) where maybe_unzip = foldr mu ([],[]) where mu Nothing ~(xs,ys) = (Nothing:xs, Nothing: ys) mu (Just (x,y)) ~(xs,ys) = (Just x:xs, Just y: ys) maybe_zip = zipWith mz where mz (Just x) (Just y) = Just (x,y) mz Nothing Nothing = Nothing instance ArrowChoice SF' where left (SF' f) = SF' (\xs - xs `combined` f (map drop_right xs)) where combined = zipWith merge_left drop_right Nothing = Nothing drop_right (Just (Left l)) = Just l drop_right (Just (Right _)) = Nothing merge_left Nothing Nothing = Nothing merge_left (Just (Left _)) (Just x) = Just (Left x) merge_left (Just (Right r)) Nothing = Just (Right r) ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space leak whilst implementing streams
[EMAIL PROTECTED] wrote: I found a way to remove this space leak, however, I do not really understand why there was a space leak in the first place. I would really appreciate any light that could be shed on this. instance ArrowChoice SF where left (SF f) = SF (\xs - combine xs (f [y | Left y - xs])) where combine (Left _:xs) (z:zs) = Left z :combine xs zs combine (Right r:xs) zs = Right r:combine xs zs combine [] _ = [] The list comprehension is holding onto 'xs' until something of the 'zs' is consumed. That only happens when 'xs' contains 'Left _', and your input input stream doesn't. So the whole stream is leaked, which indeed produces a linear space profile. The easiest way to correct it, is to consume the list 'xs' only once: instance ArrowChoice SF where left (SF f) = SF (map f') where f' (Left x) = Left (f x) f' (Right y) = Right y or simply instance ArrowChoice SF where left (SF f) = SF (map (left f)) instance ArrowChoice SF' where left (SF' f) = SF' (\xs - xs `combined` f (map drop_right xs)) where combined = zipWith merge_left Here you're also risking the leak of a whole copy of 'xs', but since you end up consuming both lists at the same pace, nothing bad happens. Udo. -- Nicht alles was hinkt ist ein Vergleich. signature.asc Description: Digital signature ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
[Haskell-cafe] Space leak when returning pairs?
Dear members, I am experiencing a space leak, which I suspect to be an instance of the problem addressed by Wadler before. I'd appreciate if someone here would take a look. Given the following datatype: data XMLEvent = StartEvent String | EndEvent String | TextEvent String deriving Show where the three constructors represent the start tag (a where a is a string), end tag (/a), and text data, respectively, of an XML stream. (They are actually from the library HXML). The following function simply returns the same stream while doing a minimal amount of validation (ignoring the closing tag). idX :: [XMLEvent] - ([XMLEvent], [XMLEvent]) idX [] = ([], []) idX (StartEvent a : strm) = let (ts, strm') = idX strm (us, strm'') = idX strm' in (StartEvent a [] : ts ++ EndEvent a : us, strm'') idX (EndEvent _: strm) = ([], strm) idX (TextEvent s : strm) = let (ts, strm') = idX strm in (TextEvent s : ts, strm') The function idX returns a pair, where the first component is the processed stream, while the second component is the rest of the input. The intention is to thread the input and release processed events. If the function is used in a pipelined manner: print . fst . idX . parseInstance $ input where parseInstance :: String - [XMLEvent] is a lexical analyser, I would expect that the input stream will be consumed one by one as the output is produced, and the program would use a constant amount of heap space (while keeping a program stack proportional to the depth of the XML tree). This was not the case, however. For some reason the heap grew linearly. The GHC profiler showed that most of the thunks are created by parseInstance, which implies that the entire input stream somehow still resides in memory. I was wondering where the space leak came from and suspected that it's the leak described in one of Philip Wadler's early paper Fixing Some Space Leaks With a Garbage Collector (1987). Consider idX (StartEvent a : strm) = let (ts, strm') = idX strm (us, strm'') = idX strm' in (StartEvent a [] : ts ++ EndEvent a : us, strm'') The body of the let clause might have actually been treated as (StartEvent a [] : ts ++ EndEvent a : fst (idX (snd strm)), snd (idX (snd strm))) Therefore strm will not be freed until the whole evaluation finishes. But since Wadler has addressed this problem a long time ago, I think the fix he proposed should have been implemented in GHC already. Was that really the source of the space leak? If so is there a way to fix it? Or is there another reason for the leak? * * * The function idX above actually resulted from fusing two functions: buildF parses a stream into a forest, while idF flattens the tree. data ETree = Element String [ETree] | Text String buildF :: [XMLEvent] - ([ETree], [XMLEvent]) buildF [] = ([],[]) buildF (StartEvent a : strm) = let (ts, strm') = buildF strm (us, strm'') = buildF strm' in (Element a ts : us, strm'') buildF (EndEvent _ : strm) = ([], strm) buildF (TextEvent s : strm) = let (ts, strm') = buildF strm in (Text s : ts, strm') idF :: [ETree] - [XMLEvent] idF [] = [] idF (Element a ts : us) = StartEvent a : idF ts ++ EndEvent a : idF us idF (Text s : ts) = TextEvent s : idF ts My original program was like print . idF . fst . buildF . parseInstance $ input and it had the same space leak. By accident (assuming that the input is always a single tree), I discovered that if I added a wrap . head into the pipe: print . idF . wrap . head . fst . buildF . parseInstance $ input where wrap a = [a], the heap residency would reduce by half! The output remains the same. My explanation is that applying a head means that (in the definition of buildF): buildF (StartEvent a : strm) = let (ts, strm') = buildF strm (us, strm'') = buildF strm' in (Element a ts : us, strm'') the us part , which contains a reference to strm, need not be kept. Thus the reduced heap residency. This seems to support that the space leak resulted from the same problem Wadler addressed before. But isn't that solved already in GHC? * * * I'd appreciate if someone could look into it. The actual program is available at http://www.psdlab.org/~scm/hx.tar.gz It actually uses HXML, a library by Joe (Joe English?) to do the parsing. The main program is hxpc.hs. There is a 1 MB sized sample input file size1m.xml. There are two simple scripts recompile and runhxpc for compiling and running the program. Thank you! sincerely, Shin-Cheng Mu ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space leak when returning pairs?
On Fri, 19 May 2006, Shin-Cheng Mu wrote: idX :: [XMLEvent] - ([XMLEvent], [XMLEvent]) idX [] = ([], []) idX (StartEvent a : strm) = let (ts, strm') = idX strm (us, strm'') = idX strm' in (StartEvent a [] : ts ++ EndEvent a : us, strm'') idX (EndEvent _: strm) = ([], strm) idX (TextEvent s : strm) = let (ts, strm') = idX strm in (TextEvent s : ts, strm') let ~(ts, strm') = idX strm ~(us, strm'') = idX strm' ? ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space leak when returning pairs?
Dear Henning, On May 19, 2006, at 6:16 PM, Henning Thielemann wrote: On Fri, 19 May 2006, Shin-Cheng Mu wrote: idX :: [XMLEvent] - ([XMLEvent], [XMLEvent]) idX (StartEvent a : strm) = let (ts, strm') = idX strm (us, strm'') = idX strm' in (StartEvent a [] : ts ++ EndEvent a : us, strm'') let ~(ts, strm') = idX strm ~(us, strm'') = idX strm' Oh, I just tried using lazy patterns for the case for StartEvent and TextEvent. But the space behaviour remained the same. :( sincerely, Shin-Cheng Mu ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space leak when returning pairs?
Henning Thielemann [EMAIL PROTECTED] wrote: let ~(ts, strm') = idX strm ~(us, strm'') = idX strm' Let-bindings are already lazy, so the ~ here is redundant. Regards, Malcolm ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space leak when returning pairs?
On May 19, 2006, at 6:16 PM, Henning Thielemann wrote: let ~(ts, strm') = idX strm ~(us, strm'') = idX strm' I seem to have found a partial solution to the problem. It's rather ugly, however, and I think there should be a better way. The original definition for one of the clauses was like this: idX (StartEvent a : strm) = let (ts, strm') = idX strm (us, strm'') = idX strm' in (StartEvent a [] : ts ++ EndEvent a : us, strm'') The intention was to thread strm through the recursive calls and free the items in strm as soon as they are processed. However, with this definition I needed about 8Mb of heap for a 1Mb input file. Since I suspected that delayed evaluation of us and strm'' keeps the pointer to strm around for too long, the natural solution seems to be forcing their evaluation. So I tried: idX (StartEvent a : strm) = case idX strm of (ts, strm') - case idX strm' of (us, strm'') - (StartEvent a [] : ts ++ EndEvent a : us, strm'') which made things worse. The heap size increased a bit more and I get a typical bell shaped heap profile suggesting idX cannot output anything before processing the whole input. So I have to allow idX to output something while consuming the input. What eventually worked was this messy mixture of let and case: idX (StartEvent a : strm) = let (xs, rest) = case idX strm of (ts, strm') - let (ws, rest) = case idX strm' of (us, strm'') - (us, strm'') in (ts ++ EndEvent a : ws, rest) in (StartEvent a [] : xs, rest) The heap residency reduced from about 8MB to around 80Kb. This is rather tricky, however. The following nicer looking version has a heap profile as bad as the worse case. idX (StartEvent a : strm) = let (ts,us,strm'') = case idXsp strm of (ts, strm') - case idXsp strm' of (us, strm'') - (ts, us, strm'') in (StartEvent a [] : ts ++ EndEvent a : us, strm'') And I cannot quite explain why (Anyone?). Is there a more structured solution? Can I leave the hard work to the compiler? sincerely, Shin-Cheng Mu ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space leak when returning pairs?
Shin-Cheng Mu wrote: Dear members, I am experiencing a space leak, which I suspect to be an instance of the problem addressed by Wadler before. I'd appreciate if someone here would take a look. Given the following datatype: data XMLEvent = StartEvent String | EndEvent String | TextEvent String deriving Show where the three constructors represent the start tag (a where a is a string), end tag (/a), and text data, respectively, of an XML stream. (They are actually from the library HXML). The following function simply returns the same stream while doing a minimal amount of validation (ignoring the closing tag). idX :: [XMLEvent] - ([XMLEvent], [XMLEvent]) idX [] = ([], []) idX (StartEvent a : strm) = let (ts, strm') = idX strm (us, strm'') = idX strm' in (StartEvent a [] : ts ++ EndEvent a : us, strm'') idX (EndEvent _: strm) = ([], strm) idX (TextEvent s : strm) = let (ts, strm') = idX strm in (TextEvent s : ts, strm') The function idX returns a pair, where the first component is the processed stream, while the second component is the rest of the input. The intention is to thread the input and release processed events. If the function is used in a pipelined manner: print . fst . idX . parseInstance $ input I would not have written idX in this manner. My version is data XMLEvent = StartEvent String | EndEvent String | TextEvent String deriving Show idX' :: [XMLEvent] - [XMLEvent] idX' = untilTags [] untilTags :: [String] - [XMLEvent] - [XMLEvent] untilTags [] [] = [] untilTags tags [] = error (Did not find closing EndEvents ++ show tags) untilTags tags (x:xs) = case x of StartEvent tag' - x : untilTags (tag':tags) xs EndEvent tag' - if null tags then error (Unexpected EndEvent with tag ++ show tag') else let (tag:tags') = tags in if tag == tag' then x : untilTags tags' xs else error (Expected EndEvents ++ show tags ++ but found EndEvent ++ show tag') TextEvent _ - x : untilTags tags xs Here the flow of the input x : ... to the output x : ... is more obvious, and the open tags are kept in a stack and used to check the closing tags. The memory usage will be only slightly higher than your idX due to keeping the list of open tag names. If you want to remove that overhead and assume the ending closing tags have correct strings, then you can keep a mere count of open tags: countTags :: Int - [XMLEvent] - [XMLEvent] countTags 0 [] = [] countTags n [] = error (Did not find the closing EndEvents ++ show n) countTags n (x:xs) = case x of StartEvent tag' - x : countTags (succ n) xs EndEvent tag' - if 0 == n then error (Unexpected EndEvent with tag ++ show tag') else x : countTags (pred n) xs TextEvent _ - x : countTags n xs ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space leak when returning pairs?
Shin-Cheng Mu [EMAIL PROTECTED] wrote: I was wondering where the space leak came from and suspected that it's the leak described in one of Philip Wadler's early paper Fixing Some Space Leaks With a Garbage Collector (1987). But since Wadler has addressed this problem a long time ago, I think the fix he proposed should have been implemented in GHC already. Was that really the source of the space leak? If so is there a way to fix it? Or is there another reason for the leak? I compiled and profiled your code using nhc98 instead of ghc. There was no space leak visible. Since nhc98 does implement the Wadler GC fix (with refinements by Sparud), I can only assume that this does indeed account for the space improvement over ghc. Regards, Malcolm ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe