Re: [Haskell-cafe] Space leak in hexpat-0.20.3/List-0.5.1

2013-04-30 Thread oleg

Wren Thornton wrote:
 So I'm processing a large XML file which is a database of about 170k
 entries, each of which is a reasonable enough size on its own, and I only
 need streaming access to the database (basically printing out summary data
 for each entry). Excellent, sounds like a job for SAX.

Indeed, a good job for a SAX-like parser. XMLIter is exactly such a
parser, and it generates an event stream quite like Expat's. Also,
your application is somewhat similar to the following:
http://okmij.org/ftp/Haskell/Iteratee/XMLookup.hs

So, it superficially seems XMLIter should be up to the task. Can you
explain which elements you are counting? BTW, xml_enum already checks
for the well-formedness of the XML (including the start-end tag
balance, and many more criteria). One can assume that the XMLStream
corresponds to a well-formed document and only count the desired
start tags (or end tags, for that matter).
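For what it's worth, the tag-counting step itself is tiny once well-formedness
is taken for granted. Here is a rough sketch over a plain list of hexpat SAX
events (just the idea, not the XMLIter/iteratee interface; the function name
is only illustrative):

import Data.Text (Text)
import Text.XML.Expat.SAX (SAXEvent(..))

-- Count the start tags with a given name, assuming the stream is already
-- known to be well formed, so there is no need to match up end tags.
countStartTags :: Text -> [SAXEvent Text Text] -> Int
countStartTags name = length . filter isStart
  where
    isStart (StartElement t _) = t == name
    isStart _                  = False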




___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] Space leak in hexpat-0.20.3/List-0.5.1

2013-04-28 Thread wren ng thornton
Hello all,

So I'm processing a large XML file which is a database of about 170k
entries, each of which is a reasonable enough size on its own, and I only
need streaming access to the database (basically printing out summary data
for each entry). Excellent, sounds like a job for SAX.

However, after whipping up a simplified version of the program using
hexpat, there's a space leak. Near as I can tell, it's not a problem with
my code, it's a problem with Data.List.Class (or hexpat's use thereof).
The simplified code follows, just compile it for profiling and use hp2ps
to see what I mean. The file I'm running it on can be found at:

ftp://ftp.monash.edu.au/pub/nihongo/JMdict.gz

Any ideas on what the problem really is, or how to fix it?




module JMdict (main) where

import           Text.XML.Expat.SAX   (SAXEvent(..))
import qualified Text.XML.Expat.SAX   as SAX
import           Text.XML.Expat.Tree  (NodeG(..))
import qualified Text.XML.Expat.Tree  as DOM
import qualified Text.XML.Expat.Proc  as Proc
import qualified Text.XML.Expat.Internal.NodeClass as Node
import qualified Data.ByteString.Lazy as BL
import           Data.Char            (isSpace)
import           Data.Text.IO         as TIO
import qualified Data.Text            as T
import           Control.Applicative  ((<$>))
import           Control.Monad        (forM_)
import qualified System.IO            as IO
import qualified System.Environment   as Sys (getArgs)
import qualified System.Exit          as Sys (exitFailure)
import qualified System.Directory     as Sys
    (doesFileExist, getPermissions, readable)


-- | A variant of 'Control.Monad.unless' for when the boolean is
-- also monadic.
unlessM :: Monad m => m Bool -> m () -> m ()
unlessM mb handle = do
    b <- mb
    if b then return () else handle


-- | If the file does not exist or is not readable, then crash the
-- program.
assertFileExistsReadable :: FilePath -> IO ()
assertFileExistsReadable file = do
    unlessM (Sys.doesFileExist file) $ do
        IO.hPutStrLn IO.stderr $ "No such file: "++file
        Sys.exitFailure
    unlessM (Sys.readable <$> Sys.getPermissions file) $ do
        IO.hPutStrLn IO.stderr $ "File not readable: "++file
        Sys.exitFailure


main :: IO ()
main = do
    files <- Sys.getArgs
    forM_ files $ \file -> do
        assertFileExistsReadable file
        countElements 0
            . filter (not . isWhitespace)
            . dropPreamble
            . SAX.parse SAX.defaultParseOptions
            =<< BL.readFile file
    where
    dropPreamble (StartElement t _ : xs) | t == T.pack "JMdict" = xs
    dropPreamble (_:xs) = dropPreamble xs
    dropPreamble [] = []

    isWhitespace (CharacterData c) | T.all isSpace c = True
    isWhitespace _                                   = False

countElements :: Int -> [SAXEvent T.Text T.Text] -> IO ()
countElements n [] = print n
countElements n xs =
    case anyElement xs of
        (Left  err, xs') -> fail $ show err ++ ": " ++ show (take 3 xs')
        (Right ell, xs') -> do
            print (n+1)
            (countElements $! n+1) xs'


data ElementError
= EmptyStream
| NoStartElement
| EndOfStream
| InvalidXML
deriving (Read, Show, Eq)

-- | Split an event stream into an initial element and the remainder
-- of the stream. Use 'DOM.saxToTree' to convert the element to a
-- tree.
anyElement
    :: (Eq tag)
    => [SAXEvent tag text]
    -> (Either ElementError [SAXEvent tag text], [SAXEvent tag text])
anyElement = start
    where
    start [] = (Left EmptyStream, [])
    start xxs@(x:xs) =
        case x of
            StartElement t _ -> loop [t] (x:) xs
            _                -> (Left NoStartElement, xxs)

    loop _  _ [] = (Left EndOfStream, [])
    loop [] k xs = (Right (k []), xs)
    loop tts@(t:ts) k xxs@(x:xs) = step (k . (x:)) xs
        where
        step =
            case x of
                StartElement t' _ -> loop (t':tts)
                EndElement   t'
                    | t' == t     -> loop ts
                    | otherwise   -> \_ _ -> (Left InvalidXML, xxs)
                _                 -> loop tts


--- fin.

-- 
Live well,
~wren


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] space leak when repeatedly calling Control.Monad.State.Strict.modify

2012-01-30 Thread Yves Parès
Have you tried to compile your code with optimisations? I guess GHC's
strictness analysis would find strict evaluation is better here.


2012/1/30 Joey Hess j...@kitenet.net

 Claude Heiland-Allen wrote:
  Control.Monad.State.Strict is strict in the actions, but the state
  itself is still lazy, so you end up building a huge thunk in the
  state containing all the updates that ever took place to the initial
  state.
 
  Using this should fix it:
 
  modify' :: MonadState s m => (s -> s) -> m ()
  modify' f = do
    x <- get
    put $! f x  -- force the new state when storing it

 Thanks!

 So, why does Control.Monad.State.Strict.modify not do that?

 And, I still don't quite understand why this only happened
 when the updated value is obtained using IO.

 --
 see shy jo

 ___
 Haskell-Cafe mailing list
 Haskell-Cafe@haskell.org
 http://www.haskell.org/mailman/listinfo/haskell-cafe


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] space leak when repeatedly calling Control.Monad.State.Strict.modify

2012-01-30 Thread Joey Hess
Yves Parès wrote:
 Have you tried to compile your code with optimisations? I guess GHC's
 strictness analysis would find strict evaluation is better here.

The original code I saw this happen to in the wild was built with -O2.
I didn't try building the test case with optimisations.

-- 
see shy jo


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] space leak when repeatedly calling Control.Monad.State.Strict.modify

2012-01-29 Thread Joey Hess
The attached test case quickly chews up hundreds of MB of memory.
If modified to call work' instead, it runs in constant space.

Somehow the value repeatedly read in from the file and stored in
the state is leaking. Can anyone help me understand why?

(ghc 7.0.4)

-- 
see shy jo
{-# LANGUAGE GeneralizedNewtypeDeriving, BangPatterns #-}

module Main where

import Control.Monad.State.Strict

data MyState = MyState { val :: String }

newtype Foo a = Foo { run :: StateT MyState IO a }
	deriving (
		Monad,
		MonadState MyState,
		MonadIO
	)

main :: IO ()
main = evalStateT (run test) (MyState "")

test :: Foo ()
test = mapM_ work [1..10]    -- massive memory leak
--test = mapM_ work' [1..10] -- constant space

readSomeFile :: Foo String
readSomeFile = liftIO $ readFileStrict "/etc/passwd"

work :: Integer -> Foo ()
work _ = do
    v <- readSomeFile
    modify $ \s -> s { val = v }

work' :: Integer -> Foo ()
work' n = do
    _ <- readSomeFile
    modify $ \s -> s { val = show n }

readFileStrict :: FilePath -> IO String
readFileStrict file = do
    s <- readFile file
    length s `seq` return s


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] space leak when repeatedly calling Control.Monad.State.Strict.modify

2012-01-29 Thread Claude Heiland-Allen

Hi,

On 30/01/12 01:07, Joey Hess wrote:

The attached test case quickly chews up hundreds of MB of memory.
If modified to call work' instead, it runs in constant space.

Somehow the value repeatedly read in from the file and stored in
the state is leaking. Can anyone help me understand why?


Control.Monad.State.Strict is strict in the actions, but the state 
itself is still lazy, so you end up building a huge thunk in the state 
containing all the updates that ever took place to the initial state.


Using this should fix it:

modify' :: MonadState s m => (s -> s) -> m ()
modify' f = do
  x <- get
  put $! f x  -- force the new state when storing it

With the attached code, the first case (using modify) prints out a trace 
like:


test
work:1
modify
work:2
modify
work:3
modify
work:4
modify
work:5
modify
work:6
modify
work:7
modify
work:8
modify
work:9
modify
work:10
modify
update:vnbz
update:dzgd
update:hzla
update:nudd
update:bzfl
update:muht
update:hims
update:jakj
update:lvrt
update:qdxo
initial
MyState {val = "vnbz"}

Notice how the state updates are only evaluated right at the end, when 
the value is forced - note also that this means that all the data needs 
to hang around until then.
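To make the shape of that huge thunk concrete, here is a tiny self-contained
illustration of my own (not part of the attached code): with the lazy modify,
the state after n updates is an unevaluated chain of n applications, collapsed
only when the final state is demanded.

import Control.Monad.State.Strict

-- After running this, the state is the chain (+1000) ((+999) (... ((+1) 0)))
-- until execState demands it; with modify' each step would be forced instead.
lazyGrow :: State Int ()
lazyGrow = mapM_ (\i -> modify (+ i)) [1 .. 1000 :: Int]

main :: IO ()
main = print (execState lazyGrow 0)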


The second case (using modify') forces the state as it goes along:

test'
work:1
modify'
update:zwre
initial
work:2
modify'
update:fefg
work:3
modify'
update:eoqa
work:4
modify'
update:xtak
work:5
modify'
update:tekd
work:6
modify'
update:qrsz
work:7
modify'
update:fdgj
work:8
modify'
update:alwj
work:9
modify'
update:kqsp
work:10
modify'
update:lazz
MyState {val = "lazz"}



Claude
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
module Main where

import Debug.Trace
import System.Random

import Control.Monad.State.Strict

modify' :: MonadState s m => (s -> s) -> m ()
modify' f = do
  x <- get
  put $! f x

data MyState = MyState { val :: String } deriving Show

newtype Foo a = Foo { run :: StateT MyState IO a }
	deriving (
		Monad,
		MonadState MyState,
		MonadIO
	)

main :: IO ()
main = do
  print =<< execStateT (run test) (trace "initial" $ MyState "")
  print =<< execStateT (run test') (trace "initial" $ MyState "")

test :: Foo ()
test = trace "test" $ mapM_ work [1..10]    -- massive memory leak

test' :: Foo ()
test' = trace "test'" $ mapM_ work' [1..10]

work :: Integer -> Foo ()
work n = trace ("work:"++show n) $ do
  v <- readSomeFile
  trace "modify" modify $ trace ("update:"++v) (\s -> s { val = v })

work' :: Integer -> Foo ()
work' n = trace ("work:"++show n) $ do
  v <- readSomeFile
  trace "modify'" modify' $ trace ("update:"++v) (\s -> s { val = v })

readSomeFile :: Foo String
readSomeFile = liftIO $ replicateM 4 (randomRIO ('a', 'z'))
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] space leak when repeatedly calling Control.Monad.State.Strict.modify

2012-01-29 Thread Joey Hess
Claude Heiland-Allen wrote:
 Control.Monad.State.Strict is strict in the actions, but the state
 itself is still lazy, so you end up building a huge thunk in the
 state containing all the updates that ever took place to the initial
 state.
 
 Using this should fix it:
 
 modify' :: MonadState s m => (s -> s) -> m ()
 modify' f = do
   x <- get
   put $! f x  -- force the new state when storing it

Thanks! 

So, why does Control.Monad.State.Strict.modify not do that?

And, I still don't quite understand why this only happened
when the updated value is obtained using IO.

-- 
see shy jo


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Space leak with unsafePerformIO

2010-06-28 Thread Henning Thielemann


On Sun, 27 Jun 2010, Henning Thielemann wrote:


Maybe I can combine splitAtLazy and (++) to a function like
 splitAtAndAppend :: [x] -> ([a] -> [b]) -> ([a] -> [b]) -> [a] -> [b]
but I'm afraid I will need pairs temporarily and then I run into the same 
problems.


I have now implemented a solution that is tailored to special function
types in the first ([a] -> [b]) argument. It works, but it is sad that the
general, more declarative solution is so fragile.

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Space leak with unsafePerformIO

2010-06-27 Thread Bertram Felgenhauer
Henning Thielemann wrote:
 Attached is a program with a space leak that I do not understand. I
 have coded a simple 'map' function, once using unsafePerformIO and
 once without. UnsafePerformIO has a space leak in some circumstances.
 In the main program I demonstrate cases with and without space leak.
 Without space leak the program writes a file to the disk until it
 is full. Any idea?

The program relies on the GC doing short-cut evaluation of record
selectors to avoid a space leak. If the user of the function
splitAtLazy

| splitAtLazy :: [b] -> [a] -> ([a],[a])
| splitAtLazy nt xt =
|    (\ ~(ys,zs) -> (ys,zs)) $
|    case (nt,xt) of
|       (_:ns, x:xs) ->
|          let (ys,zs) = splitAtLazy ns xs
|          in  (x:ys,zs)
|       (_, xs) -> ([],xs)

somehow holds on to a reference of the returned pair while processing
the first part of the list, there will be a space leak, because that
means that the whole prefix remains reachable.

splitAtLazy itself is not leaky, because the value returned by the
recursive call is scrutinized as follows,

  Main.$wsplitAtLazy =
  ...
  (# case ds_sLb of wild_B1 { (ys_agC, zs_agE) -> ys_agC },
     case ds_sLb of wild_B1 { (ys_agC, zs_agE) -> zs_agE } #)

and ghc turns that into record selector thunks in the code generator.

The precise rule can be found in compiler/codeGen/StgCmmBind.hs:

| Note [Selectors]
| ~~~
| We look at the body of the closure to see if it's a selector---turgid,
| but nothing deep.  We are looking for a closure of {\em exactly} the
| form:
| 
| ...  = [the_fv] \ u [] ->
|      case the_fv of
|        con a_1 ... a_n -> a_i

Now let's look at how the result of splitAtLazy is used.

non-leaky version (case 0):

  Main.lvl1 =
case Main.$wsplitAtLazy @ () @ GHC.Types.Char Main.xs Main.xs1
of ww_sLv { (# ww1_sLx, ww2_sLy #) ->
Main.go (GHC.Base.++ @ GHC.Types.Char ww1_sLx ww2_sLy)
}

The return values are passed on to (++) directly. The result pair is
actually never built at all, so no reference to it can be kept.

leaky version (case 3):

  Main.ds =
case Main.$wsplitAtLazy @ () @ GHC.Types.Char Main.xs Main.xs1
of ww_sKy { (# ww1_sKA, ww2_sKB #) ->
(ww1_sKA, ww2_sKB)
}

This builds the pair returned by splitAtLazy.

  Main.lvl1 =
case Main.ds of wild_Xw { (prefix_aCf, suffix_aCh) -> prefix_aCf }

Use of prefix: it's a record selector. This is fine.

  Main.lvl2 = Main.go Main.lvl1

The prefix is then passed to some worker function.

  Main.lvl3 =
case Main.ds of wild_Xw { (prefix_aCf, suffix_aCh) ->
Main.go1 suffix_aCh
}

Use of suffix: Due to the call of  Main.go1  this is *not* a record
selector. It is compiled to an actual case expression, which to the
garbage collector looks just like an ordinary thunk. A reference to
Main.ds is kept around until the suffix is about to be processed
and a memory leak ensues.

If the compiler had produced

  Main.lvl3 =
case Main.ds of wild_Xw { (prefix_aCf, suffix_aCh) ->
suffix_aCh
}

  Main.lvl4 = Main.go1 Main.lvl3

instead, then there would not be a leak. This whole record selector
thunk business is very fragile.

The good news is that the problem is completely unrelated to
unsafePerformIO (the presence of unsafePerformIO makes optimisations
more difficult, but any pure function of sufficient complexity would
have the same effect).

There's a simple fix for the problem, too: Change

   let (prefix, suffix) = makeTwoLists 'a'

to
  let !(prefix, suffix) = makeTwoLists 'a'

in which case the compiler produces code similar to the non-leaky case
for all alternatives.
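If the BangPatterns extension is not wanted, the same forcing can be spelled
with seq on the pair itself. A sketch of the equivalent change (my own,
untested; it reuses makeTwoLists and the succ functions from the attached
program, and the helper name is made up):

-- Equivalent to   let !(prefix, suffix) = makeTwoLists 'a'
-- without BangPatterns: force the pair constructor before using its parts.
leakFreeCase3 :: ([Char] -> [Char]) -> ([Char] -> [Char]) -> String
leakFreeCase3 succUnsafe succPlain =
   let ps = makeTwoLists 'a'
       (prefix, suffix) = ps
   in  ps `seq` (succUnsafe prefix ++ succPlain suffix)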

HTH,

Bertram
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Space leak with unsafePerformIO

2010-06-27 Thread Henning Thielemann


On Sun, 27 Jun 2010, Bertram Felgenhauer wrote:


If the compiler had produced

 Main.lvl3 =
    case Main.ds of wild_Xw { (prefix_aCf, suffix_aCh) ->
   suffix_aCh
   }

 Main.lvl4 = Main.go1 Main.lvl3

instead, then there would not be a leak. This whole record selector
thunk business is very fragile.


Bertram, thank you a lot for the detailed analysis! It seems that I have 
stepped into a nasty implementation detail of GHC.



The good news is that the problem is completely unrelated to
unsafePerformIO (the presence of unsafePerformIO makes optimisations
more difficult, but any pure function of sufficient complexity would
have the same effect).

There's a simple fix for the problem, too: Change


  let (prefix, suffix) = makeTwoLists 'a'


to
 let !(prefix, suffix) = makeTwoLists 'a'

in which case the compiler produces code similar to the non-leaky case
for all alternatives.


Actually this fix works for my example program and another one that is
based on chunky StorableVector. But when I switch off profiling, the
StorableVector example runs into a space leak again, while the one with
plain lists that I have sent here still works. I'll investigate this,
but it seems indeed very fragile, and I wonder whether there is a more
reliable way.


Maybe I can combine splitAtLazy and (++) to a function like
  splitAtAndAppend :: [x] -> ([a] -> [b]) -> ([a] -> [b]) -> [a] -> [b]
 but I'm afraid I will need pairs temporarily and then I run into the same 
problems.

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] Space leak with unsafePerformIO

2010-06-26 Thread Henning Thielemann
Attached is a program with a space leak that I do not understand. I have
coded a simple 'map' function, once using unsafePerformIO and once
without. UnsafePerformIO has a space leak in some circumstances. In the
main program I demonstrate cases with and without space leak. Without
space leak the program writes a file to the disk until it is full. Any idea?

The original problem is a function that is compiled by LLVM and shall be
applied to a list in a mapAccumL manner.
{-
$ ghc --make -Wall -O -prof -auto-all InterleaveIO.hs
$ InterleaveIO +RTS -M1m -prof-all -RTS
-}
import System.IO.Unsafe (unsafePerformIO, unsafeInterleaveIO, )

makeSuccUnsafe1 :: IO ([Char] -> [Char])
makeSuccUnsafe1 =
   return $
   \ sig -> unsafePerformIO $ do
      let go xt =
             unsafeInterleaveIO $
             case xt of
                [] -> return []
                x:xs -> fmap (succ x :) $ go xs
      go sig

makeSuccUnsafe :: IO ([Char] -> [Char])
makeSuccUnsafe =
   return $
   \ sig ->
      let go xt =
             unsafePerformIO $
             case xt of
                [] -> return []
                x:xs -> return (succ x : go xs)
      in  go sig

makeSuccPlain :: IO ([Char] -> [Char])
makeSuccPlain =
   return $
   \ sig ->
      let go xt =
             case xt of
                [] -> []
                x:xs -> succ x : go xs
      in  go sig

splitAtLazy :: [b] -> [a] -> ([a],[a])
splitAtLazy nt xt =
   (\ ~(ys,zs) -> (ys,zs)) $
   case (nt,xt) of
      (_:ns, x:xs) ->
         let (ys,zs) = splitAtLazy ns xs
         in  (x:ys,zs)
      (_, xs) -> ([],xs)


makeTwoLists :: Char -> ([Char], [Char])
makeTwoLists c =
   splitAtLazy (repeat ()) $ repeat c

main :: IO ()
main = do
   succUnsafe <- makeSuccUnsafe
   succPlain  <- makeSuccPlain
   writeFile "test.txt" $
      let (prefix, suffix) = makeTwoLists 'a'
      in  case 3::Int of
             -- no leak
             0 -> succUnsafe $ prefix ++ suffix
             -- no leak
             1 -> succPlain  $ prefix ++ suffix
             -- no leak
             2 -> succPlain  prefix ++ succPlain  suffix
             -- leak
             3 -> succUnsafe prefix ++ succPlain  suffix
             -- no leak
             4 -> succPlain  prefix ++ succUnsafe suffix
             -- leak
             5 -> succUnsafe prefix ++ succUnsafe suffix
             _ -> error "not implemented"
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Space leak

2010-03-15 Thread Ketil Malde
Arnoldo Muller arnoldomul...@gmail.com writes:

 I am trying to use haskell in the analysis of bio data. One of the main
 reasons I wanted to use haskell is because lazy I/O allows you to see a
 large bio-sequence as if it was a string in memory.

Funny you should mention it.  I've written a bioinformatics library¹ that
(naturally) supports reading and writing various file formats for
sequences and alignments and stuff.

Some of these files can be substantial in size (i.e., larger than my
laptop's memory), so most IO of potentially large files (Fasta, BLAST
XML output, 454 SFF files...) is done lazily, and large Fasta sequences
are read as lazy bytestrings.

This works nicely for a lot of use cases (well, my use cases, at any
rate, which quite often boil down to streaming through the data).  One
thing to look out for is O(n) indexed access to lazy bytestrings, so
there's a defragment operation that converts a sequence to a single
chunk (which gives O(1) access, but of course must fit into memory).
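The idea behind such a defragment step (sketched here with plain bytestring
functions, not necessarily the library's actual definition) is just to
concatenate the lazy chunks into one strict chunk, trading memory residency
for O(1) indexing:

import qualified Data.ByteString      as B
import qualified Data.ByteString.Lazy as BL

-- Result is a single chunk, so it must fit in memory, as noted above.
defragment :: BL.ByteString -> B.ByteString
defragment = B.concat . BL.toChunks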

I guess the most annoying thing about laziness is that small test cases
always work; you need Real Data to stress-test your programs for
excessive memory use.

Lazy IO has always worked well for me, so although I feel I should look more
deeply into real solutions, like Iteratee, my half-hearted attempts to
do so have only resulted in the conclusion that it was more complicated,
and thus postponed for some rainy day... lazy IO for lazy programmers, I
guess.

-k

¹ Stuff's on Hackage in the bioinformatics section and also on
http://blog.malde.org and http://malde.org/~ketil/bioinformatics.
-- 
If I haven't seen further, it is by standing in the footprints of giants
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Space leak

2010-03-13 Thread Jason Dagit
On Thu, Mar 11, 2010 at 3:44 PM, Arnoldo Muller arnoldomul...@gmail.comwrote:

 Daniel,

 Thank you so much for helping me out with this issue!

 Thanks to all the other answers from haskel-cafe members too!

 As a newbie, I am not able to understand why zip and map would make a
 problem...

 Is there any link I could read that could help me to understand why in this
 case
 zip and map created a leak? What are some function compositions that should
 be
 avoided when doing lazy I/O?


Actually, it's lazy I/O itself that should be avoided.

Jason
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Space leak

2010-03-13 Thread Jason Dagit
On Wed, Mar 10, 2010 at 2:03 PM, Arnoldo Muller arnoldomul...@gmail.comwrote:

 Hello Justin,

 I tried and what I saw was a constant increase in memory usage.
 Any particular profiling option that you would use?


A great place to get started with profiling is the chapter in Real-World
Haskell:
http://book.realworldhaskell.org/read/profiling-and-optimization.html

For a problem like this I would look at general heap profiling (-hc),
retainer profiling (-hr), and also type profiling (-hy) to see if any of
them provide new insight.  For example, -hc might tell you which functions
are problematic, but -hr is more likely to help you there.  Sometimes the
specific type that is leaking is not what you think and that's why -hy is
nice.
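Concretely, a typical run looks something like this (the program name and
arguments here are made up; the flags are the ones discussed above, spelled
for the GHC of that era, where the option was still called -auto-all):

$ ghc --make -O2 -prof -auto-all Main.hs
$ ./Main input/ out.txt 100 +RTS -p -hr -RTS     # or -hc / -hy for other heap breakdowns
$ hp2ps -c Main.hp                               # renders the heap profile to Main.ps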

I hope that helps,
Jason
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Space leak

2010-03-13 Thread Arnoldo Muller
Jason,

I am trying to use haskell in the analysis of bio data. One of the main
reasons I wanted to use haskell is because lazy I/O allows you to see a
large bio-sequence as if it was a string in memory.
In order to achieve the same result in an imperative language I would have
to write lots of error-prone iterators. I saw lazy I/O as a very strong
point in favor of Haskell.

Besides the space leaks that can occur and that are a bit difficult to find
for a newbie like me, are there any other reasons to avoid Lazy I/O?

Arnoldo.

On Sat, Mar 13, 2010 at 6:46 PM, Jason Dagit da...@codersbase.com wrote:



 On Thu, Mar 11, 2010 at 3:44 PM, Arnoldo Muller 
 arnoldomul...@gmail.comwrote:

 Daniel,

 Thank you so much for helping me out with this issue!

 Thanks to all the other answers from haskel-cafe members too!

 As a newbie, I am not able to understand why zip and map would make a
 problem...

 Is there any link I could read that could help me to understand why in
 this case
 zip and map created a leak? What are some function compositions that
 should be
 avoided when doing lazy I/O?


 Actually, it's lazy I/O itself that should be avoided.

 Jason

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Space leak

2010-03-13 Thread David Leimbach
On Sat, Mar 13, 2010 at 3:58 PM, Arnoldo Muller arnoldomul...@gmail.comwrote:

 Jason,

 I am trying to use haskell in the analysis of bio data. One of the main
 reasons I wanted to use haskell is because lazy I/O allows you to see a
 large bio-sequence as if it was a string in memory.
 In order to achieve the same result in an imperative language I would have
 to write lots of error-prone iterators. I saw lazy I/O as a very strong
 point in favor of Haskell.


There's a safer lazy IO lib in Hackage:

http://hackage.haskell.org/package/safe-lazy-io

It seems the safer approach, though somewhat more confusing to some people,
is the Iteratee pattern.

The reasons why have probably been explained best on a paper on Oleg's site.



 Besides the space leaks that can occur and that are a bit difficult to find
 for a newbie like me, are there any other reasons to avoid Lazy I/O?


Perhaps these two links will enlighten you.   They did for me, and I'm now
working out how exactly to convert a really inefficient but explicit IO
program (char by char right now... yuck) to an Iteratee based parsing
situation on a work-related project.  Hopefully I'll be doing this all this
coming week, and I'll be able to publish some results on my blog.  (things
come up though a lot at work, so I'm keeping my fingers crossed on this
one).

http://okmij.org/ftp/Haskell/Iteratee/Lazy-vs-correct.txt
http://okmij.org/ftp/Streams.html

Dave


 Arnoldo.


 On Sat, Mar 13, 2010 at 6:46 PM, Jason Dagit da...@codersbase.com wrote:



 On Thu, Mar 11, 2010 at 3:44 PM, Arnoldo Muller 
 arnoldomul...@gmail.comwrote:

 Daniel,

 Thank you so much for helping me out with this issue!

 Thanks to all the other answers from haskel-cafe members too!

 As a newbie, I am not able to understand why zip and map would make a
 problem...

 Is there any link I could read that could help me to understand why in
 this case
 zip and map created a leak? What are some function compositions that
 should be
 avoided when doing lazy I/O?


 Actually, it's lazy I/O itself that should be avoided.

 Jason



 ___
 Haskell-Cafe mailing list
 Haskell-Cafe@haskell.org
 http://www.haskell.org/mailman/listinfo/haskell-cafe


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Space leak

2010-03-13 Thread Rafael Almeida
On Sat, Mar 13, 2010 at 8:58 PM, Arnoldo Muller arnoldomul...@gmail.com wrote:
 Jason,

 I am trying to use haskell in the analysis of bio data. One of the main
 reasons I wanted to use haskell is because lazy I/O allows you to see a
 large bio-sequence as if it was a string in memory.
 In order to achieve the same result in an imperative language I would have
 to write lots of error-prone iterators. I saw lazy I/O as a very strong
 point in favor of Haskell.


What about the mmap function? It's available on Linux, and you can use it
from C and probably a lot of other imperative languages. Was
non-portability the issue with using it?
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Space leak

2010-03-13 Thread Brandon S. Allbery KF8NH

On Mar 13, 2010, at 18:58 , Arnoldo Muller wrote:
In order to achieve the same result in an imperative language I  
would have to write lots of error-prone iterators. I saw lazy I/O as  
a very strong point in favor of Haskell.


Besides the space leaks that can occur and that are a bit difficult  
to find for a newbie like me, are there any other reasons to avoid  
Lazy I/O?



The biggest problem is that it is completely impossible to detect,  
much less recover from, lazy I/O errors.  (You could theoretically  
force the result under control of evaluate, thus putting it back in  
IO, but then you lose all the laziness you wanted.  Exceptions, in  
particular I/O exceptions, are by definition impure; so pure code can  
neither recognize nor deal with them.)
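A sketch of the workaround hinted at above: forcing the lazily read contents
under evaluate pulls the errors back into IO, where try can see them, at the
cost of reading the whole file strictly (illustrative only; the function name
is made up):

import Control.Exception (SomeException, evaluate, try)

readWholeFile :: FilePath -> IO (Either SomeException String)
readWholeFile path = try $ do
    s <- readFile path
    _ <- evaluate (length s)  -- force every chunk now, so read errors surface here
    return s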


--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allb...@kf8nh.com
system administrator [openafs,heimdal,too many hats] allb...@ece.cmu.edu
electrical and computer engineering, carnegie mellon universityKF8NH




___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Space leak

2010-03-13 Thread Daniel Fischer
On Sunday, 14 March 2010, 00:58:09, Arnoldo Muller wrote:
 Jason,

 I am trying to use haskell in the analysis of bio data. One of the main
 reasons I wanted to use haskell is because lazy I/O allows you to see a
 large bio-sequence as if it was a string in memory.
 In order to achieve the same result in an imperative language I would
 have to write lots of error-prone iterators. I saw lazy I/O as a very
 strong point in favor of Haskell.

 Besides the space leaks that can occur and that are a bit difficult to
 find for a newbie like me, are there any other reasons to avoid Lazy
 I/O?

You may be happy to hear that the space leak you encountered had 
__nothing whatsoever__ to do with lazy IO.

It's true that lazy IO offers some pitfalls for the unwary (and some, but 
much fewer, for the wary), but I think the dangers of lazy IO tend to be 
exaggerated.
For your application, readFile and appendFile are absolutely fine, the 
space leak occurred in the pure code.

Below is a variant of your programme that doesn't use file IO; the one
readFasta function has the space leak, the currently commented-out one does not.
Compile with -O2 and run with e.g.

./leak +RTS -s -M400M -RTS 3 1000 10

One runs in constant space, the other does not.


 Arnoldo.

--
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import Data.List
import System.Environment (getArgs)

data Chromosome =    C1
                | C2
                | C3
                | C4
                | C5
                | C6
                | C7
                | C8
                | C9
                | C10
                | C11
                | C12
                | C13
                | C14
                | C15
                | C16
                | C17
                | C18
                | C19
                | CX
                | CY
                | CMT
                  deriving (Show)

type Sequence = [Char]

data Window = Window { sequen :: Sequence,
                       chrom :: Chromosome,
                       pos   :: Int
                     }

instance Show Window where
    show w = (sequen w) ++ "\t" ++ show (chrom w) ++ "\t" ++ show (pos w)

main = do
       [ct, len, windowSize] <- getArgs
       let wSize = (read windowSize)::Int
           ln = read len
           inData = [(cn,ln) | cn <- [1 .. read ct]]
       mapM_ (uncurry $ genomeExecute filterWindow wSize) inData

countLines :: String -> Int
countLines = go 0
  where
go !acc [] = acc
go !acc ('\n':cs) = go (acc+1) cs
go !acc (_:cs ) = go acc cs

genomeExecute :: (Window -> Bool) -> Int -> Int -> Int -> IO ()
genomeExecute flt wSize cn ln =
    print . countLines $ fastaExtractor
        ("chromosome " ++ show cn ++ ",\n" ++ replicate (cn*ln) 'A')
        wSize flt

fastaExtractor :: String -> Int -> (Window -> Bool) -> String
fastaExtractor input wSize f = printWindowList $ filter f $ readFasta wSize input

filterWindow :: Window - Bool
filterWindow w = not (elem 'N' (sequen w))

printWindowList :: [Window] -> String
printWindowList l = unlines $ map show l

{-
readFasta :: Int -> [Char] -> [Window]
readFasta windowSize sequence =
    let (header,rest) = span (/= '\n') sequence
        chr = parseChromosome header
        go i (w:ws) = Window w chr i : go (i+1) ws
        go _ [] = []
    in go 0 $ slideWindow windowSize $ filter (/= '\n') rest
-}

readFasta :: Int -> [Char] -> [Window]
readFasta windowSize sequence =
    let (header,rest) = span (/= '\n') sequence
        chr = parseChromosome header
    in map (\(i, w) -> Window w chr i) $ zip [0..] $
       slideWindow windowSize $ filter ('\n' /=) rest


slideWindow :: Int -> [Char] -> [[Char]]
slideWindow _ [] = []
slideWindow windowSize l@(_:xs) = take windowSize l : slideWindow windowSize xs

parseChromosome :: [Char] -> Chromosome
parseChromosome line
    | isInfixOf "chromosome 1," line = C1
    | isInfixOf "chromosome 2," line = C2
    | isInfixOf "chromosome 3," line = C3
    | isInfixOf "chromosome 4," line = C4
    | isInfixOf "chromosome 5," line = C5
    | isInfixOf "chromosome 6," line = C6
    | isInfixOf "chromosome 7," line = C7
    | isInfixOf "chromosome 8," line = C9
    | isInfixOf "chromosome 10," line = C10
    | isInfixOf "chromosome 11," line = C11
    | isInfixOf "chromosome 12," line = C12
    | isInfixOf "chromosome 13," line = C13
    | isInfixOf "chromosome 14," line = C14
    | isInfixOf "chromosome 15," line = C15
    | isInfixOf "chromosome 16," line = C16
    | isInfixOf "chromosome 17" line = C17
    | isInfixOf "chromosome 18" line = C18
    | isInfixOf "chromosome 19" line = C19
    | isInfixOf "chromosome X" line = CX
    | isInfixOf "chromosome Y" line = CY
    | isInfixOf "mitochondrion" line = CMT
    | otherwise = error "BAD header"
--
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org

Re: [Haskell-cafe] Space leak

2010-03-11 Thread Stephen Tetley
Hi Arnoldo

This doesn't address the space leak, but your parseChromosome function
looks very inefficient - isInfixOf is repeatedly checking the prefix
"chromosome" for C1 to CY. If you have a lot of CY's in a file then it
will do a lot of work parsing them.

The cleanest way of handling this would be to use a parser combinator
library with keywords for "chromosome" and "mitochondrion" - however
that might add a performance penalty itself.
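For reference, a rough sketch of that parser-combinator route (assuming
Parsec, and the Chromosome type with the Enum deriving added later in this
message; untested and not tuned for speed):

import Text.ParserCombinators.Parsec

-- Parse a header keyword into a Chromosome; digits are mapped via toEnum,
-- which is 0-indexed, hence the (- 1).
chromosomeP :: Parser Chromosome
chromosomeP = (try (string "mitochondrion") >> return CMT)
          <|> (string "chromosome " >> numbered)
  where
    numbered = (char 'X' >> return CX)
           <|> (char 'Y' >> return CY)
           <|> fmap (\n -> toEnum (read n - 1)) (many1 digit)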

Here is a version that should be fairly efficient although a little
ugly due to how it has to match literal chars in prefix of the string:

Add a import for Data.Char to the import list:

 import Data.Char

Add Enum to the deriving clause of the Chromosome data type:

| C19
| CX
| CY
| CMT
  deriving (Show,Enum)


Replace parseChromosome with the one below.

Note that the derived Enum functions for Chromosome are indexed from
0.. whereas when you read one from the file it is indexed from 1.. so
you have to sub1 before using toEnum:


sub1 :: Int -> Int
sub1 x = x-1

parseChromosome :: [Char] -> Chromosome
parseChromosome ('c':'h':'r':'o':'m':'o':'s':'o':'m':'e':' ':xs) = chro xs
  where
   chro ('X' :_)  = CX
   chro ('Y' :_)  = CY
   chro ( x  : ',' :_)        | isDigit x = toEnum (sub1 $ digitToInt x)
   chro ('1' :  x  : ',' :_ ) | isDigit x = toEnum (sub1 $ (10+) $ digitToInt x)
   chro ('1' :  x  :_ )       | isDigit x = toEnum (sub1 $ (10+) $ digitToInt x)
   chro _ = error "BAD header"

parseChromosome ('m':'i':'t':'o':'c':'h':'o':'n':'d':'r':'i':'o':'n':_) = CMT
parseChromosome _ = error "BAD header"



Best wishes

Stephen
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Space leak

2010-03-11 Thread Daniel Fischer
On Thursday, 11 March 2010, 00:24:28, Daniel Fischer wrote:
 Hmm, offhand, I don't see why that isn't strict enough.

Turns out, mapM_ was a red herring. The villain was (zip and map).
I must confess, I don't know why it sort-of worked without the mapM_,
though. "Sort-of", because that also hung on to unnecessarily much memory;
the space leak was just smaller than with the mapM_.

A very small change that eliminates the space leak is

readFasta :: Int -> [Char] -> [Window]
readFasta windowSize sequence =
    -- get the header
    let (header,rest) = span (/= '\n') sequence
        chr = parseChromosome header
        go i (w:ws) = Window w chr i : go (i+1) ws
        go _ [] = []
    in go 0 $ slideWindow windowSize $ filter (/= '\n') rest

You can improve performance by eliminating slideWindow and the intermediate 
Window list (merging fastaExtractor and readFasta), 

{-# LANGUAGE BangPatterns #-}

readFasta2 :: (String -> Bool) -> Int -> String -> String
readFasta2 test windowSize sequence =
    let (header,rest) = span (/= '\n') sequence
        chr = parseChromosome header
        schr = show chr
        go !i st@(_:tl)
            | test w    = w ++ '\t' : schr ++ '\t' : show i ++ '\n' : go (i+1) tl
            | otherwise = go (i+1) tl
          where
            w = take windowSize st
        go _ [] = []
    in go 0 (filter (/= '\n') rest)

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Space leak

2010-03-11 Thread Arnoldo Muller
Daniel,

Thank you so much for helping me out with this issue!

Thanks to all the other answers from haskel-cafe members too!

As a newbie, I am not able to understand why zip and map would make a
problem...

Is there any link I could read that could help me to understand why in this
case
zip and map created a leak? What are some function compositions that should
be
avoided when doing lazy I/O?

Regards,

Arnoldo


On Thu, Mar 11, 2010 at 11:46 PM, Daniel Fischer
daniel.is.fisc...@web.dewrote:

 On Thursday, 11 March 2010, 00:24:28, Daniel Fischer wrote:
  Hmm, offhand, I don't see why that isn't strict enough.

 Turns out, mapM_ was a red herring. The villain was (zip and map).
 I must confess, I don't know why it sort-of worked without the mapM_,
 though. "Sort-of", because that also hung on to unnecessarily much memory;
 the space leak was just smaller than with the mapM_.

 A very small change that eliminates the space leak is

 readFasta :: Int -> [Char] -> [Window]
 readFasta windowSize sequence =
     -- get the header
     let (header,rest) = span (/= '\n') sequence
         chr = parseChromosome header
         go i (w:ws) = Window w chr i : go (i+1) ws
         go _ [] = []
     in go 0 $ slideWindow windowSize $ filter (/= '\n') rest

 You can improve performance by eliminating slideWindow and the intermediate
 Window list (merging fastaExtractor and readFasta),

 {-# LANGUAGE BangPatterns #-}

 readFasta2 :: (String -> Bool) -> Int -> String -> String
 readFasta2 test windowSize sequence =
     let (header,rest) = span (/= '\n') sequence
         chr = parseChromosome header
         schr = show chr
         go !i st@(_:tl)
             | test w    = w ++ '\t' : schr ++ '\t' : show i ++ '\n' : go (i+1) tl
             | otherwise = go (i+1) tl
           where
             w = take windowSize st
         go _ [] = []
     in go 0 (filter (/= '\n') rest)


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] Space leak

2010-03-10 Thread Arnoldo Muller
Hello,

I am learning haskell and I found a space leak that I find difficult to
solve. I've been asking at #haskell but we could not solve
the issue.

I want to lazily read a set of 22 files of about 200MB each, filter them and
then I want to output the result into a unique file.
If I modify the main function to work only with one input file,  the program
runs without issues. I will call this version A.
Version B  uses a mapM_ to iterate over a list of filenames and uses
appendFile to output the result of filtering each file.
In this case the memory usage grows sharply and quickly (profiles show
constant memory growth). In less than a minute, memory
occupation will make my system hang with swapping.

This is version B:

--- Program B

import Data.List
import System.Environment
import System.Directory
import Control.Monad


-- different types of chromosomes
data Chromosome =C1
| C2
| C3
| C4
| C5
| C6
| C7
| C8
| C9
| C10
| C11
| C12
| C13
| C14
| C15
| C16
| C17
| C18
| C19
| CX
| CY
| CMT
  deriving (Show)
-- define a window
type Sequence = [Char]
-- Window data
data Window = Window { sequen :: Sequence,
   chrom :: Chromosome,
   pos   :: Int
 }
-- print a window
instance Show Window where
    show w = (sequen w) ++ "\t" ++ show (chrom w) ++ "\t" ++ show (pos w)

-- Reading fasta files with haskell

-- Initialize the
main = do
    -- get the arguments (input, output, window size)
    [input, output, windowSize] <- getArgs
    -- get directory contents (only names)
    names <- getDirectoryContents input
    -- prepend directory
    let fullNames = filter isFastaFile $ map (\x -> input ++ "/" ++ x) names
    let wSize = (read windowSize)::Int
    -- process the directories
    mapM (genomeExecute output wSize filterWindow) fullNames


-- read the files one by one and write them to the output file
genomeExecute :: String -> Int -> (Window -> Bool) -> String -> IO ()
genomeExecute outputFile windowSize f inputFile = do
    fileData <- readFile inputFile
    appendFile outputFile $ fastaExtractor fileData windowSize f

-- 
isFastaFile :: String -> Bool
isFastaFile fileName = isSuffixOf ".fa" fileName


-- fasta extractor (receives a Fasta String and returns a windowed string
ready to be sorted)
-- an example on how to compose several functions to parse a fasta file
fastaExtractor :: String -> Int -> (Window -> Bool) -> String
fastaExtractor input wSize f = printWindowList $ filter f $ readFasta wSize input

-- MAIN FILTER that removes N elements from the strings!
filterWindow :: Window - Bool
filterWindow w = not (elem 'N' (sequen w))

-- print a window list (the printing makes it ready for output as raw data)
printWindowList :: [Window] -> String
printWindowList l = unlines $ map show l

-- read fasta, remove stuff that is not useful from it
-- removes the
readFasta :: Int -> [Char] -> [Window]
readFasta windowSize sequence =
    -- get the header
    let (header:rest) = lines sequence
        chr = parseChromosome header
    in
    -- We now do the following:
    --   take a window, create a counter, remove newlines
    map (\(i, w) -> Window w chr i) $ zip [0..] $ slideWindow windowSize $
    filter ('\n' /=) $ unlines rest


slideWindow :: Int -> [Char] -> [[Char]]
slideWindow _ [] = []
slideWindow windowSize l@(_:xs) = take windowSize l : slideWindow windowSize xs



-- Parse the chromosome from a fasta comment
-- produce a more compact chromosome representation
parseChromosome :: [Char] -> Chromosome
parseChromosome line
    | isInfixOf "chromosome 1," line = C1
    | isInfixOf "chromosome 2," line = C2
    | isInfixOf "chromosome 3," line = C3
    | isInfixOf "chromosome 4," line = C4
    | isInfixOf "chromosome 5," line = C5
    | isInfixOf "chromosome 6," line = C6
    | isInfixOf "chromosome 7," line = C7
    | isInfixOf "chromosome 8," line = C9
    | isInfixOf "chromosome 10," line = C10
    | isInfixOf "chromosome 11," line = C11
    | isInfixOf "chromosome 12," line = C12
    | isInfixOf "chromosome 13," line = C13
    | isInfixOf "chromosome 14," line = C14
    | isInfixOf "chromosome 15," line = C15
    | isInfixOf "chromosome 16," line = C16
    | isInfixOf "chromosome 17" line = C17
    | isInfixOf "chromosome 18" line = C18
    | isInfixOf "chromosome 19" line = C19
    | isInfixOf "chromosome X" line = CX
    | isInfixOf "chromosome Y" line = CY
    | isInfixOf "mitochondrion" line = CMT
    | otherwise = error "BAD header"


 End of program B

Re: [Haskell-cafe] Space leak

2010-03-10 Thread Bulat Ziganshin
Hello Arnoldo,

Wednesday, March 10, 2010, 11:45:56 PM, you wrote:

 I am learning haskell and I found a space leak that I find
 difficult to solve. I've been asking at #haskell but we could not solve
 the issue.

make some experiments - leave only one file and use version A, then
replace appendFile with writeFile

-- 
Best regards,
 Bulatmailto:bulat.zigans...@gmail.com

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Space leak

2010-03-10 Thread Daniel Fischer
On Wednesday, 10 March 2010, 21:45:56, Arnoldo Muller wrote:
 Hello,

 I am learning haskell and I found a space leak that I find difficult to
 solve. I've been asking at #haskell but we could not solve
 the issue.

 I want to lazily read a set of 22 files of about 200MB each, filter them
 and then I want to output the result into a unique file.
 If I modify the main function to work only with one input file,  the
 program runs without issues. I will call this version A.
 Version B  uses a mapM_ to iterate over a list of filenames and uses
 appendFile to output the result of filtering each file.
 In this case the memory usage grows sharply and quickly (profiles show
 constant memory growth). In less than a minute, memory
 occupation will make my system hang with swapping.

No work is done until the end, when everything is attempted
simultaneously. Make sure genomeExecute ... input1 has actually finished
its work before genomeExecute ... input2 starts, etc.

One way is to use a stricter version of sequence_,

sequence'_ :: Monad m => [m a] -> m ()
sequence'_ (x:xs) = do
    a <- x
    a `seq` sequence'_ xs
sequence'_ [] = return ()

(nicer with BangPatterns, but not portable), and

mapM'_ f = sequence'_ . map f
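With that, the driver call in main becomes (substituting into your programme;
untested):

mapM'_ (genomeExecute output wSize filterWindow) fullNames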

Another option is making genomeExecute itself stricter.


 This is version B:

 --- Program B
 
 import Data.List
 import System.Environment
 import System.Directory
 import Control.Monad


 -- different types of chromosomes
 data Chromosome =C1

 | C2
 | C3
 | C4
 | C5
 | C6
 | C7
 | C8
 | C9
 | C10
 | C11
 | C12
 | C13
 | C14
 | C15
 | C16
 | C17
 | C18
 | C19
 | CX
 | CY
 | CMT

   deriving (Show)
 -- define a window
 type Sequence = [Char]
 -- Window data
 data Window = Window { sequen :: Sequence,
chrom :: Chromosome,
pos   :: Int
  }
 -- print a window
 instance Show Window where
 show w =  (sequen w) ++ \t ++ show (chrom w) ++ \t ++ show (pos
 w)

 -- Reading fasta files with haskell

 -- Initialize the
 main = do
-- get the arguments (intput is
[input, output, windowSize] - getArgs
-- get directory contents (only names)
names - getDirectoryContents input
-- prepend directory
let fullNames = filter isFastaFile $ map (\x - input ++ / ++
 x) names
let wSize = (read windowSize)::Int
-- process the directories
mapM (genomeExecute output wSize filterWindow)  fullNames


 -- read the files one by one and write them to the output file
 genomeExecute :: String - Int - (Window - Bool) - String - IO ()
 genomeExecute  outputFile windowSize f inputFile = do
   fileData - readFile inputFile
   appendFile outputFile $ fastaExtractor fileData windowSize f


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Space leak

2010-03-10 Thread Arnoldo Muller
Hello Daniel:

Thanks!
I employed mapM'_ but I am still getting the space leak.
Any other hint?



Arnoldo

On Wed, Mar 10, 2010 at 10:40 PM, Daniel Fischer
daniel.is.fisc...@web.dewrote:

 On Wednesday, 10 March 2010, 21:45:56, Arnoldo Muller wrote:
  Hello,
 
  I am learning haskell and I found a space leak that I find difficult to
  solve. I've been asking at #haskell but we could not solve
  the issue.
 
  I want to lazily read a set of 22 files of about 200MB each, filter them
  and then I want to output the result into a unique file.
  If I modify the main function to work only with one input file,  the
  program runs without issues. I will call this version A.
  Version B  uses a mapM_ to iterate over a list of filenames and uses
  appendFile to output the result of filtering each file.
  In this case the memory usage grows sharply and quickly (profiles show
  constant memory growth). In less than a minute, memory
  occupation will make my system hang with swapping.

 No work is done until the end, when everything is attempted
 simultaneously. Make sure genomeExecute ... input1 has actually finished
 its work before genomeExecute ... input2 starts, etc.

 One way is to use a stricter version of sequence_,

 sequence'_ :: Monad m => [m a] -> m ()
 sequence'_ (x:xs) = do
    a <- x
    a `seq` sequence'_ xs
 sequence'_ [] = return ()

 (nicer with BangPatterns, but not portable), and

 mapM'_ f = sequence'_ . map f

 Another option is making genomeExecute itself stricter.

 
  This is version B:
 
  --- Program B
  
  import Data.List
  import System.Environment
  import System.Directory
  import Control.Monad
 
 
  -- different types of chromosomes
  data Chromosome =C1
 
  | C2
  | C3
  | C4
  | C5
  | C6
  | C7
  | C8
  | C9
  | C10
  | C11
  | C12
  | C13
  | C14
  | C15
  | C16
  | C17
  | C18
  | C19
  | CX
  | CY
  | CMT
 
deriving (Show)
  -- define a window
  type Sequence = [Char]
  -- Window data
  data Window = Window { sequen :: Sequence,
 chrom :: Chromosome,
 pos   :: Int
   }
  -- print a window
  instance Show Window where
  show w =  (sequen w) ++ \t ++ show (chrom w) ++ \t ++ show (pos
  w)
 
  -- Reading fasta files with haskell
 
  -- Initialize the
  main = do
 -- get the arguments (intput is
 [input, output, windowSize] - getArgs
 -- get directory contents (only names)
 names - getDirectoryContents input
 -- prepend directory
 let fullNames = filter isFastaFile $ map (\x - input ++ / ++
  x) names
 let wSize = (read windowSize)::Int
 -- process the directories
 mapM (genomeExecute output wSize filterWindow)  fullNames
 
 
  -- read the files one by one and write them to the output file
  genomeExecute :: String - Int - (Window - Bool) - String - IO ()
  genomeExecute  outputFile windowSize f inputFile = do
fileData - readFile inputFile
appendFile outputFile $ fastaExtractor fileData windowSize f



___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Space leak

2010-03-10 Thread Arnoldo Muller
Hello Bulat,

I ran program A with writeFile instead of appendFile and it still works
without problems.
Regarding program B, if I use writeFile the leaking still occurs.

Any other hints? :)

Arnoldo

On Wed, Mar 10, 2010 at 10:32 PM, Bulat Ziganshin bulat.zigans...@gmail.com
 wrote:

 Hello Arnoldo,

 Wednesday, March 10, 2010, 11:45:56 PM, you wrote:

  I am learning haskell and I found a space leak that I find
  difficult to solve. I've been asking at #haskell but we could not solve
  the issue.

 make some experiments - leave only one file and use version A, then
 replace appendFile with writeFile

 --
 Best regards,
  Bulatmailto:bulat.zigans...@gmail.com


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Space leak

2010-03-10 Thread Arnoldo Muller
Hello Justin,

I tried and what I saw was a constant increase in memory usage.
Any particular profiling option that you would use?

I do remember that there was a particular option in which the leak would
dissapear (for the same amount of work) and that is why I stopped using the
profiler.

Thanks,

Arnoldo


On Wed, Mar 10, 2010 at 10:20 PM, Justin Bailey jgbai...@gmail.com wrote:

 Have you use the profiling tools available with GHC?

  http://haskell.org/ghc/docs/latest/html/users_guide/profiling.html


 On Wed, Mar 10, 2010 at 12:45 PM, Arnoldo Muller
 arnoldomul...@gmail.com wrote:
  Hello,
 
  I am learning haskell and I found a space leak that I find difficult to
  solve. I've been asking at #haskell but we could not solve
  the issue.
 
  I want to lazily read a set of 22 files of about 200MB each, filter them
 and
  then I want to output the result into a unique file.
  If I modify the main function to work only with one input file,  the
 program
  runs without issues. I will call this version A.
  Version B  uses a mapM_ to iterate over a list of filenames and uses
  appendFile to output the result of filtering each file.
  In this case the memory usage grows sharply and quickly (profiles show
  constant memory growth). In less than a minute, memory
  occupation will make my system hang with swapping.
 
  This is version B:
 
  --- Program B
 
 
  import Data.List
  import System.Environment
  import System.Directory
  import Control.Monad
 
 
  -- different types of chromosomes
  data Chromosome =C1
  | C2
  | C3
  | C4
  | C5
  | C6
  | C7
  | C8
  | C9
  | C10
  | C11
  | C12
  | C13
  | C14
  | C15
  | C16
  | C17
  | C18
  | C19
  | CX
  | CY
  | CMT
deriving (Show)
  -- define a window
  type Sequence = [Char]
  -- Window data
  data Window = Window { sequen :: Sequence,
 chrom :: Chromosome,
 pos   :: Int
   }
  -- print a window
  instance Show Window where
  show w =  (sequen w) ++ \t ++ show (chrom w) ++ \t ++ show (pos
 w)
 
  -- Reading fasta files with haskell
 
  -- Initialize the
  main = do
 -- get the arguments (intput is
 [input, output, windowSize] - getArgs
 -- get directory contents (only names)
 names - getDirectoryContents input
 -- prepend directory
 let fullNames = filter isFastaFile $ map (\x - input ++ / ++ x)
  names
 let wSize = (read windowSize)::Int
 -- process the directories
 mapM (genomeExecute output wSize filterWindow)  fullNames
 
 
  -- read the files one by one and write them to the output file
  genomeExecute :: String - Int - (Window - Bool) - String - IO ()
  genomeExecute  outputFile windowSize f inputFile = do
fileData - readFile inputFile
appendFile outputFile $ fastaExtractor fileData windowSize f
 
  --
  isFastaFile :: String - Bool
  isFastaFile fileName = isSuffixOf .fa fileName
 
 
  -- fasta extractor (receives a Fasta String and returns a windowed string
  ready to be sorted)
  -- an example on how to compose several functions to parse a fasta file
  fastaExtractor :: String - Int - (Window - Bool) - String
  fastaExtractor input wSize f = printWindowList $ filter f $ readFasta
 wSize
  input
 
  -- MAIN FILTER that removes N elements from the strings!
  filterWindow :: Window - Bool
  filterWindow w = not (elem 'N' (sequen w))
 
  -- print a window list (the printing makes it ready for output as raw
 data)
  printWindowList :: [Window] - String
  printWindowList l = unlines $ map show l
 
  -- read fasta, remove stuff that is not useful from it
  -- removes the
  readFasta :: Int - [Char] - [Window]
  readFasta windowSize sequence =
  -- get the header
  let (header:rest) = lines sequence
  chr = parseChromosome header
  in
 
  -- We now do the following:
  --  take window  create counter
  remove newlines
 map (\(i, w) - Window w chr i) $ zip [0..]  $ slideWindow windowSize
 $
  filter ( '\n' /= )  $ unlines rest
 
 
  slideWindow :: Int - [Char] - [[Char]]
  slideWindow _ [] = []
  slideWindow windowSize l@(_:xs)  = take windowSize l : slideWindow
  windowSize xs
 
 
 
  -- Parse the chromosome from a fasta comment
  -- produce a more compact chromosome representation
  parseChromosome :: [Char] - Chromosome
  parseChromosome line
  | isInfixOf chromosome 1, line = C1
  | isInfixOf chromosome 2, line = C2
  | isInfixOf chromosome 3, line 

Re: [Haskell-cafe] Space leak

2010-03-10 Thread Bulat Ziganshin
Hello Arnoldo,

Wednesday, March 10, 2010, 11:45:56 PM, you wrote:

 I am learning haskell and I found a space leak that I find
 difficult to solve. I've been asking at #haskell but we could not solve
 the issue.

what if you use program B on single file?


-- 
Best regards,
 Bulatmailto:bulat.zigans...@gmail.com

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Space leak

2010-03-10 Thread Arnoldo Muller
Bulat,

The same happens, the memory starts to quickly fill up...

Arnoldo

On Wed, Mar 10, 2010 at 11:16 PM, Bulat Ziganshin bulat.zigans...@gmail.com
 wrote:

 Hello Arnoldo,

 Wednesday, March 10, 2010, 11:45:56 PM, you wrote:

  I am learning haskell and I found a space leak that I find
  difficult to solve. I've been asking at #haskell but we could not solve
  the issue.

 what if you use program B on single file?


 --
 Best regards,
  Bulatmailto:bulat.zigans...@gmail.com


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Space leak

2010-03-10 Thread Daniel Fischer
On Wednesday, 10 March 2010, 23:01:28, Arnoldo Muller wrote:
 Hello Daniel:

 Thanks!
 I employed mapM'_ but I am still getting the space leak.
 Any other hint?


Hmm, offhand, I don't see why that isn't strict enough.
With some datafiles, I could try to investigate.

One question, how does programme C with

main = do
   [input, output, windowSize] - getArgs
   let wSize = (read windowSize)::Int
   genomeExecute output wSize filterWindow input

behave? Space leak or not?

But yes, a few other hints I have (though they're not likely to squash the 
space leak).

Generally, ByteString IO  is often orders of magnitude faster than String 
IO and uses much less memory, so using (lazy) ByteStrings is worthy of 
consideration.
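As a flavour of the switch (a toy sketch only, not a port of the programme:
it just drops the FASTA header line and counts the 'A's using lazy
ByteStrings):

import qualified Data.ByteString.Lazy.Char8 as L

countA :: FilePath -> IO Int
countA path = do
    contents <- L.readFile path
    -- drop the header line, then stream over the rest of the file
    return . fromIntegral . L.count 'A' . L.unlines . drop 1 . L.lines $ contents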



 Arnoldo

-- define a window
type Sequence = [Char]
-- Window data
data Window = Window { sequen :: Sequence,
                       chrom :: Chromosome,
                       pos   :: Int
                     }

-- print a window
instance Show Window where
    show w = (sequen w) ++ "\t" ++ show (chrom w) ++ "\t" ++ show (pos w)

-- Reading fasta files with haskell

-- Initialize the
main = do
       -- get the arguments (intput is
       [input, output, windowSize] <- getArgs
       -- get directory contents (only names)
       names <- getDirectoryContents input
       -- prepend directory
       let fullNames = filter isFastaFile $ map (\x -> input ++ "/" ++ x) names

*
let fullNames = map ((input ++) . ("/" ++)) $ filter isFastaFile names

saves a little work
*

       let wSize = (read windowSize)::Int
       -- process the directories
       mapM (genomeExecute output wSize filterWindow)  fullNames


-- read the files one by one and write them to the output file
genomeExecute :: String -> Int -> (Window -> Bool) -> String -> IO ()
genomeExecute  outputFile windowSize f inputFile = do
  fileData <- readFile inputFile
  appendFile outputFile $ fastaExtractor fileData windowSize f

*
The arguments of fastaExtractor should be reversed, then

genomeExecute outputFile windowSize f inputFile
= appendFile outputFile . fastaExtractor' f windowSize 
   =<< readFile inputFile
*

-- 
isFastaFile :: String -> Bool
isFastaFile fileName = isSuffixOf ".fa" fileName


-- fasta extractor (receives a Fasta String and returns a windowed string ready to be sorted)
-- an example on how to compose several functions to parse a fasta file
fastaExtractor :: String -> Int -> (Window -> Bool) -> String
fastaExtractor input wSize f = printWindowList $ filter f $ readFasta wSize input

*
fastaExtractor' f wSize = printWindowList . filter f . readFasta wSize
*


-- MAIN FILTER that removes N elements from the strings!
filterWindow :: Window - Bool
filterWindow w = not (elem 'N' (sequen w))

*
filterWindow w = 'N' `notElem` sequen w
*

-- print a window list (the printing makes it ready for output as raw data)
printWindowList :: [Window] -> String
printWindowList l = unlines $ map show l

-- read fasta, remove stuff that is not useful from it
-- removes the
readFasta :: Int -> [Char] -> [Window]
readFasta windowSize sequence =
    -- get the header
    let (header:rest) = lines sequence
        chr = parseChromosome header
    in
    -- We now do the following:
    --   take window            create counter         remove newlines
    map (\(i, w) -> Window w chr i) $ zip [0..] $ slideWindow windowSize $
        filter ( '\n' /= ) $ unlines rest

*
filter ('\n' /=) . unlines

is odd. What about concat? Or

readFasta wSize chrseq
   = case span (/= '\n') chrseq of
       (header, _:rest) ->
           let chr = parseChromosome header
           in  map (\(i,w) -> Window w chr i) . zip [0 .. ] .
               slideWindow wSize $ filter (/= '\n') rest
       _ -> []

if your input file format had no other newline than the one between header 
and body, that'd be nice.
*

slideWindow :: Int -> [Char] -> [[Char]]
slideWindow _ [] = []
slideWindow windowSize l@(_:xs) = take windowSize l : slideWindow windowSize xs

*
slideWindow wSize = map (take wSize) . tails
*
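
As a quick check of that definition (my example; note that tails ends with
the empty suffix, so this version yields one extra "" at the end compared
with the explicit recursion, which a filter on window length removes if it
matters):

import Data.List (tails)

slideWindow' :: Int -> [Char] -> [[Char]]
slideWindow' wSize = map (take wSize) . tails

-- ghci> slideWindow' 3 "ACGTN"
-- ["ACG","CGT","GTN","TN","N",""]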

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Space Leak with semi-implicit parallelization and the nasty Garbage collector

2009-12-24 Thread Michael Lesniak
Hello Daniel,

thanks for your fast response. That's strange: On your system total
time elapsed according to GHC is ~190%, on mine (reproducible!) ~140%
(see below). I once had a problem with a particular linux kernel[1],
unfortunately I currently (over the holidays) have no other computers
available.

Does anyone also have an up-to-date ubuntu karmic 9.10 with my kernel
version to confirm or refute the problem's dependency on the linux
kernel?

A bit irritated, but still in a Haskellish Christmas mood,
  Michael

[1] http://www.mail-archive.com/haskell-cafe@haskell.org/msg67148.html


$ time exa +RTS -N2 -ls -sstderr
exa +RTS -N2 -ls -sstderr
1
3
0
0
9
3
6
9
1
8
8
9
2
5
5
8
6
5
7
8
4
  72,748,227,888 bytes allocated in the heap
  65,331,056 bytes copied during GC
 183,032 bytes maximum residency (22 sample(s))
 209,352 bytes maximum slop
   4 MB total memory in use (1 MB lost due to fragmentation)

  Generation 0: 131339 collections, 131338 parallel,  7.29s,  6.25s elapsed
  Generation 1:22 collections,22 parallel,  0.03s,  0.03s elapsed

  Parallel GC work balance: 1.10 (7778368 / 7059369, ideal 2)

MUT time (elapsed)   GC time  (elapsed)
  Task  0 (worker) :   66.38s( 36.22s)   1.75s(  1.37s)
  Task  1 (worker) :   66.40s( 36.25s)   0.00s(  0.00s)
  Task  2 (worker) :   68.13s( 36.25s)   0.00s(  0.00s)
  Task  3 (worker) :   62.56s( 36.25s)   5.57s(  4.91s)

  SPARKS: 21 (21 converted, 0 pruned)

  INIT  time0.00s  (  0.01s elapsed)
  MUT   time   60.81s  ( 36.25s elapsed)
  GCtime7.32s  (  6.28s elapsed)
  EXIT  time0.00s  (  0.00s elapsed)
  Total time   66.40s  ( 42.54s elapsed)

  %GC time  11.0%  (14.8% elapsed)

  Alloc rate1,231,351,182 bytes per MUT second

  Productivity  89.0% of total user, 138.9% of total elapsed

gc_alloc_block_sync: 61137
whitehole_spin: 0
gen[0].steps[0].sync_large_objects: 17681
gen[0].steps[1].sync_large_objects: 20647
gen[1].steps[0].sync_large_objects: 0

real0m42.626s
user1m6.400s
sys 0m0.510s


2009/12/24 Daniel Fischer daniel.is.fisc...@web.de:
 On Thursday 24 December 2009 02:14:51, Michael Lesniak wrote:

 Hello haskell-cafe (and merry christmas!),



 I have a strange problem with the garbage collector / memory which I'm

 unable to find a solution for. I think the source of my problems has to do

 with lazy evaluation, but currently I'm unable to find it.



 Using the attached program and threadscope I see that the GC is using a
 lot

 of time and the program comes (more or less) to a halt (see exa-1.png).

 When I increase the heap the program takes much longer and the GC is

 running more or less all the time (see exa-2.png).



 Some more detailled information:



 * I can see the described behaviour under both GHC 10.4 and GHC 12.1

 * Linux kernel 2.6.31-16 on a dualcore

 * Program compiled with ghc --make -O2 -threaded -eventlog Example.hs -o

 exa * Started with exa +RTS -ls and one of { -N, -N1, -N2 }



 Any help (pointing into the right direction, mention possibly helpful

 blog articles

 or paper, constructive critic in general) is appreciated!

 I can't reproduce that (ghc-6.12.1, Linux linux-mkk1 2.6.27.39-0.2-pae #1
 SMP 2009-11-23 12:57:38 +0100 i686 i686 i386 GNU/Linux, dual core):

 $ time ./exa +RTS -ls -N2 -sstderr

 ./exa +RTS -ls -N2 -sstderr

 1

 3

 0

 0

 9

 3

 6

 9

 1

 8

 8

 9

 2

 5

 5

 8

 6

 5

 7

 8

 4

 72,499,126,908 bytes allocated in the heap

 45,280,708 bytes copied during GC

 177,504 bytes maximum residency (10 sample(s))

 183,844 bytes maximum slop

 4 MB total memory in use (1 MB lost due to fragmentation)

 Generation 0: 131527 collections, 131526 parallel, 7.18s, 3.88s elapsed

 Generation 1: 10 collections, 10 parallel, 0.00s, 0.00s elapsed

 Parallel GC work balance: 1.16 (10931266 / 9433437, ideal 2)

 MUT time (elapsed) GC time (elapsed)

 Task 0 (worker) : 115.10s (126.56s) 3.34s ( 1.84s)

 Task 1 (worker) : 124.21s (126.56s) 3.84s ( 2.04s)

 Task 2 (worker) : 0.09s (126.56s) 0.00s ( 0.00s)

 Task 3 (worker) : 0.00s (126.56s) 0.00s ( 0.00s)

 SPARKS: 21 (21 converted, 0 pruned)

 INIT time 0.00s ( 0.13s elapsed)

 MUT time 238.05s (126.56s elapsed)

 GC time 7.19s ( 3.88s elapsed)

 EXIT time 0.00s ( 0.00s elapsed)

 Total time 244.46s (130.57s elapsed)

 %GC time 2.9% (3.0% elapsed)

 Alloc rate 305,559,453 bytes per MUT second

 Productivity 97.1% of total user, 181.7% of total elapsed

 gc_alloc_block_sync: 151252

 whitehole_spin: 0

 gen[0].steps[0].sync_large_objects: 75620

 gen[0].steps[1].sync_large_objects: 9662

 gen[1].steps[0].sync_large_objects: 0

 244.45user 2.06system 2:10.58elapsed 188%CPU (0avgtext+0avgdata
 0maxresident)k

 0inputs+35736outputs (0major+2426minor)pagefaults 0swaps

 Garbage collection isn't even visible in the threadscope profile.

 With -N1:

 71,999,280,108 bytes allocated in the heap

 20,729,380 bytes 

Re: [Haskell-cafe] Space Leak with semi-implicit parallelization and the nasty Garbage collector

2009-12-23 Thread Daniel Fischer
On Thursday 24 December 2009 02:14:51, Michael Lesniak wrote:
 Hello haskell-cafe (and merry christmas!),

 I have a strange problem with the garbage collector / memory which I'm
 unable to find a solution for. I think the source of my problems has to do
 with lazy evaluation, but currently I'm unable to find it.

 Using the attached program and threadscope I see that the GC is using a lot
 of time and the program comes (more or less) to a halt (see exa-1.png).
 When I increase the heap the program takes much longer and the GC is
 running more or less all the time (see exa-2.png).

 Some more detailled information:

 * I can see the described behaviour under both GHC 10.4 and GHC 12.1
 * Linux kernel 2.6.31-16 on a dualcore
 * Program compiled with ghc --make -O2 -threaded -eventlog Example.hs -o
 exa * Started with exa +RTS -ls and one of { -N, -N1, -N2 }

 Any help (pointing into the right direction, mention possibly helpful
 blog articles
 or paper, constructive critic in general) is appreciated!

I can't reproduce that  (ghc-6.12.1, Linux linux-mkk1 2.6.27.39-0.2-pae #1 SMP 
2009-11-23 
12:57:38 +0100 i686 i686 i386 GNU/Linux, dual core):

$  time ./exa +RTS -ls -N2 -sstderr 
./exa +RTS -ls -N2 -sstderr  
1
3
0
0
9
3
6
9
1
8
8
9
2
5
5
8
6
5
7
8
4
  72,499,126,908 bytes allocated in the heap 
  45,280,708 bytes copied during GC  
 177,504 bytes maximum residency (10 sample(s))
 183,844 bytes maximum slop
   4 MB total memory in use (1 MB lost due to fragmentation)

  Generation 0: 131527 collections, 131526 parallel,  7.18s,  3.88s elapsed
  Generation 1:10 collections,10 parallel,  0.00s,  0.00s elapsed

  Parallel GC work balance: 1.16 (10931266 / 9433437, ideal 2)

MUT time (elapsed)   GC time  (elapsed)
  Task  0 (worker) :  115.10s(126.56s)   3.34s(  1.84s)
  Task  1 (worker) :  124.21s(126.56s)   3.84s(  2.04s)
  Task  2 (worker) :0.09s(126.56s)   0.00s(  0.00s)
  Task  3 (worker) :0.00s(126.56s)   0.00s(  0.00s)

  SPARKS: 21 (21 converted, 0 pruned)

  INIT  time0.00s  (  0.13s elapsed)
  MUT   time  238.05s  (126.56s elapsed)
  GCtime7.19s  (  3.88s elapsed)
  EXIT  time0.00s  (  0.00s elapsed)
  Total time  244.46s  (130.57s elapsed)

  %GC time   2.9%  (3.0% elapsed)

  Alloc rate305,559,453 bytes per MUT second

  Productivity  97.1% of total user, 181.7% of total elapsed

gc_alloc_block_sync: 151252
whitehole_spin: 0
gen[0].steps[0].sync_large_objects: 75620
gen[0].steps[1].sync_large_objects: 9662
gen[1].steps[0].sync_large_objects: 0
244.45user 2.06system 2:10.58elapsed 188%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+35736outputs (0major+2426minor)pagefaults 0swaps

Garbage collection isn't even visible in the threadscope profile.

With -N1:
  71,999,280,108 bytes allocated in the heap
  20,729,380 bytes copied during GC
  92,492 bytes maximum residency (2 sample(s))
 190,872 bytes maximum slop
   3 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0: 130901 collections, 0 parallel,  2.64s,  2.68s elapsed
  Generation 1: 2 collections, 0 parallel,  0.00s,  0.00s 

[Haskell-cafe] space leak hints?

2009-07-03 Thread Uwe Hollerbach
Good evening, all, I wonder if I could tap your collective wisdom
regarding space leaks? I've been messing about with haskeem, my little
scheme interpreter, and I decided to see if I could make it run
reasonably space-efficiently. So far... no.

Here's what I tried: I wrote a tiny scheme program to compute Collatz
sequences for successive numbers, starting from 1 and incrementing
forever (well, in principle). Because real scheme implementations are
fully tail-call-optimized, this'll run in constant memory; I checked
that with mzscheme, and it does indeed work. With my little
interpreter, that's not the case: memory usage grows continually,
although apparently less-than-linearly. I've built the interpreter
with the profiling stuff described on the wiki and in RWH Ch 25 turned
on and have made a few runs with that; I stuck the postscript plot
that's the result of one of those runs onto my web site at
http://www.korgwal.com/haskeem/run2.ps.

The full source to the interpreter is a little large to paste into
this message; it's available on my web site, and also on hackage. But
according to the plot, there appear to be three main memory allocation
issues, and they seem to all be related, if I'm reading stuff
correctly. The core of the interpreter is a function, evalLisp, which
evaluates scheme forms. There are of course a fair number of different
forms, but the largest generic usage is to evaluate each of a list of
forms, returning the value of the last of them as the overall result.
In order to express that in a couple of different places, and to
accomodate the possibility of an empty list, I have a really tiny
function lastOrNil which just calls last on a (haskell) list,
checking for the possibility of an empty list, and returning a haskeem
LispVal object:

 lastOrNil = liftM lON
   where lON [] = List []
         lON l  = last l

(sorry, proportional fonting may be throwing this off).

It's this function which is directly implicated in two of the top
three memory pigs, and nearly directly in the third one as well. If I
could eliminate the memory growth in these three cost centers, I would
already capture over 90% of the growth in this benchmark, which would
make me very happy indeed. But I don't really understand what is going
on here. It seems entirely plausible, indeed likely, that the list
which I'm traversing there is not fully evaluated. So I've tried
adding 'seq' to this function. Uhhh... from memory, I dumped it after
it didn't work, I had

 lastOrNil = liftM lON
   where lON []     = List []
         lON (l:[]) = l
         lON (l:ls) = seq l (lON ls)

(again, proportional fonting might mess me up here.)

As I said, I dumped this after it made no difference whatsoever. I
also tried bang-patterns in a couple of places, strictness annotation
of my basic LispVal types... nothing. It all made exactly no
difference, as far as I could tell. I've tried a couple of google
searches, but haven't come up with anything better than what I've
already described. So I'm a stumped chump!

I'd be grateful for any suggestions... thanks in advance!

Uwe
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] space leak with 'concat' ?

2009-01-27 Thread Henning Thielemann


$ ghc +RTS -M16m -c30 -RTS -e 'concat $ repeat "bla"'

This breaks down after a while, also if I increase the memory restriction:

...
ablablablablablablablablablablablablablablablablablablablablablaHeap exhausted;
Current maximum heap size is 15998976 bytes (15 Mb);
use `+RTS -Msize' to increase it.


Whereas this one works:

$ ghc +RTS -M16m -c30 -RTS -e 'cycle "bla"'



'concat' seems to be the culprit. What's so bad about it?
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] space leak with 'concat' ?

2009-01-27 Thread Jonathan Cast
On Tue, 2009-01-27 at 22:12 +0100, Henning Thielemann wrote:
 $ ghc +RTS -M16m -c30 -RTS -e 'concat $ repeat "bla"'
 
 This breaks down after a while, also if I increase the memory restriction:
 
 ...
 ablablablablablablablablablablablablablablablablablablablablablaHeap 
 exhausted;
 Current maximum heap size is 15998976 bytes (15 Mb);
 use `+RTS -Msize' to increase it.
 
 
 Whereas this one works:
 
 $ ghc +RTS -M16m -c30 -RTS -e 'cycle "bla"'

 'concat' seems to be the culprit. What's so bad about it?

cycle builds a circular list:

  cycle xn = let ys = xn ++ ys in ys

which concat isn't smart enough to do (obviously).  That's not an issue
normally, since the list is being consumed in a properly lazy fashion.
But it looks like, for some reason, ghc -e is holding a pointer to the
entire expression-to-evaluate in this case, so cons cells that have
already been printed don't get garbage collected.  To show that there's
nothing wrong with concat per se, try this version instead:

  ghc +RTS -M16m -c30 -RTS -e 'print $ concat $ repeat "bla"'

This should print forever without any problems.
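
To make the difference concrete, here is a small self-contained sketch
(mine, not the ghc -e invocation above):

module Main where

-- 'cycle' ties a knot: one finite ring of cons cells is reused forever.
-- 'concat (repeat xs)' allocates fresh cons cells as it goes, which is
-- still constant space provided nothing holds on to the head of the list;
-- that retained head is what ghc -e appears to be doing here.
main :: IO ()
main = do
  putStrLn (take 30 (cycle "bla"))
  putStrLn (take 30 (concat (repeat "bla")))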

jcc


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] space leak with 'concat' ?

2009-01-27 Thread Henning Thielemann


On Tue, 27 Jan 2009, Jonathan Cast wrote:

To show that there's nothing wrong with concat per se, try this version 
instead:


  ghc +RTS -M16m -c30 -RTS -e 'print $ concat $ repeat "bla"'

This should print forever without any problems.


You are right, this works. My example was extracted from a larger module. 
But in that module I defined the text to be printed as top-level variable 
which might have been the problem. But this can't be the problem of the 
compiled version of the program, where I encountered the leak. So I have 
to keep on searching that leak.

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] space leak with 'concat' ?

2009-01-27 Thread Henning Thielemann


On Tue, 27 Jan 2009, Henning Thielemann wrote:



On Tue, 27 Jan 2009, Jonathan Cast wrote:

To show that there's nothing wrong with concat per se, try this version 
instead:


  ghc +RTS -M16m -c30 -RTS -e 'print $ concat $ repeat "bla"'

This should print forever without any problems.


You are right, this works. My example was extracted from a larger module. But 
in that module I defined the text to be printed as top-level variable which 
might have been the problem. But this can't be the problem of the compiled 
version of the program, where I encountered the leak. So I have to keep on 
searching that leak.


Yes, the top-level variables seem to be the leak. I had to turn them into 
functions with dummy arguments _and_ attach INLINE pragmas to them in 
order to keep GHC away from buffering their content. Is there a less ugly 
way to achieve this?

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] space leak with 'concat' ?

2009-01-27 Thread Jake McArthur


Henning Thielemann wrote:
| in that module I defined the text to be printed as top-level
| variable which might have been the problem. But this can't be the
| problem of the compiled version of the program, where I encountered the
| leak. So I have to keep on searching that leak.

You have created a constant applicative form (commonly abbreviated CAF).
GHC assumes that all top level declarations are constants, and simply
does not garbage collect them. In the case of infinite structures, this
can be a bad thing. This *does* affect even compiled code.

The best way to avoid the problem, of course, is to avoid having
infinite constants at the top level. Assuming that is impossible, your
solution seems acceptable to me. Somebody more knowledgeable or creative
than I may be able to come up with something nicer, though.
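
A small sketch of the two shapes (with made-up names):

-- Top level: a CAF.  Whatever prefix of it gets evaluated stays reachable
-- for the rest of the run, so demanding more and more of it leaks.
blas :: String
blas = concat (repeat "bla")

-- Local to its single consumer: the list is rebuilt on each call and the
-- already-printed prefix can be collected behind the consumer.
printBlas :: Int -> IO ()
printBlas n = putStrLn (take n (concat (repeat "bla")))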

- Jake
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] space leak with 'concat' ?

2009-01-27 Thread Sterling Clover
Note that only monomorphic declarations are CAFed. If you have an explicit
polymorphic signature, it will be treated as a function and
garbage-collected as usual. So if you have, e.g., a list of Doubles,
declaring it as foo :: Num a => [a] would do the trick.
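
A minimal sketch of that trick:

-- Monomorphic: a CAF, so the evaluated prefix is retained for the whole run.
squaresMono :: [Integer]
squaresMono = map (^ 2) (iterate (+ 1) 1)

-- Polymorphic: the Num constraint turns the binding into a function of a
-- dictionary, so the list is rebuilt at each use site and can be collected
-- (barring aggressive specialisation by the optimiser).
squaresPoly :: Num a => [a]
squaresPoly = map (^ 2) (iterate (+ 1) 1)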
Cheers,
S.

On Tue, Jan 27, 2009 at 6:22 PM, Jake McArthur j...@pikewerks.com wrote:


 Henning Thielemann wrote:
 | in that module I defined the text to be printed as top-level
 | variable which might have been the problem. But this can't be the
 | problem of the compiled version of the program, where I encountered the
 | leak. So I have to keep on searching that leak.

 You have created a constant applicative form (commonly abbreviated CAF).
 GHC assumes that all top level declarations are constants, and simply
 does not garbage collect them. In the case of infinite structures, this
 can be a bad thing. This *does* affect even compiled code.

 The best way to avoid the problem, of course, is to avoid having
 infinite constants at the top level. Assuming that is impossible, your
 solution seems acceptable to me. Somebody more knowledgeable or creative
 than I may be able to come up with something nicer, though.

 - Jake

 ___
 Haskell-Cafe mailing list
 Haskell-Cafe@haskell.org
 http://www.haskell.org/mailman/listinfo/haskell-cafe

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] Space leak with Data.Binary and decodeFile

2009-01-11 Thread Maxime Henrion
Hello all,


I've been observing a huge space leak with some code using Data.Binary
that I cannot make sense of and I hope someone here can shed some light
on this, so I'll try to explain my problem as clearly as possible.  I
qualify the space leak as huge because if I let the program run, it will
soon consume the whole memory available (~3G) and finally will get
killed by the system.

The code I'm writing implements a search algorithm using an inverted
index.  This index is built from a Trie [Int] (from the bytestring-trie
package) and an Array Int ByteString.  The trie maps each referenced
word to an integer list that is a list of indices into the array.  Here
is the code for the Index datatype and the obvious Binary instance:

data Index = Index { entries :: Array Int ByteString
                   , invidx  :: Trie [Int]
                   }

instance Binary Index where
    put (Index dirs idx) = put dirs >> put idx
    get = liftM2 Index get get

I have no problems creating and seralizing this data structure to a
file.  The huge leak appears instead when I'm reading this data
structure from a file and try to do something with it.

This is the smallest test case I came up with that can reproduce the
problem :

main = do idx <- decodeFile "list.idx"; mapM_ (B.putStrLn . snd) (assocs (entries idx))

The space leak also appears when I try to touch the trie instead of the
array.  I've been trying tons of combinations involving adding or
removing strictness annotations and seq calls in various places with no
luck.  I have also been adding SCC annotations and tried to profile the
code.  This seemed to suggest the space leak happens in the get method
of the Array instance of Binary :

instance (Binary i, Ix i, Binary e) => Binary (Array i e) where
    [...]
    get = do
        bs <- get
        n  <- get  -- read the length
        xs <- replicateM n get -- now the elems.
        return (listArray bs xs)

The output of the profiler tells me that all the space gets allocated
from the replicateM n get expression.

Now for the really weird part: if I load my code in GHCi and type
main, I can observe the space leak.  However, if I copy paste the
definition of main instead, the code runs fine!  This is the only
circumstance I've seen this code work instead of eating all the RAM...

I have been using GHC 6.10.1, binary 0.4.4 and bytestring-trie 0.1.2.

If there's anything else that I can do to understand what's going on, I
would gladly hear about it.  Please also tell me if I should provide
more information.

Thanks,
-- 
Maxime Henrion

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Space leak - help needed

2008-03-14 Thread Justin Bailey
On Thu, Mar 13, 2008 at 4:50 PM, Krzysztof Kościuszkiewicz
[EMAIL PROTECTED] wrote:
  Retainers are thunks or objects on stack that keep references to
  live objects. All retainers of an object are called the object's
  retainer set.  Now when one makes a profiling run, say with ./jobname
  +RTS -p -hr, the graph refernces retainer sets from jobname.prof. My
  understanding is that it is the total size of all objects retained by
  retainer sets being plotted, correct?

Yes, all retainer sets are being profiled. However, you can FILTER the
retainer sets profiled to those containing certain cost-centres. This
is a key point because it allows you to divide-and-conquer when
tracking down a retainer leak. That is, if you filter to a certain
cost-centre and the retainer graph is flat, you know that cost-centre
is not involved. For example, if you have a cost-centre annotation
like {-# SCC "leaky" #-} in your code, you can filter the retainer set
like this:

  Leaky.exe +RTS -hr -hCleaky -RTS

Review the documentation for other options.


  About decoding the sets from jobname.prof - for example in

   SET 2 = {MAIN.SYSTEM}
   SET 16 = {Main.CAF, MAIN.SYSTEM}
   SET 18 = {MAIN.SYSTEM, Main.many1,Main.list,Main.expr,Main.CAF}

  {...} means it's a set, and ccN,...,cc0 is the retainer cost centre
  (ccN) and hierarchy of parent cost centres up to the top level (cc0)?

  My understanding is that SET 18 above refers to objects that are
  retained by exactly two specified cost centres, right?


The docs say

  An object B retains object A if (i) B is a retainer object and (ii)
object A can be reached by recursively following pointers starting
from object B, but not meeting any other retainer objects on the way.
Each live object is retained by one or more retainer objects,
collectively called its retainer set ...

That says to me that SET18 above is the set of all objects which are
retained by those two call stacks, and only those call stacks. The
individual .. items aren't call stacks but I think they refer to
where the retaining object (B in the paragraph) was itself retained,
so they are like call stacks. My intuition is very fuzzy here.

  Finally, what is the MAIN.SYSTEM retainer?

I think that is everything else - any object created in the runtime
system that is not directly attributable to something being profiled.
Maybe it is objects from libraries that were not compiled with
profiling? I imagine objects created by the GHC primitives would fall
in this category too.

Since someone else found your space leak, does the retainer profiling
advice point to it? I'd like to know if it is actually accurate or
not! I've only applied it in some very limited situations.

Justin
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Space leak - help needed

2008-03-13 Thread Bertram Felgenhauer
Krzysztof Kościuszkiewicz wrote:
 I have tried both Poly.StateLazy and Poly.State and they work quite well
 - at least the space leak is eliminated. Now evaluation of the parser
 state blows the stack...
 
 The code is at http://hpaste.org/6310

Apparently, stUpdate is too lazy. I'd define

stUpdate' :: (s -> s) -> Parser s t ()
stUpdate' f = stUpdate f >> stGet >>= (`seq` return ())

and try using stUpdate' instead of stUpdate in incCount.
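
For comparison, the same idea in plain Control.Monad.State (the polyparse
combinators differ in detail, but the point is identical):

import Control.Monad.State

-- Force the new state before continuing, so a long chain of pending (+1)
-- thunks never builds up inside the parser state.
incCountStrict :: State Int ()
incCountStrict = do
    n <- get
    put $! n + 1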

HTH,

Bertram
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Space leak - help needed

2008-03-13 Thread Krzysztof Kościuszkiewicz
On Thu, Mar 13, 2008 at 05:52:05PM +0100, Bertram Felgenhauer wrote:

  ... Now evaluation of the parser state blows the stack...
  
  The code is at http://hpaste.org/6310
 
 Apparently, stUpdate is too lazy. I'd define
 
 stUpdate' :: (s -> s) -> Parser s t ()
 stUpdate' f = stUpdate f >> stGet >>= (`seq` return ())
 
 and try using stUpdate' instead of stUpdate in incCount.

Yes, that solves the stack issue. Thanks!
-- 
Krzysztof Kościuszkiewicz
Skype: dr.vee,  Gadu: 111851,  Jabber: [EMAIL PROTECTED]
Simplicity is the ultimate sophistication -- Leonardo da Vinci
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Space leak - help needed

2008-03-13 Thread Krzysztof Kościuszkiewicz
On Wed, Mar 12, 2008 at 12:34:38PM -0700, Justin Bailey wrote:

 The stack blows up when a bunch of unevaluated thunks build up, and
 you try to evaluate them. One way to determine where those thunks are
 getting built is to use GHCs retainer profiling. Retainer sets will
 show you the call stack that is holding on to memory. That can give
 you a clue where these thunks are being created. To get finer-grained
 results, annotate your code with {-# SCC ... #-} pragmas. Then you
 can filter the retainer profile by those annotations. That will help
 you determine where in a given function the thunks are being created.
 
 If you need help with profiling basics, feel free to ask.

I'm not entirely sure if I understand retainer profiling correctly... So
please clarify if you spot any obvious blunders.

Retainers are thunks or objects on stack that keep references to
live objects. All retainers of an object are called the object's
retainer set.  Now when one makes a profiling run, say with ./jobname
+RTS -p -hr, the graph refernces retainer sets from jobname.prof. My
understanding is that it is the total size of all objects retained by
retainer sets being plotted, correct?

About decoding the sets from jobname.prof - for example in

 SET 2 = {MAIN.SYSTEM}
 SET 16 = {Main.CAF, MAIN.SYSTEM}
 SET 18 = {MAIN.SYSTEM, Main.many1,Main.list,Main.expr,Main.CAF}

{...} means it's a set, and ccN,...,cc0 is the retainer cost centre
(ccN) and hierarchy of parent cost centres up to the top level (cc0)?

My understanding is that SET 18 above refers to objects that are
retained by exactly two specified cost centres, right?

Finally, what is the MAIN.SYSTEM retainer?

Thanks in advance,
-- 
Krzysztof Kościuszkiewicz
Skype: dr.vee,  Gadu: 111851,  Jabber: [EMAIL PROTECTED]
Simplicity is the ultimate sophistication -- Leonardo da Vinci
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Space leak - help needed

2008-03-12 Thread Krzysztof Kościuszkiewicz
On Mon, Mar 03, 2008 at 05:20:09AM +0100, Bertram Felgenhauer wrote:

  Another story from an (almost) happy Haskell user that finds himself
  overwhelmed by laziness/space leaks.
  
  I'm trying to parse a large file (600MB) with a single S-expression
  like structure. With the help of ByteStrings I'm down to 4min processing
  time in constant space. However, when I try to wrap the parse results
  in a data structure, the heap blows up - even though I never actually
  inspect the structure being built! This bugs me, so I come here looking
  for answers.
 
 The polyparse library (http://www.cs.york.ac.uk/fp/polyparse/)
 offers some lazy parsers, maybe one of those fits your needs.
 Text.ParserCombinators.Poly.StateLazy is the obvious candidate.

I have tried both Poly.StateLazy and Poly.State and they work quite well
- at least the space leak is eliminated. Now evaluation of the parser
state blows the stack...

The code is at http://hpaste.org/6310

Thanks in advance,
-- 
Krzysztof Kościuszkiewicz
Skype: dr.vee,  Gadu: 111851,  Jabber: [EMAIL PROTECTED]
Simplicity is the ultimate sophistication -- Leonardo da Vinci
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Space leak - help needed

2008-03-12 Thread Justin Bailey
On Wed, Mar 12, 2008 at 12:12 PM, Krzysztof Kościuszkiewicz
[EMAIL PROTECTED] wrote:
  I have tried both Poly.StateLazy and Poly.State and they work quite well
  - at least the space leak is eliminated. Now evaluation of the parser
  state blows the stack...

  The code is at http://hpaste.org/6310

  Thanks in advance,

The stack blows up when a bunch of unevaluated thunks build up, and
you try to evaluate them. One way to determine where those thunks are
getting built is to use GHCs retainer profiling. Retainer sets will
show you the call stack that is holding on to memory. That can give
you a clue where these thunks are being created. To get finer-grained
results, annotate your code with {-# SCC ... #-} pragmas. Then you
can filter the retainer profile by those annotations. That will help
you determine where in a given function the thunks are being created.
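
A tiny sketch of that workflow (the function and the cost-centre name below
are made up):

buildTable :: [Int] -> [(Int, Int)]
buildTable xs = {-# SCC "leaky" #-} zip xs (map (* 2) xs)

Compile with profiling enabled (e.g. ghc -prof -auto-all) and run with
./prog +RTS -hr -hCleaky -RTS to restrict the retainer profile to retainer
sets that mention that cost centre.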

If you need help with profiling basics, feel free to ask.

Justin
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] Space leak - help needed

2008-03-02 Thread Krzysztof Kościuszkiewicz
Dear Haskellers,

Another story from an (almost) happy Haskell user that finds himself
overwhelmed by laziness/space leaks.

I'm trying to parse a large file (600MB) with a single S-expression
like structure. With the help of ByteStrings I'm down to 4min processing
time in constant space. However, when I try to wrap the parse results
in a data structure, the heap blows up - even though I never actually
inspect the structure being built! This bugs me, so I come here looking
for answers.

Parser follows:

 module Main where
 
 import qualified Data.ByteString.Lazy.Char8 as B
 import Text.ParserCombinators.Parsec
 import Text.ParserCombinators.Parsec.Pos
 import System.Environment
 import System.Exit
 import qualified Data.Map as M
 import Lexer
 
 type XdlParser a = GenParser Token XdlState a
 
 -- Parser state
 type XdlState = Counts
 
 type Counts = M.Map Count Integer
 
 data Count = ListCount | SymbolCount
 deriving (Eq, Ord, Show)
 
 emptyXdlState = M.empty
 
 incCount :: Count -> (Counts -> Counts)
 incCount c = M.insertWith' (+) c 1
 
 -- handling tokens
 myToken  :: (Token -> Maybe a) -> XdlParser a
 myToken test  = token showTok posTok testTok
 where
 showTok = show
 posTok  = const (initialPos "")
 testTok = test
 
 -- Syntax of expressions
 data Exp = Sym !B.ByteString | List ![Exp]
 deriving (Eq, Show)
 
 expr =  list <|> symbol
 
 rparen = myToken $ \t -> case t of
 RParen  -> Just ()
 other   -> Nothing
 
 lparen = myToken $ \t -> case t of
 LParen  -> Just ()
 other   -> Nothing
 
 name = myToken $ \t -> case t of
 Name n -> Just n
 other  -> Nothing
 
 list = do
 updateState $ incCount ListCount
 lparen
 xs <- many1 expr
 rparen
 return ()
 --  return $! (List xs)
 
 symbol = do
 updateState $ incCount SymbolCount
 name >> return ()
 --  Sym `fmap` name
 
 -- Top level parser
 top :: XdlParser XdlState
 top =  do
 l <- many1 list
 eof
 getState
 
 main = do
 args <- getArgs
 case args of
 [fname] -> do
 text <- B.readFile fname
 let result = runParser top emptyXdlState fname (tokenize text)
 putStrLn $ either show show result
 _ -> putStrLn "usage: parse filename" >> exitFailure

And the Lexer:

 module Lexer (Token(..), tokenize) where
 
 import qualified Data.ByteString.Lazy.Char8 as B
 import Control.Monad
 import Data.Char
 import Data.List
 import System.Environment
 import System.Exit
 
 data Token = LParen
| RParen
| Name B.ByteString
 deriving (Ord, Eq, Show)
 
 type Input = B.ByteString
 
 -- Processor returns Nothing if it can't process the Input
 type Processor = Input -> Maybe ([Token], Input)
 
 -- Tokenize ends the list when all processors return Nothing
 tokenize :: Input -> [Token]
 tokenize  = concat . unfoldr processAll
 where
 processors= [doSpaces, doComment, doParens, doName]
 processAll   :: Processor
 processAll bs = if B.null bs 
 then Nothing
 else foldr mminus Nothing $ map ($ bs) processors
 mminus a@(Just _) _ = a
 mminus Nothing        b = b
 
 doSpaces:: Processor
 doSpaces bs =
 if B.null sp
 then Nothing
 else Just ([], nsp)
 where
 (sp, nsp) = B.span isSpace bs
 
 doComment:: Processor
 doComment bs =
 if B.pack "# " `B.isPrefixOf` bs
 then Just ([], B.dropWhile (/= '\n') bs)
 else Nothing
 
 doParens :: Processor
 doParens bs  = case B.head bs of
 '(' -> Just ([LParen], B.tail bs)
 ')' -> Just ([RParen], B.tail bs)
 _   -> Nothing
 
 doName   :: Processor
 doName  bs   =
 if B.null nsp
 then Nothing
 else Just ([Name nsp], sp)
 where
 (nsp, sp) = B.span (not . isRest) bs
 isRest c = isSpace c || c == ')' || c == '('

Regards,
-- 
Krzysztof Kościuszkiewicz
Skype: dr.vee,  Gadu: 111851,  Jabber: [EMAIL PROTECTED]
Simplicity is the ultimate sophistication -- Leonardo da Vinci
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Space leak - help needed

2008-03-02 Thread Luke Palmer
On Mon, Mar 3, 2008 at 2:23 AM, Krzysztof Kościuszkiewicz
[EMAIL PROTECTED] wrote:
 Dear Haskellers,

  Another story from an (almost) happy Haskell user that finds himself
  overwhelmed by laziness/space leaks.

  I'm trying to parse a large file (600MB) with a single S-expression
  like structure. With the help of ByteStrings I'm down to 4min processing
  time in constant space. However, when I try to wrap the parse results
  in a data structure, the heap blows up - even though I never actually
  inspect the structure being built! This bugs me, so I come here looking
  for answers.

Well, I haven't read this through, but superficially, it looks like
you're expecting the data structure to be constructed lazily.  But...

   -- Syntax of expressions
   data Exp = Sym !B.ByteString | List ![Exp]
   deriving (Eq, Show)

It is declared as strict, so it's not going to be constructed lazily...
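
A minimal illustration of what the bangs change (my own example, not your
types):

-- With a lazy field, a node can be built without touching its argument;
-- with a strict field, building the node forces the argument, so forcing
-- the root of a strict-spined structure forces all of it.
data LazyList   = LNil | LCons Int LazyList
data StrictList = SNil | SCons Int !StrictList

lazyOk :: LazyList
lazyOk = LCons 1 (error "never evaluated unless you walk the tail")

strictBoom :: StrictList
strictBoom = SCons 1 (error "forced as soon as this node is evaluated")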

Luke
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Space leak - help needed

2008-03-02 Thread Bertram Felgenhauer
Krzysztof Kościuszkiewicz wrote:
 Another story from an (almost) happy Haskell user that finds himself
 overwhelmed by laziness/space leaks.
 
 I'm trying to parse a large file (600MB) with a single S-expression
 like structure. With the help of ByteStrings I'm down to 4min processing
 time in constant space. However, when I try to wrap the parse results
 in a data structure, the heap blows up - even though I never actually
 inspect the structure being built! This bugs me, so I come here looking
 for answers.

Note that Parsec has to parse the whole file before it can decide
whether to return an error (Left _) or a result (Right _). ghc would
have to be quite smart to eliminate the creation of the expression
tree entirely.

The polyparse library (http://www.cs.york.ac.uk/fp/polyparse/)
offers some lazy parsers, maybe one of those fits your needs.
Text.ParserCombinators.Poly.StateLazy is the obvious candidate.

HTH,

Bertram
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] space leak?

2007-11-02 Thread Justin Bailey
Massimiliano,

I had to update your code for it to compile (removed "sequence" from
testpdf'). However, I don't see any significant difference in the
memory profile of either testpdf or testpdf'.

Not sure how you are watching the memory usage, but if you didn't know
the option +RTS -sstderr will print out useful memory statistics
when you run your program. E.g.:

   pdf_test.exe +RTS -sstderr

gives:

  2,157,524,764 bytes allocated in the heap
  246,516,688 bytes copied during GC (scavenged)
  6,086,688 bytes copied during GC (not scavenged)
  45,107,704 bytes maximum residency (8 sample(s))
   4086 collections in generation 0 (  0.61s)
  8 collections in generation 1 (  0.67s)
129 Mb total memory in use
  INIT  time0.02s  (  0.00s elapsed)
  MUT   time5.83s  (  7.48s elapsed)
  GCtime1.28s  (  1.45s elapsed)
  RPtime0.00s  (  0.00s elapsed)
  PROF  time0.00s  (  0.00s elapsed)
  EXIT  time0.00s  (  0.00s elapsed)
  Total time7.13s  (  8.94s elapsed)
  %GC time  18.0%  (16.3% elapsed)
  Alloc rate369,202,098 bytes per MUT second
  Productivity  81.8% of total user, 65.2% of total elapsed

Above you can see 45 MB was the max amount of memory ever in use - and
according to the heap profiling I did it's about constant. I saw the
same results when using testpdf'.

A few tricks I've learned to reduce space usage:

  * Use strict returns ( return $! ...)
  * foldl' over foldr unless you have to use foldr. (A short sketch of these
two tips follows this list.)
  * Profile, profile, profile - understand who is hanging on to the
memory (+RTS -hc) and how it's being used (+RTS -hb).
  * Use +RTS -p to understand who's doing all the allocations and
where your time is being spent.
  * Approach profiling like a science experiment - make one change,
observe if anything is different, rollback and make another change -
observe the change. Keep notes!
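
A generic sketch of the first two tips (not taken from your PDF code):

import Data.List (foldl')

-- foldl' keeps the accumulator evaluated at every step, and 'return $!'
-- hands the caller a value instead of an unevaluated thunk.
sumFile :: FilePath -> IO Integer
sumFile path = do
    contents <- readFile path
    return $! foldl' (+) 0 (map (fromIntegral . fromEnum) contents)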

Good luck!

Justin
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] space leak?

2007-11-02 Thread Massimiliano Gubinelli
 ( these two lines are just to fool the gmane post algorithm which 
   complains for top-posting)


Hi,
 i'm learning Haskell and trying to use the HPDF 1.2 library I've come
 across some large memory  consumption for which I do not understand
 the origin. I've tried heap profiling but without much  success.
This is my code 


 module Main where

 import Control.Monad.State
 import Graphics.PDF 

 data Opcodes = Rect | Ship  deriving (Show)

 doPage (Rect:ops) = do
  stroke $! Rectangle 10.0 10.0  10.0 10.0
  doPage ops   

 doPage l = return l

 doOps [] = return ()

 doOps (Ship:ops) = {-# SCC "OPSHIP" #-}  do
 p <- addPage Nothing
 ops' <- drawWithPage p $! do
   strokeColor red
   applyMatrix $ (translate 72.0 72.0)
   doPage ops
 doOps ops'

 doOps (op:_) = error ("unexpected " ++ show op)

 testpdf =  do
let ops = concat $ replicate 100 (Ship : (replicate 1000 Rect )) 
pageRect = PDFRect 0 0 
   (floor $ 21.0/2.54*72.0) (floor $ 29.7/2.54*72.0)
runPdf "test1.pdf" (standardDocInfo { author=toPDFString "mgubi",
  compressed = False})
pageRect $ doOps ops


 testpdf' = do  
let pageRect = PDFRect 0 0 
   (floor $ 21.0/2.54*72.0) (floor $ 29.7/2.54*72.0)
runPdf "full.pdf" (standardDocInfo { author=toPDFString "mgubi",
 compressed = False}) 
 pageRect $ sequence_ $ foldM f [] $ replicate 100 $ 
  (\p -> sequence_ $ replicate 1000 $
   drawWithPage p $ stroke $ 
Rectangle 0.0 0.0 10.0 10.0)
 where f ps acts = do
  p <- addPage Nothing
 acts p
 return $ p:ps

 main = testpdf


now, if I run testpdf' then memory profile is very low and everything
is as expected while if I run testpdf  then the profile grows up to
80MB and more. This is the stripped down version of the original
program  (which is a DVI interpreter) so there I will have also some
StateT and more complicated opcodes. I would  like to know what is
wrong with the above code. Could someone help me? 

thanks,

Massimiliano Gubinelli


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Space Leak Help

2007-02-04 Thread Dominic Steinitz
On Saturday 03 February 2007 19:56, Pepe Iborra wrote:
 pad :: [Word8] -> [Word8]
 pad xs = pad' xs 0

 pad' (x:xs) l = x : pad' xs (succ l)
 pad' [] l = [0x80] ++ ps ++ lb
     where
        pl = (64-(l+9)) `mod` 64
        ps = replicate pl 0x00
        lb = i2osp 8 (8*l)
Pepe,

Thanks but this gives me a different problem

[EMAIL PROTECTED]:~/sha1 ./allInOne 101 +RTS -hc -RTS
[2845392438,1191608682,3124634993,2018558572,2630932637]
[2224569924,473682542,3131984545,4182845925,3846598897]
Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize' to increase it.

Someone suggested

pad :: Num a => [a] -> [a]
pad = pad' 0
  where pad' !l [] = [0x80] ++ ps ++ lb
  where pl = (64-(l+9)) `mod` 64
ps = replicate pl 0x00
lb = i2osp 8 (8*l)
pad' !l (x:xs) = x : pad' (l+1) xs

but that didn't compile

*Main :r
[2 of 2] Compiling Main ( allInOne.hs, interpreted )

allInOne.hs:83:14: Parse error in pattern
Failed, modules loaded: Codec.Utils.

Dominic.

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Space Leak Help

2007-02-04 Thread Stefan O'Rear
On Sun, Feb 04, 2007 at 08:20:23AM +, Dominic Steinitz wrote:
 Someone suggested
 
 pad :: Num a => [a] -> [a]
 pad = pad' 0
   where pad' !l [] = [0x80] ++ ps ++ lb
   where pl = (64-(l+9)) `mod` 64
 ps = replicate pl 0x00
 lb = i2osp 8 (8*l)
 pad' !l (x:xs) = x : pad' (l+1) xs
 
 but that didn't compile
 
 *Main :r
 [2 of 2] Compiling Main ( allInOne.hs, interpreted )
 
 allInOne.hs:83:14: Parse error in pattern
 Failed, modules loaded: Codec.Utils.
 
 Dominic.

The '!' is a GHC extension, enabled using the flag '-fbang-patterns'.

Equivalently, you can use Haskell 98's seq :

pad :: Num a => [a] -> [a]
pad = pad' 0
  where pad' l [] | l `seq` False = undefined
pad' l [] = [0x80] ++ ps ++ lb
  where pl = (64-(l+9)) `mod` 64
ps = replicate pl 0x00
lb = i2osp 8 (8*l)
pad' l (x:xs) = x : pad' (l+1) xs

The first alternative never succeeds, but to see that, the compiler
must force the evaluation of 'l'.
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Space Leak Help

2007-02-04 Thread Dominic Steinitz
On Saturday 03 February 2007 19:42, [EMAIL PROTECTED] wrote:
   I have re-written SHA1 so that is more idiomatically haskell and it is
   easy to see how it implements the specification. The only problem is I
   now have a space leak. I can see where the leak is but I'm less sure
   what to do about getting rid of it.
  
   Here's the offending function:
  
   pad :: [Word8] -> [Word8]
   pad xs =
  xs ++ [0x80] ++ ps ++ lb
  where
 l = length xs
 pl = (64-(l+9)) `mod` 64
 ps = replicate pl 0x00
 lb = i2osp 8 (8*l)

 I would try something along the following lines (untested):

 \begin{spec}
 catWithLen xs f = xs ++ f (length xs)
 \end{spec}

 \begin{code}
 catWithLen :: [a] -> (Int -> [a]) -> [a]
 catWithLen xs f = h 0 xs
   where
 h k [] = f k
 h k (x : xs) = case succ k of   -- forcing evaluation
  k' -> x : h k' xs
 \end{code}

 \begin{code}
 pad :: [Word8] -> [Word8]
 pad xs = catWithLen xs f
   where
 f l = 0x80 : ps lb
   where
  -- we know that |l = length xs|
  pl = (64-(l+9)) `mod` 64
  ps = funPow pl (0x00 :)
  lb = i2osp 8 (8*l)
 \end{code}

 If you are lucky, then the replicate and the (++lb) in the original code
 might be fused by the compiler as an instance of foldr-build
 or something related --- check the optimised core code.

Wolfram,

Thanks but this gives a different problem:

[EMAIL PROTECTED]:~/sha1 ./allInOne 101 +RTS -hc -RTS
[2845392438,1191608682,3124634993,2018558572,2630932637]
[2224569924,473682542,3131984545,4182845925,3846598897]
Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize' to increase it.

Dominic.

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Space Leak Help

2007-02-04 Thread Stefan O'Rear
On Sun, Feb 04, 2007 at 08:30:44AM +, Dominic Steinitz wrote:
 On Saturday 03 February 2007 19:42, [EMAIL PROTECTED] wrote:
  I would try something along the following lines (untested):
 
  \begin{spec}
  catWithLen xs f = xs ++ f (length xs)
  \end{spec}
 
  \begin{code}
  catWithLen :: [a] -> (Int -> [a]) -> [a]
  catWithLen xs f = h 0 xs
where
  h k [] = f k
  h k (x : xs) = case succ k of   -- forcing evaluation
   k' -> x : h k' xs

Nice try.  k', as a variable binding, is irrefutable.

  \end{code}
 
  \begin{code}
  pad :: [Word8] -> [Word8]
  pad xs = catWithLen xs f
where
  f l = 0x80 : ps lb
where
   -- we know that |l = length xs|
   pl = (64-(l+9)) `mod` 64
   ps = funPow pl (0x00 :)
   lb = i2osp 8 (8*l)
  \end{code}

 Thanks but this gives a different problem:
 
 [EMAIL PROTECTED]:~/sha1 ./allInOne 101 +RTS -hc -RTS
 [2845392438,1191608682,3124634993,2018558572,2630932637]
 [2224569924,473682542,3131984545,4182845925,3846598897]
 Stack space overflow: current size 8388608 bytes.
 Use `+RTS -Ksize' to increase it.

Expected result of the excessive laziness above.
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Space Leak Help

2007-02-04 Thread Dominic Steinitz
On Sunday 04 February 2007 08:28, Stefan O'Rear wrote:
 On Sun, Feb 04, 2007 at 08:20:23AM +, Dominic Steinitz wrote:
  Someone suggested
 
  pad :: Num a => [a] -> [a]
  pad = pad' 0
where pad' !l [] = [0x80] ++ ps ++ lb
where pl = (64-(l+9)) `mod` 64
  ps = replicate pl 0x00
  lb = i2osp 8 (8*l)
  pad' !l (x:xs) = x : pad' (l+1) xs
 
  but that didn't compile
 
  *Main :r
  [2 of 2] Compiling Main ( allInOne.hs, interpreted )
 
  allInOne.hs:83:14: Parse error in pattern
  Failed, modules loaded: Codec.Utils.
 
  Dominic.

 The '!' is a GHC extension, enabled using the flag '-fbang-patterns'.

The test program runs to completion but still has a space leak consuming over 
25m.

 Equivalently, you can use Haskell 98's seq :

 pad :: Num a => [a] -> [a]
 pad = pad' 0
   where pad' l [] | l `seq` False = undefined
 pad' l [] = [0x80] ++ ps ++ lb
   where pl = (64-(l+9)) `mod` 64
 ps = replicate pl 0x00
 lb = i2osp 8 (8*l)
 pad' l (x:xs) = x : pad' (l+1) xs

 The first alternative never succeeds, but to see that the compiler
 must force the evaluation of 'l'.

[EMAIL PROTECTED]:~/sha1 ./allInOne 101 +RTS -hc -RTS
[2845392438,1191608682,3124634993,2018558572,2630932637]
[2224569924,473682542,3131984545,4182845925,3846598897]
Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize' to increase it.

Dominic.

PS I appreciate all the help I'm getting.

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Space Leak Help

2007-02-04 Thread Stefan O'Rear
On Sun, Feb 04, 2007 at 09:45:12AM +, Dominic Steinitz wrote:
  pad :: Num a => [a] -> [a]
  pad = pad' 0
where pad' l [] | l `seq` False = undefined

Stupid typo, that should be:

  where pad' l _ | l `seq` False = undefined

  pad' l [] = [0x80] ++ ps ++ lb
where pl = (64-(l+9)) `mod` 64
  ps = replicate pl 0x00
  lb = i2osp 8 (8*l)
  pad' l (x:xs) = x : pad' (l+1) xs
 
  The first alternative never succeeds, but to see that the compiler
  must force the evaluation of 'l'.
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Space Leak Help

2007-02-04 Thread kahl
  
   \begin{code}
   catWithLen :: [a] -> (Int -> [a]) -> [a]
   catWithLen xs f = h 0 xs
 where
   h k [] = f k
   h k (x : xs) = case succ k of   -- forcing evaluation
k' -> x : h k' xs
   \end{code}
  
  
  Thanks but this gives a different problem:
  
  [EMAIL PROTECTED]:~/sha1 ./allInOne 101 +RTS -hc -RTS
  [2845392438,1191608682,3124634993,2018558572,2630932637]
  [2224569924,473682542,3131984545,4182845925,3846598897]
  Stack space overflow: current size 8388608 bytes.
  Use `+RTS -Ksize' to increase it.


Does it still do that if you use seq instead of case?


\begin{code}
catWithLen :: [a] -> (Int -> [a]) -> [a]
catWithLen xs f = h 0 xs
  where
h k [] = f k
h k (x : xs) =  let k' = succ k
in k' `seq` x : h k' xs
\end{code}


Wolfram
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] Space Leak Help

2007-02-03 Thread Dominic Steinitz
I have re-written SHA1 so that it is more idiomatic Haskell and it is easy to 
see how it implements the specification. The only problem is I now have a 
space leak. I can see where the leak is but I'm less sure what to do about 
getting rid of it.

Here's the offending function:

pad :: [Word8] -> [Word8]
pad xs =
   xs ++ [0x80] ++ ps ++ lb
   where
  l = length xs
  pl = (64-(l+9)) `mod` 64
  ps = replicate pl 0x00
  lb = i2osp 8 (8*l)

I've thought about zipping the xs with [1..], which will give me a length as I 
go. Is this the right way to go, or are there better techniques for dealing 
with this?
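
For what it is worth, here is a rough sketch of that zipping idea (it still
needs the strictness discussed elsewhere in this thread, and it reuses i2osp
from Codec.Utils, which the module below already imports):

-- Pair each byte with a running index so the length is available when the
-- end of the list is reached, instead of a separate 'length xs' that keeps
-- the whole input alive.
padZipped :: [Word8] -> [Word8]
padZipped xs = go (zip xs [1 ..])
   where
   go []            = trailer 0
   go [(x, l)]      = x : trailer l
   go ((x, _) : ys) = x : go ys
   trailer l = [0x80] ++ replicate ((64-(l+9)) `mod` 64) 0x00 ++ i2osp 8 (8*l)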

I've attached the full source below.

Dominic.

module Main(main) where

import Data.Char
import Data.Bits
import Data.List
import Data.Word
import System
import Codec.Utils

type Rotation = Int

rotL :: Rotation -> Word32 -> Word32
rotL s a = shiftL a s .|. shiftL a (s-32)

instance Num [Word32] where
   a + b = zipWith (+) a b

f n x y z
   | (0 <= n)  && (n <= 19) = (x .&. y) .|. ((complement x) .&. z)
   | (20 <= n) && (n <= 39) = x `xor` y `xor` z
   | (40 <= n) && (n <= 59) = (x .&. y) .|. (x .&. z) .|. (y .&. z)
   | (60 <= n) && (n <= 79) = x `xor` y `xor` z
   | otherwise = error "invalid index for f"

k n
   | (0 <= n)  && (n <= 19) = 0x5a827999
   | (20 <= n) && (n <= 39) = 0x6ed9eba1
   | (40 <= n) && (n <= 59) = 0x8f1bbcdc
   | (60 <= n) && (n <= 79) = 0xca62c1d6
   | otherwise = error "invalid index for k"

-- Word120 -> Word512 -> Word120
oneBlock ss xs = (as!!80):(bs!!80):(cs!!80):(ds!!80):(es!!80):[]
   where
  ws = xs ++ map (rotL 1) (zipWith4 xxxor wm3s wm8s wm14s wm16s)
 where 
xxxor a b c d = a `xor` b `xor` c `xor` d
wm3s  = drop (16-3)  ws
wm8s  = drop (16-8)  ws
wm14s = drop (16-14) ws
wm16s = drop (16-16) ws
  as = (ss!!0):ts
  bs = (ss!!1):as
  cs = (ss!!2):(map (rotL 30) bs)
  ds = (ss!!3):cs 
  es = (ss!!4):ds
  ts = (map (rotL 5) as) + (zipWith4 f [0..] bs cs ds) + es + (map k [0..]) + ws

ss :: [Word32]
ss = [0x67452301, 0xefcdab89, 0x98badcfe, 0x10325476, 0xc3d2e1f0]

pad :: [Word8] -> [Word8]
pad xs =
   xs ++ [0x80] ++ ps ++ lb
   where
  l = length xs
  pl = (64-(l+9)) `mod` 64
  ps = replicate pl 0x00
  lb = i2osp 8 (8*l)

blockWord8sIn512 :: [Word8] - [[Word8]]
blockWord8sIn512 =
   unfoldr g
   where
  g [] = Nothing
  g xs = Just (splitAt 64 xs)

getWord32s :: [Word8] -> [Word32]
getWord32s s = 
   map f [0..15]
   where 
  f i = foldl (+) 0 $ map (\n -> toEnum (fromEnum (s!!(i*4+n))) `shiftL` 
(fromIntegral (8 * (3-n)))) [0..3]

blockWord32sIn512 :: [Word8] -> [[Word32]]
blockWord32sIn512 = (map getWord32s) . blockWord8sIn512 . pad

-- Word120 -> Word512 -> Word120
hashOnce ss a = ss + oneBlock ss a

hash = foldl' hashOnce ss . blockWord32sIn512

convert :: String -> [Word8]
convert = map (fromIntegral . ord)

short :: [Word8]
short = convert "abc"

message :: [Word8]
message = convert "abcdbcdecdefdefgefghfghighijhijkijkljklmklmnlmnomnopnopq"

performance n =
   (convert . take n . repeat) 'a'

test n = mapM_ (putStrLn . show . hash) [short, message, performance n]

main =
   do progName <- getProgName
      args <- getArgs
      if length args /= 1
         then putStrLn ("Usage: " ++ progName ++ " testSize")
         else test (read (args!!0))



___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Space Leak Help

2007-02-03 Thread kahl
  
  I have re-written SHA1 so that is more idiomatically haskell and it is easy 
  to 
  see how it implements the specification. The only problem is I now have a 
  space leak. I can see where the leak is but I'm less sure what to do about 
  getting rid of it.
  
  Here's the offending function:
  
  pad :: [Word8] -> [Word8]
  pad xs =
 xs ++ [0x80] ++ ps ++ lb
 where
l = length xs
pl = (64-(l+9)) `mod` 64
ps = replicate pl 0x00
lb = i2osp 8 (8*l)


I would try something along the following lines (untested):

\begin{spec}
catWithLen xs f = xs ++ f (length xs)
\end{spec}

\begin{code}
catWithLen :: [a] -> (Int -> [a]) -> [a]
catWithLen xs f = h 0 xs
  where
h k [] = f k
h k (x : xs) = case succ k of   -- forcing evaluation
 k' -> x : h k' xs
\end{code}

\begin{code}
pad :: [Word8] -> [Word8]
pad xs = catWithLen xs f
  where
f l = 0x80 : ps lb
  where
 -- we know that |l = length xs|
 pl = (64-(l+9)) `mod` 64
 ps = funPow pl (0x00 :)
 lb = i2osp 8 (8*l)
\end{code}

If you are lucky, then the replicate and the (++lb) in the original code
might be fused by the compiler as an instance of foldr-build
or something related --- check the optimised core code. 

In my variant I changed this to rely on efficient function powering
instead:

\begin{spec}
funPow k f = foldr (.) id $ replicate k f
\end{spec}

\begin{code}
funPow :: Int -> (a -> a) -> (a -> a)
funPow n f = case compare n 0 of
    LT -> error ("funPow: negative argument: " ++ show n)
    EQ -> id
    GT -> pow n f
  where
    pow m g = if m > 1
                then let (m',r) = divMod m 2
                         g' = g . g
                     in if r == 0
                          then pow m' g'
                          else pow m' g' . g
                else g
\end{code}

(You will probably also consider using Data.Bits
 for (`mod` 64), (8*), and (`divMod` 2).
)
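
A small sketch of those Data.Bits equivalents, assuming the operands are
non-negative Ints as they are in pad and funPow:

import Data.Bits ((.&.), shiftL, shiftR)

mod64, times8 :: Int -> Int
mod64  n = n .&. 63            -- n `mod` 64, since 64 is a power of two
times8 n = n `shiftL` 3        -- 8 * n

divMod2 :: Int -> (Int, Int)
divMod2 n = (n `shiftR` 1, n .&. 1)   -- n `divMod` 2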


Wolfram
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Space Leak Help

2007-02-03 Thread Pepe Iborra

hi Dominic

Explicit recursion works just fine for me and keeps things simple:

pad :: [Word8] -> [Word8]
pad xs = pad' xs 0

pad' (x:xs) l = x : pad' xs (succ l)
pad' [] l = [0x80] ++ ps ++ lb
   where
  pl = (64-(l+9)) `mod` 64
  ps = replicate pl 0x00
  lb = i2osp 8 (8*l)


at the cost of (very slightly) hiding data flow.
Seems exactly what you were trying to avoid?
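
A further tweak, not in the message above: the counter l is only forced at the
very end, so it accumulates a chain of succ thunks of its own; a seq keeps it
evaluated as the list is produced.  A sketch, reusing the i2osp from the
original module:

pad :: [Word8] -> [Word8]
pad xs = pad' xs 0

pad' (x:xs) l = l `seq` x : pad' xs (succ l)
pad' [] l = [0x80] ++ ps ++ lb
   where
      pl = (64-(l+9)) `mod` 64
      ps = replicate pl 0x00
      lb = i2osp 8 (8*l)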

Cheers
pepe




[Haskell-cafe] Space leak whilst implementing streams

2006-08-26 Thread ephemeral . elusive

Hello,

I have been using arrows to implement stream processors. At first, I
tried using the implementation presented in John Hughes' AFP arrows
lectures. However, this appeared to have a space leak in its
implementation of the left operator for ArrowChoice.

I found a way to remove this space leak, however, I do not really
understand why there was a space leak in the first place. I would
really appreciate any light that could be shed on this.

Below I include the AFP implementation (SF) and a modified
implementation that no longer has the space leak (SF'). I am using ghc
6.4.2.

Many thanks.



import Control.Arrow
import Data.Maybe


main  = test
test  = print (runSF  p (repeat 1))
test' = print (runSF' p (repeat (Just 1)))

-- heap profile appears to grow linearly for test, but not test'
p :: ArrowChoice a => a Int (Either Int Int)
p = arr Right >>> left (arr id)


newtype SF a b = SF {runSF :: [a] -> [b]}

instance Arrow SF where
  arr f
      = SF (map f)
  SF f >>> SF g
      = SF (f >>> g)
  first (SF f)
      = SF (unzip >>> first f >>> uncurry zip)

instance ArrowChoice SF where
  left (SF f)
      = SF (\xs -> combine xs (f [y | Left y <- xs]))
        where combine (Left _:xs)  (z:zs) = Left z : combine xs zs
              combine (Right r:xs) zs     = Right r : combine xs zs
              combine []           _      = []


-- SF' does not exhibit the space leak
newtype SF' a b = SF' {runSF' :: [Maybe a] -> [Maybe b]}

instance Arrow SF' where
  arr f
      = SF' (map (maybe Nothing (Just . f)))
  SF' f >>> SF' g
      = SF' (f >>> g)
  first (SF' f)
      = SF' (maybe_unzip >>> first f >>> uncurry maybe_zip)
        where maybe_unzip = foldr mu ([],[])
                where mu Nothing      ~(xs,ys) = (Nothing:xs, Nothing:ys)
                      mu (Just (x,y)) ~(xs,ys) = (Just x:xs,  Just y:ys)

              maybe_zip = zipWith mz
                where mz (Just x) (Just y) = Just (x,y)
                      mz Nothing  Nothing  = Nothing

instance ArrowChoice SF' where
  left (SF' f)
      = SF' (\xs -> xs `combined` f (map drop_right xs))
        where combined = zipWith merge_left

              drop_right Nothing          = Nothing
              drop_right (Just (Left l))  = Just l
              drop_right (Just (Right _)) = Nothing

              merge_left Nothing          Nothing  = Nothing
              merge_left (Just (Left _))  (Just x) = Just (Left x)
              merge_left (Just (Right r)) Nothing  = Just (Right r)


Re: [Haskell-cafe] Space leak whilst implementing streams

2006-08-26 Thread Udo Stenzel
[EMAIL PROTECTED] wrote:
 I found a way to remove this space leak, however, I do not really
 understand why there was a space leak in the first place. I would
 really appreciate any light that could be shed on this.
 
 instance ArrowChoice SF where
  left (SF f)
      = SF (\xs -> combine xs (f [y | Left y <- xs]))
        where combine (Left _:xs)  (z:zs) = Left z : combine xs zs
              combine (Right r:xs) zs     = Right r : combine xs zs
              combine []           _      = []

The list comprehension is holding onto 'xs' until something of the 'zs'
is consumed.  That only happens when 'xs' contains 'Left _', and your
input stream doesn't.  So the whole stream is leaked, which indeed
produces a linear space profile.  The easiest way to correct it is to
consume the list 'xs' only once:

instance ArrowChoice SF where
  left (SF f) = SF (map f')
    where f' (Left x)  = Left (f x)
          f' (Right y) = Right y

or simply

instance ArrowChoice SF where
 left (SF f) = SF (map (left f))


 instance ArrowChoice SF' where
  left (SF' f)
      = SF' (\xs -> xs `combined` f (map drop_right xs))
        where combined = zipWith merge_left

Here you're also risking the leak of a whole copy of 'xs', but since you
end up consuming both lists at the same pace, nothing bad happens.
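
A minimal standalone illustration of that point (not from the message above):
two traversals of one list are harmless when they advance in lock-step, but a
reference that lags behind retains everything the leading traversal has
already consumed.

inStep :: [Int] -> [Int]
inStep xs = zipWith (+) xs (map (*2) xs)   -- both views of xs advance together

lagging :: [Int] -> [Int]
lagging xs = map (*2) xs ++ xs             -- the trailing xs keeps the whole
                                           -- list alive until the first copy
                                           -- has been consumed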


Udo.
-- 
Not everything that limps is a comparison.




[Haskell-cafe] Space leak when returning pairs?

2006-05-19 Thread Shin-Cheng Mu

Dear members,

I am experiencing a space leak, which I suspect to be
an instance of the problem addressed by Wadler before.
I'd appreciate if someone here would take a look.

Given the following datatype:

 data XMLEvent = StartEvent String
   | EndEvent   String
   | TextEvent  String  deriving Show

where the three constructors represent the start tag
(<a>, where a is a string), end tag (</a>), and text data,
respectively, of an XML stream. (They are actually from
the library HXML). The following function
simply returns the same stream while doing a minimal
amount of validation (ignoring the closing tag).

  idX :: [XMLEvent] -> ([XMLEvent], [XMLEvent])
  idX [] = ([], [])
  idX (StartEvent a : strm) =
    let (ts, strm') = idX strm
        (us, strm'') = idX strm'
    in (StartEvent a [] : ts ++ EndEvent a : us, strm'')
  idX (EndEvent _: strm) = ([], strm)
  idX (TextEvent s : strm) =
    let (ts, strm') = idX strm
    in (TextEvent s : ts, strm')

The function idX returns a pair, where the first component
is the processed stream, while the second component is the
rest of the input. The intention is to thread the input
and release processed events.

If the function is used in a pipelined manner:

   print . fst . idX . parseInstance $ input

where parseInstance :: String -> [XMLEvent] is a
lexical analyser, I would expect that the input stream will
be consumed one by one as the output is produced, and the
program would use a constant amount of heap space (while
keeping a program stack proportional to the depth of the
XML tree). This was not the case, however. For some reason
the heap grew linearly. The GHC profiler showed that most
of the thunks are created by parseInstance, which implies
that the entire input stream somehow still resides in memory.

I was wondering where the space leak came from and suspected
that it's the leak described in one of Philip Wadler's early papers,
"Fixing Some Space Leaks with a Garbage Collector" (1987).
Consider

idX (StartEvent a : strm) =
  let (ts, strm') = idX strm
      (us, strm'') = idX strm'
  in (StartEvent a [] : ts ++ EndEvent a : us, strm'')

The body of the let clause might have actually been treated as

 (StartEvent a [] : ts ++ EndEvent a : fst (idX (snd strm)),
  snd (idX (snd strm)))

Therefore strm will not be freed until the whole evaluation
finishes.

But since Wadler has addressed this problem a long time ago,
I think the fix he proposed should have been implemented in
GHC already. Was that really the source of the space leak?
If so is there a way to fix it? Or is there another reason
for the leak?

   *   *   *

The function idX above actually resulted from fusing two functions:
buildF parses a stream into a forest, while idF flattens the
tree.

   data ETree = Element String [ETree]
              | Text String

   buildF :: [XMLEvent] -> ([ETree], [XMLEvent])
   buildF [] = ([],[])
   buildF (StartEvent a : strm) =
     let (ts, strm') = buildF strm
         (us, strm'') = buildF strm'
     in (Element a ts : us, strm'')
   buildF (EndEvent _ : strm) = ([], strm)
   buildF (TextEvent s : strm) =
     let (ts, strm') = buildF strm
     in (Text s : ts, strm')

   idF :: [ETree] -> [XMLEvent]
   idF [] = []
   idF (Element a ts : us) =
     StartEvent a : idF ts ++ EndEvent a : idF us
   idF (Text s : ts) = TextEvent s : idF ts

My original program was like

print . idF . fst . buildF . parseInstance $ input

and it had the same space leak.

By accident (assuming that the input is always a single tree),
I discovered that if I added a wrap . head into the pipe:

print . idF . wrap . head . fst . buildF . parseInstance $ input

where wrap a = [a], the heap residency would reduce by half!
The output remains the same.

My explanation is that applying a head means that
(in the definition of buildF):

buildF (StartEvent a : strm) =
 let (ts, strm') = buildF strm
     (us, strm'') = buildF strm'
 in (Element a ts : us, strm'')

the us part, which contains a reference to strm, need not
be kept. Thus the reduced heap residency.

This seems to support that the space leak resulted from
the same problem Wadler addressed before. But isn't
that solved already in GHC?

   *   *   *

I'd appreciate if someone could look into it. The actual program
is available at

   http://www.psdlab.org/~scm/hx.tar.gz

It actually uses HXML, a library by Joe (Joe English?), to
do the parsing. The main program is hxpc.hs. There is a
1 MB sample input file, size1m.xml. There are two
simple scripts, "recompile" and "runhxpc", for compiling
and running the program.

Thank you!

sincerely,
Shin-Cheng Mu




Re: [Haskell-cafe] Space leak when returning pairs?

2006-05-19 Thread Henning Thielemann

On Fri, 19 May 2006, Shin-Cheng Mu wrote:

idX :: [XMLEvent] -> ([XMLEvent], [XMLEvent])
idX [] = ([], [])
idX (StartEvent a : strm) =
  let (ts, strm') = idX strm
      (us, strm'') = idX strm'
  in (StartEvent a [] : ts ++ EndEvent a : us, strm'')
idX (EndEvent _: strm) = ([], strm)
idX (TextEvent s : strm) =
  let (ts, strm') = idX strm
  in (TextEvent s : ts, strm')


let ~(ts, strm') = idX strm
    ~(us, strm'') = idX strm'

?



Re: [Haskell-cafe] Space leak when returning pairs?

2006-05-19 Thread Shin-Cheng Mu

Dear Henning,

On May 19, 2006, at 6:16 PM, Henning Thielemann wrote:

On Fri, 19 May 2006, Shin-Cheng Mu wrote:

   idX :: [XMLEvent] -> ([XMLEvent], [XMLEvent])
   idX (StartEvent a : strm) =
     let (ts, strm') = idX strm
         (us, strm'') = idX strm'
     in (StartEvent a [] : ts ++ EndEvent a : us, strm'')

let ~(ts, strm') = idX strm
    ~(us, strm'') = idX strm'


Oh, I just tried using lazy patterns in the cases for StartEvent
and TextEvent, but the space behaviour remained the same. :(

sincerely,
Shin-Cheng Mu


Re: [Haskell-cafe] Space leak when returning pairs?

2006-05-19 Thread Malcolm Wallace
Henning Thielemann [EMAIL PROTECTED] wrote:

 let ~(ts, strm') = idX strm
     ~(us, strm'') = idX strm'

Let-bindings are already lazy, so the ~ here is redundant.
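
A tiny example of the distinction (not from the message above):

lazyLet    = let  (_a, _b) = (undefined :: (Int, Int)) in 0     -- 0: let pattern bindings are lazy
strictCase = case (undefined :: (Int, Int)) of  (_a, _b) -> 0   -- blows up: the case forces the pair
lazyCase   = case (undefined :: (Int, Int)) of ~(_a, _b) -> 0   -- 0 again, thanks to the '~'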

Regards,
Malcolm


Re: [Haskell-cafe] Space leak when returning pairs?

2006-05-19 Thread Shin-Cheng Mu


On May 19, 2006, at 6:16 PM, Henning Thielemann wrote:

let ~(ts, strm') = idX strm
    ~(us, strm'') = idX strm'


I seem to have found a partial solution to the problem.
It's rather ugly, however, and I think there should be
a better way.

The original definition for one of the clauses was like
this:

  idX (StartEvent a : strm) =
    let (ts, strm') = idX strm
        (us, strm'') = idX strm'
    in (StartEvent a [] : ts ++ EndEvent a : us, strm'')

The intention was to thread strm through the recursive calls
and free the items in strm as soon as they are processed. However,
with this definition I needed about 8Mb of heap for a 1Mb input
file. Since I suspected that delayed evaluation of us and strm''
keeps the pointer to strm around for too long, the natural solution
seemed to be forcing their evaluation. So I tried:

  idX (StartEvent a : strm) =
      case idX strm of
         (ts, strm') ->
            case idX strm' of
               (us, strm'') ->
                  (StartEvent a [] : ts ++ EndEvent a : us, strm'')

which made things worse. The heap size increased a bit more,
and I got a typical bell-shaped heap profile suggesting idX
cannot output anything before processing the whole input.

So I had to allow idX to output something while consuming
the input. What eventually worked was this messy mixture
of let and case:

  idX (StartEvent a : strm) =
    let (xs, rest) =
          case idX strm of
            (ts, strm') ->
              let (ws, rest) =
                    case idX strm' of
                      (us, strm'') -> (us, strm'')
              in (ts ++ EndEvent a : ws, rest)
    in (StartEvent a [] : xs, rest)

The heap residency dropped from about 8 MB to around 80 KB.

This is rather tricky, however. The following nicer-looking
version has a heap profile as bad as the worst case.

   idX (StartEvent a : strm) =
     let (ts, us, strm'') =
           case idXsp strm of
             (ts, strm') ->
               case idXsp strm' of
                 (us, strm'') -> (ts, us, strm'')
     in (StartEvent a [] : ts ++ EndEvent a : us, strm'')

And I cannot quite explain why (Anyone?).

Is there a more structured solution? Can I leave the hard
work to the compiler?

sincerely,
Shin-Cheng Mu


Re: [Haskell-cafe] Space leak when returning pairs?

2006-05-19 Thread Chris Kuklewicz
Shin-Cheng Mu wrote:
 Dear members,
 
 I am experiencing a space leak, which I suspect to be
 an instance of the problem addressed by Wadler before.
 I'd appreciate if someone here would take a look.
 
 Given the following datatype:
 
  data XMLEvent = StartEvent String
| EndEvent   String
| TextEvent  String  deriving Show
 
 where the three constructors represent the start tag
 (<a>, where a is a string), end tag (</a>), and text data,
 respectively, of an XML stream. (They are actually from
 the library HXML). The following function
 simply returns the same stream while doing a minimal
 amount of validation (ignoring the closing tag).
 
   idX :: [XMLEvent] -> ([XMLEvent], [XMLEvent])
   idX [] = ([], [])
   idX (StartEvent a : strm) =
     let (ts, strm') = idX strm
         (us, strm'') = idX strm'
     in (StartEvent a [] : ts ++ EndEvent a : us, strm'')
   idX (EndEvent _: strm) = ([], strm)
   idX (TextEvent s : strm) =
     let (ts, strm') = idX strm
     in (TextEvent s : ts, strm')
 
 The function idX returns a pair, where the first component
 is the processed stream, while the second component is the
 rest of the input. The intention is to thread the input
 and release processed events.
 
 If the function is used in a pipelined manner:
 
print . fst . idX . parseInstance $ input
 

I would not have written idX in this manner.  My version is

 data XMLEvent = StartEvent String
   | EndEvent   String
   | TextEvent  String  deriving Show 
 
 idX' :: [XMLEvent] -> [XMLEvent]
 idX' = untilTags []

 untilTags :: [String] -> [XMLEvent] -> [XMLEvent]
 untilTags [] [] = []
 untilTags tags [] = error ("Did not find closing EndEvents " ++ show tags)
 untilTags tags (x:xs) =
   case x of
     StartEvent tag' -> x : untilTags (tag':tags) xs
     EndEvent tag' ->
       if null tags
         then error ("Unexpected EndEvent with tag " ++ show tag')
         else let (tag:tags') = tags
              in if tag == tag'
                   then x : untilTags tags' xs
                   else error ("Expected EndEvents " ++ show tags ++
                               " but found EndEvent " ++ show tag')
     TextEvent _ -> x : untilTags tags xs

Here the flow of the input x : ... to the output x : ... is more obvious,
and the open tags are kept in a stack and used to check the closing tags.
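
A quick sanity check, not from the message above:

 -- idX' [StartEvent "a", TextEvent "x", StartEvent "b", EndEvent "b", EndEvent "a"]
 --   returns the same five events unchanged
 -- idX' [StartEvent "a", TextEvent "x"]
 --   raises: Did not find closing EndEvents ["a"]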

The memory usage will be only slightly higher than your idX due to keeping the
list of open tag names.  If you want to remove that overhead and assume the
closing tags carry the correct strings, then you can keep a mere count of the
open tags:

 countTags :: Int -> [XMLEvent] -> [XMLEvent]
 countTags 0 [] = []
 countTags n [] = error ("Did not find the closing EndEvents " ++ show n)
 countTags n (x:xs) =
   case x of
     StartEvent tag' -> x : countTags (succ n) xs
     EndEvent tag' ->
       if 0 == n
         then error ("Unexpected EndEvent with tag " ++ show tag')
         else x : countTags (pred n) xs
     TextEvent _ -> x : countTags n xs


Re: [Haskell-cafe] Space leak when returning pairs?

2006-05-19 Thread Malcolm Wallace
Shin-Cheng Mu [EMAIL PROTECTED] wrote:

 I was wondering where the space leak came from and suspected
 that it's the leak described in one of Philip Wadler's early papers,
 "Fixing Some Space Leaks with a Garbage Collector" (1987).

 But since Wadler has addressed this problem a long time ago,
 I think the fix he proposed should have been implemented in
 GHC already. Was that really the source of the space leak?
 If so is there a way to fix it? Or is there another reason
 for the leak?

I compiled and profiled your code using nhc98 instead of ghc.  There was
no space leak visible.  Since nhc98 does implement the Wadler GC fix
(with refinements by Sparud), I can only assume that this does indeed
account for the space improvement over ghc.

Regards,
Malcolm