Re: [Haskell-cafe] hamming distance allocation
Hello John: Well I could use a packed type. The only letters that will be found in the string are ATCG so yeah I don't need unicode and those things. Will try out with vector or ByteString. Thanks! :) On Mon, Apr 19, 2010 at 2:37 PM, John Lato jwl...@gmail.com wrote: Subject: Re: [Haskell-cafe] hamming distance allocation Am Montag 19 April 2010 01:03:14 schrieb Arnoldo Muller: Hello all: I want to generate some hamming distance statistics about a set of strings. As explained in another e-mail in this list, I used the following code to call the functions: (exampl holds the list of strings of size w) filter (\x - x /= 0) $ map (uncurry hammingX) [(xs, ys) | xs - exampl, ys - exampl] I have two hamming functions: -- hamming distance for variable length strings hamming :: String - String - Int hamming x y = hamming' x y 0 where hamming' [] _ !c = c hamming' _ [] !c = c hamming' (x:xs) (y:ys) !c | x == y = hamming' xs ys c | otherwise = hamming' xs ys (c + 1) -- function posted in this mailing list hamming2 :: String - String - Int hamming2 xs ys = length (filter not (zipWith (==) xs ys)) I am executing these functions millions of times and the bottleneck of my program is in them as explained by running in profiling mode with +RTS -K400M -p -RTS The costlier function is the hamming distance COST CENTREMODULE %time %alloc hammingDistances 66.6 41.9 It says that it is performing 41% of the allocations. In the case of hamming2 the allocations go as far as 52%. Allocations are cheap, so that's not necessarily a problem. More important is, what's the maximum residency and how much is copied during GC? Are you compiling with -O2 ? I could understand that there are allocations in hamming2 because we are creating pairs, but in the case of hamming there should be no allocation. Why not? I don't know how GHC counts allocations, but everytime you go from (x:xs) to xs, you need a new pointer to the tail. If that counts as allocation, hamming must allocate a lot, too. Is it really necessary to use Strings? I think a packed type, e.g. Vector or ByteString, would be much more efficient here. Of course this is only likely to be a benefit if you can move away from String entirely. I suspect that hamming2 would perform better then. John ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] hamming distance allocation
The strings will not be longer than 30 characters. I am doing sets of 2000 (total of 2000^2 distance computations) I am expecting that all the operations will be lazyly performed but at some point I get a memory error. Most of the memory is being allocated for the hamming distance and I am still unable to find the source of my memory leak. Regards, Arnoldo On Mon, Apr 19, 2010 at 3:47 PM, Daniel Fischer daniel.is.fisc...@web.dewrote: Am Montag 19 April 2010 14:37:33 schrieb John Lato: Is it really necessary to use Strings? I think a packed type, e.g. Vector or ByteString, would be much more efficient here. Not very much if the strings are fairly short (and the list isn't too long, so there's not a big difference in cache-friendliness). If eight-bit characters aren't enough, packing the strings into UArray Int Char gives performance quite close to ByteStrings. Of course this is only likely to be a benefit if you can move away from String entirely. I suspect that hamming2 would perform better then. John ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] hamming distance allocation
Hello Daniel: My % GC time is : 75.0% (81.4% elapsed) and I am compiling with -O2. Thank you for clarifying about the pointers. Slowly my memory grows up and eventually it explodes. I would expect that the list comprehension is lazily evaluated and therefore at any given time I am only executing one hamming distance. The result of the hamming distance is stored into a small statistics datatype I built (only stores sums and sum of squares and the counts). This datatype is updated using a foldr. I have no idea where the leak is. What do you see in a .prof file to find a leak (hamming distance has the largest amount of time and %alloc) ? From my .prof file where would you start looking at? Best Regards, Arnoldo Muller On Mon, Apr 19, 2010 at 3:18 AM, Daniel Fischer daniel.is.fisc...@web.dewrote: Am Montag 19 April 2010 01:03:14 schrieb Arnoldo Muller: Hello all: I want to generate some hamming distance statistics about a set of strings. As explained in another e-mail in this list, I used the following code to call the functions: (exampl holds the list of strings of size w) filter (\x - x /= 0) $ map (uncurry hammingX) [(xs, ys) | xs - exampl, ys - exampl] I have two hamming functions: -- hamming distance for variable length strings hamming :: String - String - Int hamming x y = hamming' x y 0 where hamming' [] _ !c = c hamming' _ [] !c = c hamming' (x:xs) (y:ys) !c | x == y = hamming' xs ys c | otherwise = hamming' xs ys (c + 1) -- function posted in this mailing list hamming2 :: String - String - Int hamming2 xs ys = length (filter not (zipWith (==) xs ys)) I am executing these functions millions of times and the bottleneck of my program is in them as explained by running in profiling mode with +RTS -K400M -p -RTS The costlier function is the hamming distance COST CENTREMODULE %time %alloc hammingDistances 66.6 41.9 It says that it is performing 41% of the allocations. In the case of hamming2 the allocations go as far as 52%. Allocations are cheap, so that's not necessarily a problem. More important is, what's the maximum residency and how much is copied during GC? Are you compiling with -O2 ? I could understand that there are allocations in hamming2 because we are creating pairs, but in the case of hamming there should be no allocation. Why not? I don't know how GHC counts allocations, but everytime you go from (x:xs) to xs, you need a new pointer to the tail. If that counts as allocation, hamming must allocate a lot, too. How can I execute my hamming functions without allocating memory? Best regards, Arnoldo Muller jasparReader.prof Description: Binary data ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] hamming distance allocation
Hello all: I found my leak after adding some bang patterns in a different part of the program. The compiler was generating all the combinations of the list comprehensions and therefore the performance dropped very badly. BTW, hamming is 2 times faster than hamming2. Thank you as always! Arnoldo On Mon, Apr 19, 2010 at 5:53 PM, Arnoldo Muller arnoldomul...@gmail.comwrote: Hello Daniel: My % GC time is : 75.0% (81.4% elapsed) and I am compiling with -O2. Thank you for clarifying about the pointers. Slowly my memory grows up and eventually it explodes. I would expect that the list comprehension is lazily evaluated and therefore at any given time I am only executing one hamming distance. The result of the hamming distance is stored into a small statistics datatype I built (only stores sums and sum of squares and the counts). This datatype is updated using a foldr. I have no idea where the leak is. What do you see in a .prof file to find a leak (hamming distance has the largest amount of time and %alloc) ? From my .prof file where would you start looking at? Best Regards, Arnoldo Muller On Mon, Apr 19, 2010 at 3:18 AM, Daniel Fischer daniel.is.fisc...@web.dewrote: Am Montag 19 April 2010 01:03:14 schrieb Arnoldo Muller: Hello all: I want to generate some hamming distance statistics about a set of strings. As explained in another e-mail in this list, I used the following code to call the functions: (exampl holds the list of strings of size w) filter (\x - x /= 0) $ map (uncurry hammingX) [(xs, ys) | xs - exampl, ys - exampl] I have two hamming functions: -- hamming distance for variable length strings hamming :: String - String - Int hamming x y = hamming' x y 0 where hamming' [] _ !c = c hamming' _ [] !c = c hamming' (x:xs) (y:ys) !c | x == y = hamming' xs ys c | otherwise = hamming' xs ys (c + 1) -- function posted in this mailing list hamming2 :: String - String - Int hamming2 xs ys = length (filter not (zipWith (==) xs ys)) I am executing these functions millions of times and the bottleneck of my program is in them as explained by running in profiling mode with +RTS -K400M -p -RTS The costlier function is the hamming distance COST CENTREMODULE %time %alloc hammingDistances 66.6 41.9 It says that it is performing 41% of the allocations. In the case of hamming2 the allocations go as far as 52%. Allocations are cheap, so that's not necessarily a problem. More important is, what's the maximum residency and how much is copied during GC? Are you compiling with -O2 ? I could understand that there are allocations in hamming2 because we are creating pairs, but in the case of hamming there should be no allocation. Why not? I don't know how GHC counts allocations, but everytime you go from (x:xs) to xs, you need a new pointer to the tail. If that counts as allocation, hamming must allocate a lot, too. How can I execute my hamming functions without allocating memory? Best regards, Arnoldo Muller ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] hamming distance allocation
Daniel thank you for all your advice. An additional ! bang pattern in convertIntToDouble fixed the issue! Also using a foldl' did the trick. Now the program runs as it should with a constant amount of memory and in a very small amount of time. I believe these problems are one of the major sources of frustration for Haskell newbies. Things that could work in X language easily suddenly become problems in Haskell. When you overcome these issues then you feel happy again that you chose Haskell as the main programming language of your research project. Is there any guide that explains more about the bad consumption pattern. Are there any general rules defined to avoid these issues? It helped me to re-read the chapter on profiling in the Real World Haskell book to sorta understand the problem. Is there a more detailed definition of the problem than in RWH? Regards, Arnoldo On Tue, Apr 20, 2010 at 2:49 AM, Daniel Fischer daniel.is.fisc...@web.dewrote: Am Montag 19 April 2010 17:53:27 schrieb Arnoldo Muller: Hello Daniel: My % GC time is : 75.0% (81.4% elapsed) and I am compiling with -O2. Very bad. Can I see the code? Thank you for clarifying about the pointers. Not to forget the Ints for counting. Slowly my memory grows up and eventually it explodes. I would expect that the list comprehension is lazily evaluated and therefore at any given time I am only executing one hamming distance. The result of the hamming distance is stored into a small statistics datatype I built (only stores sums and sum of squares and the counts). This datatype is updated using a foldr. That might very well be the problem, if you update it with a foldr, you must construct the entire list of 2000^2 hamming-thunks before the work can begin. It's probably better to use foldl' (and make the type strict) so you can start the work immediately. I have no idea where the leak is. What do you see in a .prof file to find a leak (hamming distance has the largest amount of time and %alloc) For finding leaks, heap profiling (-h*) gives more info than -p. The .prof says more about where you spend your time than what hangs on to memory. ? From my .prof file where would you start looking at? - use hamming instead of hamming2 - convertIntToDouble looks suspicious - calculating a few million Hamming distances takes some time, but what about getMyStats, should that really take 25%? filter (\x - x /= 0) $ map (uncurry hammingX) [(xs, ys) | xs - exampl, ys - exampl] filter (/= 0) [hamming xs ys | xs - example, ys - example] And of course, you can trivially avoid half of the work. Best Regards, Arnoldo Muller On Mon, Apr 19, 2010 at 3:18 AM, Daniel Fischer daniel.is.fisc...@web.dewrote: Am Montag 19 April 2010 01:03:14 schrieb Arnoldo Muller: Hello all: I want to generate some hamming distance statistics about a set of strings. As explained in another e-mail in this list, I used the following code to call the functions: (exampl holds the list of strings of size w) filter (\x - x /= 0) $ map (uncurry hammingX) [(xs, ys) | xs - exampl, ys - exampl] I have two hamming functions: -- hamming distance for variable length strings hamming :: String - String - Int hamming x y = hamming' x y 0 where hamming' [] _ !c = c hamming' _ [] !c = c hamming' (x:xs) (y:ys) !c | x == y = hamming' xs ys c | otherwise = hamming' xs ys (c + 1) -- function posted in this mailing list hamming2 :: String - String - Int hamming2 xs ys = length (filter not (zipWith (==) xs ys)) I am executing these functions millions of times and the bottleneck of my program is in them as explained by running in profiling mode with +RTS -K400M -p -RTS The costlier function is the hamming distance COST CENTREMODULE %time %alloc hammingDistances 66.6 41.9 It says that it is performing 41% of the allocations. In the case of hamming2 the allocations go as far as 52%. Allocations are cheap, so that's not necessarily a problem. More important is, what's the maximum residency and how much is copied during GC? Are you compiling with -O2 ? I could understand that there are allocations in hamming2 because we are creating pairs, but in the case of hamming there should be no allocation. Why not? I don't know how GHC counts allocations, but everytime you go from (x:xs) to xs, you need a new pointer to the tail. If that counts as allocation, hamming must allocate a lot, too. How can I execute my hamming functions without allocating memory? Best regards, Arnoldo Muller ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman
[Haskell-cafe] hamming distance allocation
Hello all: I want to generate some hamming distance statistics about a set of strings. As explained in another e-mail in this list, I used the following code to call the functions: (exampl holds the list of strings of size w) filter (\x - x /= 0) $ map (uncurry hammingX) [(xs, ys) | xs - exampl, ys - exampl] I have two hamming functions: -- hamming distance for variable length strings hamming :: String - String - Int hamming x y = hamming' x y 0 where hamming' [] _ !c = c hamming' _ [] !c = c hamming' (x:xs) (y:ys) !c | x == y = hamming' xs ys c | otherwise = hamming' xs ys (c + 1) -- function posted in this mailing list hamming2 :: String - String - Int hamming2 xs ys = length (filter not (zipWith (==) xs ys)) I am executing these functions millions of times and the bottleneck of my program is in them as explained by running in profiling mode with +RTS -K400M -p -RTS The costlier function is the hamming distance COST CENTREMODULE %time %alloc hammingDistances 66.6 41.9 It says that it is performing 41% of the allocations. In the case of hamming2 the allocations go as far as 52%. I could understand that there are allocations in hamming2 because we are creating pairs, but in the case of hamming there should be no allocation. How can I execute my hamming functions without allocating memory? Best regards, Arnoldo Muller ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Performance question
Hello Daniel, Regarding your solution, can I apply {-# SPECIALISE ... #-} statements to datatypes I define? And if so, I am not able to import the datatypes to the module where binarySearch is. The problem is that if I import them a circular dependency is detected and the compiler gives an error. Is there a way of importing a datatype from another module do avoid this circular dependency? Thank you, Arnoldo On Thu, Mar 18, 2010 at 10:48 PM, Daniel Fischer daniel.is.fisc...@web.dewrote: Am Donnerstag 18 März 2010 21:57:34 schrieb Daniel Fischer: Contrary to my expectations, however, using unboxed arrays is slower than straight arrays (in my tests). However, a few {-# SPECIALISE #-} pragmas set the record straight. Specialising speeds up both, boxed and unboxed arrays, significantly, but now, for the specialised types, unboxed arrays are faster (note, however, that when the code for the binary search is in the same module as it is used, with optimisations, GHC will probably specialise it itself. If binarySearch is not exported, AFAIK, you can delete probably.). {-# LANGUAGE BangPatterns #-} module SATBinSearch (binarySearch) where import Data.Array.IArray import Data.Array.Base (unsafeAt) import Data.Bits {-# SPECIALISE binarySearch :: Double - Array Int Double - Int #-} {-# SPECIALISE binarySearch :: Int - Array Int Int - Int #-} {-# SPECIALISE binarySearch :: Bool - Array Int Bool - Int #-} {-# SPECIALISE binarySearch :: Char - Array Int Char - Int #-} {-# SPECIALISE binarySearch :: Float - Array Int Float - Int #-} binarySearch :: Ord a = a - Array Int a - Int binarySearch q a = go l h where (l,h) = bounds a go !lo !hi | hi lo = -(lo+1) | otherwise = case compare mv q of LT - go (m+1) hi EQ - m GT - go lo (m-1) where -- m = lo + (hi-lo) `quot` 2 m = (lo .. hi) + (lo `xor` hi) `shiftR` 1 mv = a `unsafeAt` m Use Data.Array.Unboxed and UArray if possible. Now the bit-fiddling instead of arithmetics makes a serious difference, about 20% for unboxed arrays, 17% for boxed arrays (Double), so I'd recommend that. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: Performance question
Thank you all, I will apply your suggestions to my function. Thank you for making the process of learning Haskell much easier! Arnoldo On Fri, Mar 19, 2010 at 4:21 PM, Achim Schneider bars...@web.de wrote: Arnoldo Muller arnoldomul...@gmail.com wrote: Right now, the bottleneck of my program is in binarySearch', the function must be called a few billion times. Do you have any ideas on how to improve the performance of this function? The fastest way to do a binary search is to reify it into code using TH, in Van Emde Boas layout if it's a big enough search (so that you get less cache misses) This might of course get tricky if your tree isn't compile-time static, but if you're really doing gazillions of lookups, occasionally compiling+dynamically linking code might well be worth it. -- (c) this sig last receiving data processing entity. Inspect headers for copyright history. All rights reserved. Copying, hiring, renting, performance and/or quoting of this signature prohibited. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
[Haskell-cafe] Performance question
Hello! I am trying to implement a binary search function that returns the index of an exact or the (index + 1) where the item should be inserted in an array if the item to be searched is not found (I am not trying to insert data in the array) . Right now, the bottleneck of my program is in binarySearch', the function must be called a few billion times. Do you have any ideas on how to improve the performance of this function? import Data.Array.IArray type IntArray a = Array Int a -- The array must be 0 indexed. binarySearch :: Ord a = a - IntArray a - Int binarySearch query array = let (low, high) = bounds array in binarySearch' query array low high binarySearch' :: Ord a = a - IntArray a - Int - Int - Int binarySearch' query array !low !high | low = high = let ! mid = low + ((high - low) `div` 2) ! midVal = array ! mid in next mid midVal | otherwise = -(low + 1) where next mid midVal | midVal query = binarySearch' query array (mid + 1) high | midVal query = binarySearch' query array low (mid - 1) | otherwise = mid Thank you! Arnoldo Muller ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space leak
Jason, I am trying to use haskell in the analysis of bio data. One of the main reasons I wanted to use haskell is because lazy I/O allows you to see a large bio-sequence as if it was a string in memory. In order to achieve the same result in an imperative language I would have to write lots of error-prone iterators. I saw lazy I/O as a very strong point in favor of Haskell. Besides the space leaks that can occur and that are a bit difficult to find for a newbie like me, are there any other reasons to avoid Lazy I/O? Arnoldo. On Sat, Mar 13, 2010 at 6:46 PM, Jason Dagit da...@codersbase.com wrote: On Thu, Mar 11, 2010 at 3:44 PM, Arnoldo Muller arnoldomul...@gmail.comwrote: Daniel, Thank you so much for helping me out with this issue! Thanks to all the other answers from haskel-cafe members too! As a newbie, I am not able to understand why zip and map would make a problem... Is there any link I could read that could help me to understand why in this case zip and map created a leak? What are some function compositions that should be avoided when doing lazy I/O? Actually, it's lazy I/O itself that should be avoided. Jason ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space leak
Daniel, Thank you so much for helping me out with this issue! Thanks to all the other answers from haskel-cafe members too! As a newbie, I am not able to understand why zip and map would make a problem... Is there any link I could read that could help me to understand why in this case zip and map created a leak? What are some function compositions that should be avoided when doing lazy I/O? Regards, Arnoldo On Thu, Mar 11, 2010 at 11:46 PM, Daniel Fischer daniel.is.fisc...@web.dewrote: Am Donnerstag 11 März 2010 00:24:28 schrieb Daniel Fischer: Hmm, offhand, I don't see why that isn't strict enough. Turns out, mapM_ was a red herring. The villain was (zip and map). I must confess, I don't know why it sort-of worked without the mapM_, though. sort-of, because that also hung on to unnecessarily much memory, the space leak was just smaller than with the mapM_. A very small change that eliminates the space leak, is readFasta :: Int - [Char] - [Window] readFasta windowSize sequence = -- get the header let (header,rest) = span (/= '\n') sequence chr = parseChromosome header go i (w:ws) = Window w chr i : go (i+1) ws go _ [] = [] in go 0 $ slideWindow windowSize $ filter (/= '\n') rest You can improve performance by eliminating slideWindow and the intermediate Window list (merging fastaExtractor and readFasta), {-# LANGUAGE BangPatterns #-} readFasta2 :: (String - Bool) - Int - String readFasta2 test windowSize sequence = let (header,rest) = span (/= '\n') sequence chr = parseChromosome header schr = show chr go !i st@(_:tl) | test w= w ++ '\t' : schr ++ '\t' : show i ++ '\n' : go (i+1) tl | otherwise = go (i+1) tl where w = take windowSize st go _ [] = [] in go 0 (filter (/= '\n')) rest ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
[Haskell-cafe] Space leak
Program A - If instead of the main function given above I use the following main function to process only one input file, things work OK for even the largest files. Memory usage remains constant in this case. main = do -- get the arguments [input, output, windowSize] - getArgs -- keep the input stream inpStr - readFile input let wSize = (read windowSize)::Int writeFile output $ fastaExtractor inpStr wSize filterWindow It is not easy for me to see why is Haskell keeping data in memory. Do you have any idea why program B is not working? Thank you for your help! Arnoldo Muller ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space leak
Hello Daniel: Thanks! I employed mapM'_ but I am still getting the space leak. Any other hint? Arnoldo On Wed, Mar 10, 2010 at 10:40 PM, Daniel Fischer daniel.is.fisc...@web.dewrote: Am Mittwoch 10 März 2010 21:45:56 schrieb Arnoldo Muller: Hello, I am learning haskell and I found a space leak that I find difficult to solve. I've been asking at #haskell but we could not solve the issue. I want to lazily read a set of 22 files of about 200MB each, filter them and then I want to output the result into a unique file. If I modify the main function to work only with one input file, the program runs without issues. I will call this version A. Version B uses a mapM_ to iterate over a list of filenames and uses appendFile to output the result of filtering each file. In this case the memory usage grows sharply and quickly (profiles show constant memory growth). In less than a minute, memory occupation will make my system hang with swapping. No work is been done until the end, when all is tried to be done simultaneously. Make sure genomeExecute ... input1 has actually finished its work before genomeExecute ... input2 starts etc. One way is to use a stricter version of sequence_, sequence'_ :: Monad m = [m a] - m () sequence'_ (x:xs) = do a - x a `seq` sequence'_ xs sequence'_ [] = return () (nicer with BangPatterns, but not portable), and mapM'_ f = sequence'_ . map f Another option is making genomeExecute itself stricter. This is version B: --- Program B import Data.List import System.Environment import System.Directory import Control.Monad -- different types of chromosomes data Chromosome =C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10 | C11 | C12 | C13 | C14 | C15 | C16 | C17 | C18 | C19 | CX | CY | CMT deriving (Show) -- define a window type Sequence = [Char] -- Window data data Window = Window { sequen :: Sequence, chrom :: Chromosome, pos :: Int } -- print a window instance Show Window where show w = (sequen w) ++ \t ++ show (chrom w) ++ \t ++ show (pos w) -- Reading fasta files with haskell -- Initialize the main = do -- get the arguments (intput is [input, output, windowSize] - getArgs -- get directory contents (only names) names - getDirectoryContents input -- prepend directory let fullNames = filter isFastaFile $ map (\x - input ++ / ++ x) names let wSize = (read windowSize)::Int -- process the directories mapM (genomeExecute output wSize filterWindow) fullNames -- read the files one by one and write them to the output file genomeExecute :: String - Int - (Window - Bool) - String - IO () genomeExecute outputFile windowSize f inputFile = do fileData - readFile inputFile appendFile outputFile $ fastaExtractor fileData windowSize f ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space leak
Hello Bulat, I ran program A with writeFile instead of appendFile and it still works without problems. Regarding program B, if I use writeFile the leaking still occurs. Any other hints? :) Arnoldo On Wed, Mar 10, 2010 at 10:32 PM, Bulat Ziganshin bulat.zigans...@gmail.com wrote: Hello Arnoldo, Wednesday, March 10, 2010, 11:45:56 PM, you wrote: I am learning haskell and I found a space leak that I find difficult to solve. I've been asking at #haskell but we could not solve the issue. make some experiments - leave only one file and use version A, then replace appendFile with writeFile -- Best regards, Bulatmailto:bulat.zigans...@gmail.com ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Space leak
Hello Justin, I tried and what I saw was a constant increase in memory usage. Any particular profiling option that you would use? I do remember that there was a particular option in which the leak would dissapear (for the same amount of work) and that is why I stopped using the profiler. Thanks, Arnoldo On Wed, Mar 10, 2010 at 10:20 PM, Justin Bailey jgbai...@gmail.com wrote: Have you use the profiling tools available with GHC? http://haskell.org/ghc/docs/latest/html/users_guide/profiling.html On Wed, Mar 10, 2010 at 12:45 PM, Arnoldo Muller arnoldomul...@gmail.com wrote: Hello, I am learning haskell and I found a space leak that I find difficult to solve. I've been asking at #haskell but we could not solve the issue. I want to lazily read a set of 22 files of about 200MB each, filter them and then I want to output the result into a unique file. If I modify the main function to work only with one input file, the program runs without issues. I will call this version A. Version B uses a mapM_ to iterate over a list of filenames and uses appendFile to output the result of filtering each file. In this case the memory usage grows sharply and quickly (profiles show constant memory growth). In less than a minute, memory occupation will make my system hang with swapping. This is version B: --- Program B import Data.List import System.Environment import System.Directory import Control.Monad -- different types of chromosomes data Chromosome =C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10 | C11 | C12 | C13 | C14 | C15 | C16 | C17 | C18 | C19 | CX | CY | CMT deriving (Show) -- define a window type Sequence = [Char] -- Window data data Window = Window { sequen :: Sequence, chrom :: Chromosome, pos :: Int } -- print a window instance Show Window where show w = (sequen w) ++ \t ++ show (chrom w) ++ \t ++ show (pos w) -- Reading fasta files with haskell -- Initialize the main = do -- get the arguments (intput is [input, output, windowSize] - getArgs -- get directory contents (only names) names - getDirectoryContents input -- prepend directory let fullNames = filter isFastaFile $ map (\x - input ++ / ++ x) names let wSize = (read windowSize)::Int -- process the directories mapM (genomeExecute output wSize filterWindow) fullNames -- read the files one by one and write them to the output file genomeExecute :: String - Int - (Window - Bool) - String - IO () genomeExecute outputFile windowSize f inputFile = do fileData - readFile inputFile appendFile outputFile $ fastaExtractor fileData windowSize f -- isFastaFile :: String - Bool isFastaFile fileName = isSuffixOf .fa fileName -- fasta extractor (receives a Fasta String and returns a windowed string ready to be sorted) -- an example on how to compose several functions to parse a fasta file fastaExtractor :: String - Int - (Window - Bool) - String fastaExtractor input wSize f = printWindowList $ filter f $ readFasta wSize input -- MAIN FILTER that removes N elements from the strings! filterWindow :: Window - Bool filterWindow w = not (elem 'N' (sequen w)) -- print a window list (the printing makes it ready for output as raw data) printWindowList :: [Window] - String printWindowList l = unlines $ map show l -- read fasta, remove stuff that is not useful from it -- removes the readFasta :: Int - [Char] - [Window] readFasta windowSize sequence = -- get the header let (header:rest) = lines sequence chr = parseChromosome header in -- We now do the following: -- take window create counter remove newlines map (\(i, w) - Window w chr i) $ zip [0..] $ slideWindow windowSize $ filter ( '\n' /= ) $ unlines rest slideWindow :: Int - [Char] - [[Char]] slideWindow _ [] = [] slideWindow windowSize l@(_:xs) = take windowSize l : slideWindow windowSize xs -- Parse the chromosome from a fasta comment -- produce a more compact chromosome representation parseChromosome :: [Char] - Chromosome parseChromosome line | isInfixOf chromosome 1, line = C1 | isInfixOf chromosome 2, line = C2 | isInfixOf chromosome 3, line
Re: [Haskell-cafe] Space leak
Bulat, The same happens, the memory starts to quickly fill up... Arnoldo On Wed, Mar 10, 2010 at 11:16 PM, Bulat Ziganshin bulat.zigans...@gmail.com wrote: Hello Arnoldo, Wednesday, March 10, 2010, 11:45:56 PM, you wrote: I am learning haskell and I found a space leak that I find difficult to solve. I've been asking at #haskell but we could not solve the issue. what if you use program B on single file? -- Best regards, Bulatmailto:bulat.zigans...@gmail.com ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe