Re: [Haskell-cafe] hamming distance allocation

2010-04-19 Thread Arnoldo Muller
Hello John:

Well I could use a packed type. The only letters that will be found in the
string are ATCG so yeah I don't need unicode and those things.

Will try out with vector or ByteString. Thanks! :)

On Mon, Apr 19, 2010 at 2:37 PM, John Lato jwl...@gmail.com wrote:

  Subject: Re: [Haskell-cafe] hamming distance allocation
 
  Am Montag 19 April 2010 01:03:14 schrieb Arnoldo Muller:
  Hello all:
 
  I want to generate some hamming distance statistics about a set of
  strings. As explained in another e-mail in this list, I used the
  following code to call the
  functions:
  (exampl holds the list of strings of size w)
  filter (\x - x /= 0) $ map (uncurry hammingX) [(xs, ys) | xs - exampl,
  ys - exampl]
 
  I have two hamming functions:
  -- hamming distance for variable length strings
  hamming :: String - String - Int
  hamming x y = hamming' x y 0
  where
hamming' [] _ !c = c
hamming' _ [] !c = c
hamming' (x:xs) (y:ys) !c
| x == y = hamming' xs ys c
| otherwise = hamming' xs ys (c + 1)
 
  -- function posted in this mailing list
  hamming2 :: String - String - Int
  hamming2 xs ys = length (filter not (zipWith (==) xs ys))
 
  I am executing these functions millions of times and the bottleneck of
  my program is in them as explained by running in profiling mode with
  +RTS -K400M -p -RTS
 
  The costlier function is the hamming distance
  COST CENTREMODULE   %time %alloc
 
  hammingDistances 66.6   41.9
 
  It says that it is performing 41% of the allocations. In the case of
  hamming2 the allocations go as far as 52%.
 
  Allocations are cheap, so that's not necessarily a problem. More
 important
  is, what's the maximum residency and how much is copied during GC?
  Are you compiling with -O2 ?
 
  I could understand that
  there are allocations in hamming2 because we are creating pairs, but
  in the case of hamming there should be no allocation.
 
  Why not? I don't know how GHC counts allocations, but everytime you go
 from
  (x:xs) to xs, you need a new pointer to the tail. If that counts as
  allocation, hamming must allocate a lot, too.

 Is it really necessary to use Strings?  I think a packed type, e.g.
 Vector or ByteString, would be much more efficient here.  Of course
 this is only likely to be a benefit if you can move away from String
 entirely.

 I suspect that hamming2 would perform better then.

 John
 ___
 Haskell-Cafe mailing list
 Haskell-Cafe@haskell.org
 http://www.haskell.org/mailman/listinfo/haskell-cafe

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] hamming distance allocation

2010-04-19 Thread Arnoldo Muller
The strings will not be longer than 30 characters.
I am doing sets of 2000  (total of 2000^2 distance computations)

I am expecting that all the operations will be lazyly performed but at some
point I get a memory error.

Most of the memory is being allocated for the hamming distance and I am
still unable to find the source of my memory leak.

Regards,

Arnoldo

On Mon, Apr 19, 2010 at 3:47 PM, Daniel Fischer daniel.is.fisc...@web.dewrote:

 Am Montag 19 April 2010 14:37:33 schrieb John Lato:
  Is it really necessary to use Strings?  I think a packed type, e.g.
  Vector or ByteString, would be much more efficient here.

 Not very much if the strings are fairly short (and the list isn't too long,
 so there's not a big difference in cache-friendliness).
 If eight-bit characters aren't enough, packing the strings into
 UArray Int Char gives performance quite close to ByteStrings.

  Of course this is only likely to be a benefit if you can move away from
  String entirely.
 
  I suspect that hamming2 would perform better then.
 
  John

 ___
 Haskell-Cafe mailing list
 Haskell-Cafe@haskell.org
 http://www.haskell.org/mailman/listinfo/haskell-cafe

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] hamming distance allocation

2010-04-19 Thread Arnoldo Muller
Hello Daniel:

My % GC time is : 75.0%  (81.4% elapsed) and I am compiling with -O2.
Thank you for clarifying about the pointers.

Slowly my memory grows up and eventually it explodes. I would expect that
the list comprehension is lazily evaluated and therefore at any given time I
am only executing one hamming distance. The result of the hamming distance
is stored into a small statistics datatype I built (only stores sums and sum
of squares and the counts). This datatype is updated using a foldr.

I have no idea where the leak is. What do you see in a .prof file to find a
leak (hamming distance has the largest amount of time and %alloc)  ? From my
.prof file where would you start looking at?


Best Regards,

Arnoldo Muller



On Mon, Apr 19, 2010 at 3:18 AM, Daniel Fischer daniel.is.fisc...@web.dewrote:

 Am Montag 19 April 2010 01:03:14 schrieb Arnoldo Muller:
  Hello all:
 
  I want to generate some hamming distance statistics about a set of
  strings. As explained in another e-mail in this list, I used the
  following code to call the
  functions:
  (exampl holds the list of strings of size w)
  filter (\x - x /= 0) $ map (uncurry hammingX) [(xs, ys) | xs - exampl,
  ys - exampl]
 
  I have two hamming functions:
  -- hamming distance for variable length strings
  hamming :: String - String - Int
  hamming x y = hamming' x y 0
  where
hamming' [] _ !c = c
hamming' _ [] !c = c
hamming' (x:xs) (y:ys) !c
| x == y = hamming' xs ys c
| otherwise = hamming' xs ys (c + 1)
 
  -- function posted in this mailing list
  hamming2 :: String - String - Int
  hamming2 xs ys = length (filter not (zipWith (==) xs ys))
 
  I am executing these functions millions of times and the bottleneck of
  my program is in them as explained by running in profiling mode with
  +RTS -K400M -p -RTS
 
  The costlier function is the hamming distance
  COST CENTREMODULE   %time %alloc
 
  hammingDistances 66.6   41.9
 
  It says that it is performing 41% of the allocations. In the case of
  hamming2 the allocations go as far as 52%.

 Allocations are cheap, so that's not necessarily a problem. More important
 is, what's the maximum residency and how much is copied during GC?
 Are you compiling with -O2 ?

  I could understand that
  there are allocations in hamming2 because we are creating pairs, but
  in the case of hamming there should be no allocation.

 Why not? I don't know how GHC counts allocations, but everytime you go from
 (x:xs) to xs, you need a new pointer to the tail. If that counts as
 allocation, hamming must allocate a lot, too.

 
  How can I execute my hamming functions without allocating memory?
 
  Best regards,
 
  Arnoldo Muller




jasparReader.prof
Description: Binary data
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] hamming distance allocation

2010-04-19 Thread Arnoldo Muller
Hello all:

I found my leak after adding some bang patterns in a different part of the
program. The compiler was generating all the combinations of the list
comprehensions and therefore the performance dropped very badly.

BTW, hamming is 2 times faster than hamming2.

Thank you as always!

Arnoldo

On Mon, Apr 19, 2010 at 5:53 PM, Arnoldo Muller arnoldomul...@gmail.comwrote:

 Hello Daniel:

 My % GC time is : 75.0%  (81.4% elapsed) and I am compiling with -O2.
 Thank you for clarifying about the pointers.

 Slowly my memory grows up and eventually it explodes. I would expect that
 the list comprehension is lazily evaluated and therefore at any given time I
 am only executing one hamming distance. The result of the hamming distance
 is stored into a small statistics datatype I built (only stores sums and sum
 of squares and the counts). This datatype is updated using a foldr.

 I have no idea where the leak is. What do you see in a .prof file to find a
 leak (hamming distance has the largest amount of time and %alloc)  ? From my
 .prof file where would you start looking at?


 Best Regards,

 Arnoldo Muller




 On Mon, Apr 19, 2010 at 3:18 AM, Daniel Fischer 
 daniel.is.fisc...@web.dewrote:

 Am Montag 19 April 2010 01:03:14 schrieb Arnoldo Muller:
  Hello all:
 
  I want to generate some hamming distance statistics about a set of
  strings. As explained in another e-mail in this list, I used the
  following code to call the
  functions:
  (exampl holds the list of strings of size w)
  filter (\x - x /= 0) $ map (uncurry hammingX) [(xs, ys) | xs - exampl,
  ys - exampl]
 
  I have two hamming functions:
  -- hamming distance for variable length strings
  hamming :: String - String - Int
  hamming x y = hamming' x y 0
  where
hamming' [] _ !c = c
hamming' _ [] !c = c
hamming' (x:xs) (y:ys) !c
| x == y = hamming' xs ys c
| otherwise = hamming' xs ys (c + 1)
 
  -- function posted in this mailing list
  hamming2 :: String - String - Int
  hamming2 xs ys = length (filter not (zipWith (==) xs ys))
 
  I am executing these functions millions of times and the bottleneck of
  my program is in them as explained by running in profiling mode with
  +RTS -K400M -p -RTS
 
  The costlier function is the hamming distance
  COST CENTREMODULE   %time %alloc
 
  hammingDistances 66.6   41.9
 
  It says that it is performing 41% of the allocations. In the case of
  hamming2 the allocations go as far as 52%.

 Allocations are cheap, so that's not necessarily a problem. More important
 is, what's the maximum residency and how much is copied during GC?
 Are you compiling with -O2 ?

  I could understand that
  there are allocations in hamming2 because we are creating pairs, but
  in the case of hamming there should be no allocation.

 Why not? I don't know how GHC counts allocations, but everytime you go
 from
 (x:xs) to xs, you need a new pointer to the tail. If that counts as
 allocation, hamming must allocate a lot, too.

 
  How can I execute my hamming functions without allocating memory?
 
  Best regards,
 
  Arnoldo Muller



___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] hamming distance allocation

2010-04-19 Thread Arnoldo Muller
Daniel thank you for all your advice.

An additional ! bang pattern in convertIntToDouble fixed the issue! Also
using a foldl'
did the trick.

Now the program runs as it should with a constant amount of memory and in a
very small amount of time.

I believe these problems are one of the major sources of frustration for
Haskell newbies. Things that could work in X language easily suddenly
become problems in Haskell. When you overcome these issues then you feel
happy again that you chose Haskell as the main programming language of your
research project.

Is there any guide that explains more about the bad consumption pattern.
Are there any general rules defined to avoid these issues? It helped me to
re-read the chapter on profiling in the Real World Haskell book to sorta
understand the problem. Is there a more detailed definition of the problem
than in RWH?

Regards,

Arnoldo

On Tue, Apr 20, 2010 at 2:49 AM, Daniel Fischer daniel.is.fisc...@web.dewrote:

 Am Montag 19 April 2010 17:53:27 schrieb Arnoldo Muller:
  Hello Daniel:
 
  My % GC time is : 75.0%  (81.4% elapsed) and I am compiling with -O2.

 Very bad. Can I see the code?

  Thank you for clarifying about the pointers.

 Not to forget the Ints for counting.

 
  Slowly my memory grows up and eventually it explodes. I would expect
  that the list comprehension is lazily evaluated and therefore at any
  given time I am only executing one hamming distance. The result of the
  hamming distance is stored into a small statistics datatype I built
  (only stores sums and sum of squares and the counts). This datatype is
  updated using a foldr.

 That might very well be the problem, if you update it with a foldr, you
 must construct the entire list of 2000^2 hamming-thunks before the work can
 begin.
 It's probably better to use foldl' (and make the type strict) so you can
 start the work immediately.

 
  I have no idea where the leak is. What do you see in a .prof file to
  find a leak (hamming distance has the largest amount of time and %alloc)

 For finding leaks, heap profiling (-h*) gives more info than -p. The .prof
 says more about where you spend your time than what hangs on to memory.

   ? From my .prof file where would you start looking at?

 - use hamming instead of hamming2
 - convertIntToDouble looks suspicious
 - calculating a few million Hamming distances takes some time, but what
 about getMyStats, should that really take 25%?


filter (\x - x /= 0) $ map (uncurry hammingX) [(xs, ys) | xs -
exampl, ys - exampl]
   

 filter (/= 0) [hamming xs ys | xs - example, ys - example]

 And of course, you can trivially avoid half of the work.

 
 
  Best Regards,
 
  Arnoldo Muller
 
  On Mon, Apr 19, 2010 at 3:18 AM, Daniel Fischer
 daniel.is.fisc...@web.dewrote:
   Am Montag 19 April 2010 01:03:14 schrieb Arnoldo Muller:
Hello all:
   
I want to generate some hamming distance statistics about a set of
strings. As explained in another e-mail in this list, I used the
following code to call the
functions:
(exampl holds the list of strings of size w)
filter (\x - x /= 0) $ map (uncurry hammingX) [(xs, ys) | xs -
exampl, ys - exampl]
   
I have two hamming functions:
-- hamming distance for variable length strings
hamming :: String - String - Int
hamming x y = hamming' x y 0
where
  hamming' [] _ !c = c
  hamming' _ [] !c = c
  hamming' (x:xs) (y:ys) !c
   
  | x == y = hamming' xs ys c
  | otherwise = hamming' xs ys (c + 1)
   
-- function posted in this mailing list
hamming2 :: String - String - Int
hamming2 xs ys = length (filter not (zipWith (==) xs ys))
   
I am executing these functions millions of times and the bottleneck
of my program is in them as explained by running in profiling mode
with +RTS -K400M -p -RTS
   
The costlier function is the hamming distance
COST CENTREMODULE   %time %alloc
   
hammingDistances 66.6   41.9
   
It says that it is performing 41% of the allocations. In the case of
hamming2 the allocations go as far as 52%.
  
   Allocations are cheap, so that's not necessarily a problem. More
   important is, what's the maximum residency and how much is copied
   during GC? Are you compiling with -O2 ?
  
I could understand that
there are allocations in hamming2 because we are creating pairs,
but in the case of hamming there should be no allocation.
  
   Why not? I don't know how GHC counts allocations, but everytime you go
   from (x:xs) to xs, you need a new pointer to the tail. If that counts
   as allocation, hamming must allocate a lot, too.
  
How can I execute my hamming functions without allocating memory?
   
Best regards,
   
Arnoldo Muller


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman

[Haskell-cafe] hamming distance allocation

2010-04-18 Thread Arnoldo Muller
Hello all:

I want to generate some hamming distance statistics about a set of strings.
As explained in another e-mail in this list, I used the following code to
call the
functions:
(exampl holds the list of strings of size w)
filter (\x - x /= 0) $ map (uncurry hammingX) [(xs, ys) | xs - exampl, ys
- exampl]

I have two hamming functions:
-- hamming distance for variable length strings
hamming :: String - String - Int
hamming x y = hamming' x y 0
where
  hamming' [] _ !c = c
  hamming' _ [] !c = c
  hamming' (x:xs) (y:ys) !c
  | x == y = hamming' xs ys c
  | otherwise = hamming' xs ys (c + 1)

-- function posted in this mailing list
hamming2 :: String - String - Int
hamming2 xs ys = length (filter not (zipWith (==) xs ys))

I am executing these functions millions of times and the bottleneck of my
program is in them as explained by running in profiling mode with  +RTS
-K400M -p -RTS

The costlier function is the hamming distance
COST CENTREMODULE   %time %alloc

hammingDistances 66.6   41.9

It says that it is performing 41% of the allocations. In the case of
hamming2 the allocations go as far as 52%.  I could understand that there
are allocations in hamming2 because we are creating pairs, but in the case
of hamming there should be no allocation.

How can I execute my hamming functions without allocating memory?

Best regards,

Arnoldo Muller
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Performance question

2010-03-20 Thread Arnoldo Muller
Hello Daniel,

Regarding your solution, can I apply {-# SPECIALISE ... #-} statements to
datatypes I define?
And if so, I am not able to import the datatypes to the module where
binarySearch is.
The problem is that if I import them a circular dependency is detected and
the compiler gives an error.
Is there a way of importing a datatype from another module do avoid this
circular dependency?

Thank you,

Arnoldo

On Thu, Mar 18, 2010 at 10:48 PM, Daniel Fischer
daniel.is.fisc...@web.dewrote:

 Am Donnerstag 18 März 2010 21:57:34 schrieb Daniel Fischer:
 
  Contrary to my expectations, however, using unboxed arrays is slower
  than straight arrays (in my tests).
 

 However, a few {-# SPECIALISE #-} pragmas set the record straight.
 Specialising speeds up both, boxed and unboxed arrays, significantly, but
 now, for the specialised types, unboxed arrays are faster (note, however,
 that when the code for the binary search is in the same module as it is
 used, with optimisations, GHC will probably specialise it itself. If
 binarySearch is not exported, AFAIK, you can delete probably.).

 {-# LANGUAGE BangPatterns #-}
 module SATBinSearch (binarySearch) where

 import Data.Array.IArray
 import Data.Array.Base (unsafeAt)
 import Data.Bits

 {-# SPECIALISE binarySearch :: Double - Array Int Double - Int #-}
 {-# SPECIALISE binarySearch :: Int - Array Int Int - Int #-}
 {-# SPECIALISE binarySearch :: Bool - Array Int Bool - Int #-}
 {-# SPECIALISE binarySearch :: Char - Array Int Char - Int #-}
 {-# SPECIALISE binarySearch :: Float - Array Int Float - Int #-}
 binarySearch :: Ord a = a - Array Int a - Int
 binarySearch q a = go l h
  where
(l,h) = bounds a
go !lo !hi
| hi  lo   = -(lo+1)
| otherwise = case compare mv q of
LT - go (m+1) hi
EQ - m
GT - go lo (m-1)
  where
 -- m = lo + (hi-lo) `quot` 2
 m = (lo .. hi) + (lo `xor` hi) `shiftR` 1
mv = a `unsafeAt` m

 Use Data.Array.Unboxed and UArray if possible.
 Now the bit-fiddling instead of arithmetics makes a serious difference,
 about 20% for unboxed arrays, 17% for boxed arrays (Double), so I'd
 recommend that.

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Re: Performance question

2010-03-19 Thread Arnoldo Muller
Thank you all, I will apply your suggestions to my function.

Thank you for making the process of learning Haskell much easier!

Arnoldo

On Fri, Mar 19, 2010 at 4:21 PM, Achim Schneider bars...@web.de wrote:

 Arnoldo Muller arnoldomul...@gmail.com wrote:

  Right now, the bottleneck of my program is in binarySearch', the
  function must be called a few billion times.
 
  Do you have any ideas on how to improve the performance of this
  function?
 
 The fastest way to do a binary search is to reify it into code using
 TH, in Van Emde Boas layout if it's a big enough search (so that you
 get less cache misses)

 This might of course get tricky if your tree isn't compile-time static,
 but if you're really doing gazillions of lookups, occasionally
 compiling+dynamically linking code might well be worth it.

 --
 (c) this sig last receiving data processing entity. Inspect headers
 for copyright history. All rights reserved. Copying, hiring, renting,
 performance and/or quoting of this signature prohibited.


 ___
 Haskell-Cafe mailing list
 Haskell-Cafe@haskell.org
 http://www.haskell.org/mailman/listinfo/haskell-cafe

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] Performance question

2010-03-18 Thread Arnoldo Muller
Hello!

I am trying to implement a binary search function that returns the index of
an
exact or the (index + 1) where the item should be inserted in an array if
the item to be searched is not found (I am not trying to insert data in the
array) .

Right now, the bottleneck of my program is in binarySearch', the function
must be called a few billion times.

Do you have any ideas on how to improve the performance of this function?

import Data.Array.IArray

type IntArray a = Array Int a

-- The array must be 0 indexed.
binarySearch :: Ord a =  a -  IntArray a  - Int
binarySearch query array =
let (low, high) = bounds array
in
   binarySearch' query array low high


binarySearch' :: Ord a =  a -  IntArray a - Int - Int - Int
binarySearch' query array !low !high
| low = high = let ! mid = low + ((high - low) `div` 2)
 ! midVal = array !
mid
   in next mid midVal
| otherwise = -(low + 1)
where next mid midVal
   |  midVal  query = binarySearch' query array  (mid + 1) high
   |  midVal  query = binarySearch' query array  low  (mid - 1)
   |  otherwise = mid


Thank you!

Arnoldo Muller
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Space leak

2010-03-13 Thread Arnoldo Muller
Jason,

I am trying to use haskell in the analysis of bio data. One of the main
reasons I wanted to use haskell is because lazy I/O allows you to see a
large bio-sequence as if it was a string in memory.
In order to achieve the same result in an imperative language I would have
to write lots of error-prone iterators. I saw lazy I/O as a very strong
point in favor of Haskell.

Besides the space leaks that can occur and that are a bit difficult to find
for a newbie like me, are there any other reasons to avoid Lazy I/O?

Arnoldo.

On Sat, Mar 13, 2010 at 6:46 PM, Jason Dagit da...@codersbase.com wrote:



 On Thu, Mar 11, 2010 at 3:44 PM, Arnoldo Muller 
 arnoldomul...@gmail.comwrote:

 Daniel,

 Thank you so much for helping me out with this issue!

 Thanks to all the other answers from haskel-cafe members too!

 As a newbie, I am not able to understand why zip and map would make a
 problem...

 Is there any link I could read that could help me to understand why in
 this case
 zip and map created a leak? What are some function compositions that
 should be
 avoided when doing lazy I/O?


 Actually, it's lazy I/O itself that should be avoided.

 Jason

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Space leak

2010-03-11 Thread Arnoldo Muller
Daniel,

Thank you so much for helping me out with this issue!

Thanks to all the other answers from haskel-cafe members too!

As a newbie, I am not able to understand why zip and map would make a
problem...

Is there any link I could read that could help me to understand why in this
case
zip and map created a leak? What are some function compositions that should
be
avoided when doing lazy I/O?

Regards,

Arnoldo


On Thu, Mar 11, 2010 at 11:46 PM, Daniel Fischer
daniel.is.fisc...@web.dewrote:

 Am Donnerstag 11 März 2010 00:24:28 schrieb Daniel Fischer:
  Hmm, offhand, I don't see why that isn't strict enough.

 Turns out, mapM_ was a red herring. The villain was (zip and map).
 I must confess, I don't know why it sort-of worked without the mapM_,
 though. sort-of, because that also hung on to unnecessarily much memory,
 the space leak was just smaller than with the mapM_.

 A very small change that eliminates the space leak, is

 readFasta :: Int - [Char] - [Window]
 readFasta windowSize sequence =
 -- get the header
 let (header,rest) = span (/= '\n') sequence
 chr = parseChromosome header
go i (w:ws) = Window w chr i : go (i+1) ws
go _ [] = []
in go 0 $ slideWindow windowSize $ filter (/= '\n') rest

 You can improve performance by eliminating slideWindow and the intermediate
 Window list (merging fastaExtractor and readFasta),

 {-# LANGUAGE BangPatterns #-}

 readFasta2 :: (String - Bool) - Int - String
 readFasta2 test windowSize sequence =
let (header,rest) = span (/= '\n') sequence
chr = parseChromosome header
schr = show chr
go !i st@(_:tl)
| test w= w ++ '\t' : schr ++ '\t' : show i ++ '\n' : go
 (i+1) tl
| otherwise = go (i+1) tl
  where
w = take windowSize st
go _ [] = []
in go 0 (filter (/= '\n')) rest


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


[Haskell-cafe] Space leak

2010-03-10 Thread Arnoldo Muller


 Program A
-
If instead of the main function given above I use the following main
function to process only one input file, things work OK for even
the largest files. Memory usage remains constant in this case.

main = do
   -- get the arguments
   [input, output, windowSize] - getArgs
   -- keep the input stream
   inpStr - readFile input
   let wSize = (read windowSize)::Int
   writeFile output $ fastaExtractor inpStr wSize filterWindow


It is not easy for me to see why is Haskell keeping data in memory. Do you
have any idea why  program B is
not working?

Thank you for your help!

Arnoldo Muller
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Space leak

2010-03-10 Thread Arnoldo Muller
Hello Daniel:

Thanks!
I employed mapM'_ but I am still getting the space leak.
Any other hint?



Arnoldo

On Wed, Mar 10, 2010 at 10:40 PM, Daniel Fischer
daniel.is.fisc...@web.dewrote:

 Am Mittwoch 10 März 2010 21:45:56 schrieb Arnoldo Muller:
  Hello,
 
  I am learning haskell and I found a space leak that I find difficult to
  solve. I've been asking at #haskell but we could not solve
  the issue.
 
  I want to lazily read a set of 22 files of about 200MB each, filter them
  and then I want to output the result into a unique file.
  If I modify the main function to work only with one input file,  the
  program runs without issues. I will call this version A.
  Version B  uses a mapM_ to iterate over a list of filenames and uses
  appendFile to output the result of filtering each file.
  In this case the memory usage grows sharply and quickly (profiles show
  constant memory growth). In less than a minute, memory
  occupation will make my system hang with swapping.

 No work is been done until the end, when all is tried to be done
 simultaneously. Make sure genomeExecute ... input1 has actually finished
 its work before genomeExecute ... input2 starts etc.

 One way is to use a stricter version of sequence_,

 sequence'_ :: Monad m = [m a] - m ()
 sequence'_ (x:xs) = do
a - x
a `seq` sequence'_ xs
 sequence'_ [] = return ()

 (nicer with BangPatterns, but not portable), and

 mapM'_ f = sequence'_ . map f

 Another option is making genomeExecute itself stricter.

 
  This is version B:
 
  --- Program B
  
  import Data.List
  import System.Environment
  import System.Directory
  import Control.Monad
 
 
  -- different types of chromosomes
  data Chromosome =C1
 
  | C2
  | C3
  | C4
  | C5
  | C6
  | C7
  | C8
  | C9
  | C10
  | C11
  | C12
  | C13
  | C14
  | C15
  | C16
  | C17
  | C18
  | C19
  | CX
  | CY
  | CMT
 
deriving (Show)
  -- define a window
  type Sequence = [Char]
  -- Window data
  data Window = Window { sequen :: Sequence,
 chrom :: Chromosome,
 pos   :: Int
   }
  -- print a window
  instance Show Window where
  show w =  (sequen w) ++ \t ++ show (chrom w) ++ \t ++ show (pos
  w)
 
  -- Reading fasta files with haskell
 
  -- Initialize the
  main = do
 -- get the arguments (intput is
 [input, output, windowSize] - getArgs
 -- get directory contents (only names)
 names - getDirectoryContents input
 -- prepend directory
 let fullNames = filter isFastaFile $ map (\x - input ++ / ++
  x) names
 let wSize = (read windowSize)::Int
 -- process the directories
 mapM (genomeExecute output wSize filterWindow)  fullNames
 
 
  -- read the files one by one and write them to the output file
  genomeExecute :: String - Int - (Window - Bool) - String - IO ()
  genomeExecute  outputFile windowSize f inputFile = do
fileData - readFile inputFile
appendFile outputFile $ fastaExtractor fileData windowSize f



___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Space leak

2010-03-10 Thread Arnoldo Muller
Hello Bulat,

I ran program A with writeFile instead of appendFile and it still works
without problems.
Regarding program B, if I use writeFile the leaking still occurs.

Any other hints? :)

Arnoldo

On Wed, Mar 10, 2010 at 10:32 PM, Bulat Ziganshin bulat.zigans...@gmail.com
 wrote:

 Hello Arnoldo,

 Wednesday, March 10, 2010, 11:45:56 PM, you wrote:

  I am learning haskell and I found a space leak that I find
  difficult to solve. I've been asking at #haskell but we could not solve
  the issue.

 make some experiments - leave only one file and use version A, then
 replace appendFile with writeFile

 --
 Best regards,
  Bulatmailto:bulat.zigans...@gmail.com


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] Space leak

2010-03-10 Thread Arnoldo Muller
Hello Justin,

I tried and what I saw was a constant increase in memory usage.
Any particular profiling option that you would use?

I do remember that there was a particular option in which the leak would
dissapear (for the same amount of work) and that is why I stopped using the
profiler.

Thanks,

Arnoldo


On Wed, Mar 10, 2010 at 10:20 PM, Justin Bailey jgbai...@gmail.com wrote:

 Have you use the profiling tools available with GHC?

  http://haskell.org/ghc/docs/latest/html/users_guide/profiling.html


 On Wed, Mar 10, 2010 at 12:45 PM, Arnoldo Muller
 arnoldomul...@gmail.com wrote:
  Hello,
 
  I am learning haskell and I found a space leak that I find difficult to
  solve. I've been asking at #haskell but we could not solve
  the issue.
 
  I want to lazily read a set of 22 files of about 200MB each, filter them
 and
  then I want to output the result into a unique file.
  If I modify the main function to work only with one input file,  the
 program
  runs without issues. I will call this version A.
  Version B  uses a mapM_ to iterate over a list of filenames and uses
  appendFile to output the result of filtering each file.
  In this case the memory usage grows sharply and quickly (profiles show
  constant memory growth). In less than a minute, memory
  occupation will make my system hang with swapping.
 
  This is version B:
 
  --- Program B
 
 
  import Data.List
  import System.Environment
  import System.Directory
  import Control.Monad
 
 
  -- different types of chromosomes
  data Chromosome =C1
  | C2
  | C3
  | C4
  | C5
  | C6
  | C7
  | C8
  | C9
  | C10
  | C11
  | C12
  | C13
  | C14
  | C15
  | C16
  | C17
  | C18
  | C19
  | CX
  | CY
  | CMT
deriving (Show)
  -- define a window
  type Sequence = [Char]
  -- Window data
  data Window = Window { sequen :: Sequence,
 chrom :: Chromosome,
 pos   :: Int
   }
  -- print a window
  instance Show Window where
  show w =  (sequen w) ++ \t ++ show (chrom w) ++ \t ++ show (pos
 w)
 
  -- Reading fasta files with haskell
 
  -- Initialize the
  main = do
 -- get the arguments (intput is
 [input, output, windowSize] - getArgs
 -- get directory contents (only names)
 names - getDirectoryContents input
 -- prepend directory
 let fullNames = filter isFastaFile $ map (\x - input ++ / ++ x)
  names
 let wSize = (read windowSize)::Int
 -- process the directories
 mapM (genomeExecute output wSize filterWindow)  fullNames
 
 
  -- read the files one by one and write them to the output file
  genomeExecute :: String - Int - (Window - Bool) - String - IO ()
  genomeExecute  outputFile windowSize f inputFile = do
fileData - readFile inputFile
appendFile outputFile $ fastaExtractor fileData windowSize f
 
  --
  isFastaFile :: String - Bool
  isFastaFile fileName = isSuffixOf .fa fileName
 
 
  -- fasta extractor (receives a Fasta String and returns a windowed string
  ready to be sorted)
  -- an example on how to compose several functions to parse a fasta file
  fastaExtractor :: String - Int - (Window - Bool) - String
  fastaExtractor input wSize f = printWindowList $ filter f $ readFasta
 wSize
  input
 
  -- MAIN FILTER that removes N elements from the strings!
  filterWindow :: Window - Bool
  filterWindow w = not (elem 'N' (sequen w))
 
  -- print a window list (the printing makes it ready for output as raw
 data)
  printWindowList :: [Window] - String
  printWindowList l = unlines $ map show l
 
  -- read fasta, remove stuff that is not useful from it
  -- removes the
  readFasta :: Int - [Char] - [Window]
  readFasta windowSize sequence =
  -- get the header
  let (header:rest) = lines sequence
  chr = parseChromosome header
  in
 
  -- We now do the following:
  --  take window  create counter
  remove newlines
 map (\(i, w) - Window w chr i) $ zip [0..]  $ slideWindow windowSize
 $
  filter ( '\n' /= )  $ unlines rest
 
 
  slideWindow :: Int - [Char] - [[Char]]
  slideWindow _ [] = []
  slideWindow windowSize l@(_:xs)  = take windowSize l : slideWindow
  windowSize xs
 
 
 
  -- Parse the chromosome from a fasta comment
  -- produce a more compact chromosome representation
  parseChromosome :: [Char] - Chromosome
  parseChromosome line
  | isInfixOf chromosome 1, line = C1
  | isInfixOf chromosome 2, line = C2
  | isInfixOf chromosome 3, line

Re: [Haskell-cafe] Space leak

2010-03-10 Thread Arnoldo Muller
Bulat,

The same happens, the memory starts to quickly fill up...

Arnoldo

On Wed, Mar 10, 2010 at 11:16 PM, Bulat Ziganshin bulat.zigans...@gmail.com
 wrote:

 Hello Arnoldo,

 Wednesday, March 10, 2010, 11:45:56 PM, you wrote:

  I am learning haskell and I found a space leak that I find
  difficult to solve. I've been asking at #haskell but we could not solve
  the issue.

 what if you use program B on single file?


 --
 Best regards,
  Bulatmailto:bulat.zigans...@gmail.com


___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe