Hi Ken,
I'd suggest always using ByteString instead of String for sequence data:
it is much more efficient and uses less memory.
Just:

import qualified Data.ByteString.Char8 as BS

and instead of findKmers 20 . toString, write a ByteString version of
findKmers with the following signature and definition:

findKmers :: Int -> BS.ByteString -> [BS.ByteString]
findKmers k xs = findKmers' n k xs
  where
    -- number of k-mers in a sequence of this length
    n = BS.length xs - k + 1
    findKmers' n' k' xs'
      | n' > 0    = BS.take k' xs' : findKmers' (n' - 1) k' (BS.tail xs')
      | otherwise = []
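
To make the ByteString version concrete, here is a small self-contained sketch. The helper is renamed to "go", and exampleRead and findKmersTails are names I made up for illustration; findKmersTails is just an alternative formulation using BS.tails, not something from your code. On strict ByteStrings, BS.take and BS.tail are O(1) and do not copy the data, so every k-mer shares the buffer of the original read, which is probably a large part of the memory saving.

```haskell
import qualified Data.ByteString.Char8 as BS

-- Sliding-window k-mers, same logic as the definition above.
findKmers :: Int -> BS.ByteString -> [BS.ByteString]
findKmers k xs = go (BS.length xs - k + 1) xs
  where
    go n s
      | n > 0     = BS.take k s : go (n - 1) (BS.tail s)
      | otherwise = []

-- Alternative formulation (illustration only): take the first k
-- characters of every suffix that is still at least k long.
findKmersTails :: Int -> BS.ByteString -> [BS.ByteString]
findKmersTails k xs = [BS.take k t | t <- BS.tails xs, BS.length t >= k]

-- Hypothetical example read, not real data.
exampleRead :: BS.ByteString
exampleRead = BS.pack "ACGTAC"

-- The four 3-mers of the example read: ACG, CGT, GTA, TAC.
exampleKmers :: [BS.ByteString]
exampleKmers = findKmers 3 exampleRead
```

In GHCi, mapM_ BS.putStrLn exampleKmers prints one k-mer per line. If the read is shorter than k, both versions return an empty list.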
Switching to ByteString drastically reduced memory use in my case,
letting the code finish in about 3 minutes:
real 3m11.755s
user 1m46.364s
sys 1m25.280s
Of course, now it spends about 50% of the time opening and closing files... :-)
--
Cheers
Michal
On Sun, Jul 26, 2015 at 11:34 AM, Youens-Clark, Charles Kenneth - (kyclark)
<[email protected]> wrote:
> On Jul 25, 2015, at 10:43 AM, Nicholas Ingolia <[email protected]> wrote:
> >
> > I would suggest opening all files at the beginning. Generate all k-mers,
> > open a file for each, and make a HashMap from k-mer to file handle. When
> > you're done, run through the HashMap to close all file handles.
>
> I really like this idea and feel I'm close to having something that
> works. Here's a bit:
>
> main = do
>   reads <- readFasta "test.fa"
>   let kmers = concatMap (findKmers 20 . toString . seqdata) reads
>   let allMers = replicateM 5 "ACTG"
>   let fileHandles = Map.fromList $
>         map (\x -> (x, openFile ("out/" ++ x) WriteMode))
>             allMers
>
>   mapM_ (printMer fileHandles) kmers
>
>   mapM_ hClose $ Map.elems fileHandles
>
> Here "fileHandles" type is:
>
> fileHandles :: Map.Map [Char] (IO Handle)
>
> But it would be much easier if the elems were just Handles. Is there a
> way to do this? I tried this:
>
>   let fileHandles = Map.fromList $
>         map (\x -> do h <- openFile ("out/" ++ x) WriteMode
>                       (x, h))
>             allMers
>
> I'm still really hung up on the whole IO monad thing.
>
> Ken
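
On getting plain Handles into the Map: openFile returns an IO Handle, so the open actions have to be run inside the do block before the Map is built. In the second attempt the last line of the lambda's do block, (x, h), is not an IO action; it needs return (x, h), and the resulting list of actions then has to be run with mapM or forM rather than map. Below is a minimal sketch of one way to do it, not necessarily the best one. The names openHandles, writeMer and run are made up for illustration (writeMer stands in for your printMer, which I have not seen), and I assume the output file is chosen by the first p characters of each k-mer.

```haskell
import Control.Monad (forM, replicateM)
import qualified Data.Map as Map
import System.Directory (createDirectoryIfMissing)
import System.IO (Handle, IOMode (WriteMode), hClose, hPutStrLn, openFile)

-- Open one file per name inside IO, so the Map holds plain Handles.
openHandles :: FilePath -> [String] -> IO (Map.Map String Handle)
openHandles dir names = do
  pairs <- forM names $ \x -> do
    h <- openFile (dir ++ "/" ++ x) WriteMode
    return (x, h)
  return (Map.fromList pairs)

-- Hypothetical stand-in for printMer: write a k-mer to the file
-- selected by its first p characters (assumed behaviour).
writeMer :: Int -> Map.Map String Handle -> String -> IO ()
writeMer p handles kmer =
  case Map.lookup (take p kmer) handles of
    Just h  -> hPutStrLn h kmer
    Nothing -> return ()  -- prefix has a character outside "ACTG"; skipped

-- Open all handles once, write every k-mer, then close everything.
run :: FilePath -> Int -> [String] -> IO ()
run dir p kmers = do
  createDirectoryIfMissing True dir
  handles <- openHandles dir (replicateM p "ACTG")
  mapM_ (writeMer p handles) kmers
  mapM_ hClose (Map.elems handles)
```

With this, main would reduce to reading the k-mers and calling run "out" 5 kmers. One caveat: replicateM 5 "ACTG" yields 4^5 = 1024 names, and keeping that many files open at once may exceed the per-process open-file limit on some systems.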
--
Regards
Michał