On Jul 25, 2015, at 10:43 AM, Nicholas Ingolia <n...@ingolia.org> wrote: > > I would suggest opening all files at the beginning. Generate all k-mers, > open a file for each, and make a HashMap from k-mer to file handle. When > you're done, run through the HashMap to close all file handles. > > Also, this lets you keep your sequences/k-mers/hash keys as byte strings > for iterating reads, and convert only during the initial file opening > pass.
I've tried to implement most of these suggestions in the code below except for the one to hash all the k-mers first because I'm working with multi-gigabyte files. I can't fit that much data into memory. Mostly this seems like a great approach, but, when I run it, I get this exception: *** Exception: out/TGAAC: openFile: resource busy (file is locked) Is this because the map parallelizes the file writing such that there's a race condition for the file handles? Ken import Bio.Core.Sequence import Bio.Sequence.Fasta import Control.Monad import qualified Data.ByteString.Lazy.Char8 as B import qualified Data.HashMap as Map import System.IO import System.Environment main :: IO () main = do [f] <- getArgs reads <- readFasta f let kmers = concatMap (findKmers 20 . unSD . seqdata) reads let allMers = replicateM 5 "ACTG" let fileHandles = Map.fromList $ map (\x -> (x, openFile ("out/" ++ x) WriteMode)) allMers mapM_ (printMer fileHandles) kmers mapM_ (\ioh -> do { h <- ioh; hClose h }) $ Map.elems fileHandles putStrLn "Done." -- # -------------------------------------------------- findKmers :: Integer -> B.ByteString -> [B.ByteString] findKmers k xs = findKmers' n k xs where n = toInteger (B.length xs) - k + 1 findKmers' n' k' xs' | n' > 0 = B.take (fromIntegral k') xs' : findKmers' (n' - 1) k' (B.tail xs') | otherwise = [] -- # -------------------------------------------------- printMer :: Map.Map String (IO Handle) -> B.ByteString -> IO () printMer fileHandles mer = do let bin = toString (B.take 5 mer) let handle = Map.lookup bin fileHandles case handle of Nothing -> putStrLn $ "Missing " ++ bin ++ " handle" Just ioh -> do h <- ioh B.hPutStrLn h (B.drop 5 mer)