On Jul 25, 2015, at 10:43 AM, Nicholas Ingolia <[email protected]> wrote:
>
> I would suggest opening all files at the beginning. Generate all k-mers,
> open a file for each, and make a HashMap from k-mer to file handle. When
> you're done, run through the HashMap to close all file handles.
>
> Also, this lets you keep your sequences/k-mers/hash keys as byte strings
> for iterating reads, and convert only during the initial file opening
> pass.
I've tried to implement most of these suggestions in the code below except for
the one to hash all the k-mers first because I'm working with multi-gigabyte
files. I can't fit that much data into memory. Mostly this seems like a great
approach, but, when I run it, I get this exception:
*** Exception: out/TGAAC: openFile: resource busy (file is locked)
Is this because the map parallelizes the file writing such that there's a race
condition for the file handles?
Ken
import Bio.Core.Sequence
import Bio.Sequence.Fasta
import Control.Monad
import qualified Data.ByteString.Lazy.Char8 as B
import qualified Data.HashMap as Map
import System.IO
import System.Environment
main :: IO ()
main = do
[f] <- getArgs
reads <- readFasta f
let kmers = concatMap (findKmers 20 . unSD . seqdata) reads
let allMers = replicateM 5 "ACTG"
let fileHandles = Map.fromList $
map (\x -> (x, openFile ("out/" ++ x) WriteMode)) allMers
mapM_ (printMer fileHandles) kmers
mapM_ (\ioh -> do { h <- ioh; hClose h }) $ Map.elems fileHandles
putStrLn "Done."
-- # --------------------------------------------------
findKmers :: Integer -> B.ByteString -> [B.ByteString]
findKmers k xs = findKmers' n k xs
where n = toInteger (B.length xs) - k + 1
findKmers' n' k' xs'
| n' > 0 = B.take (fromIntegral k') xs'
: findKmers' (n' - 1) k' (B.tail xs')
| otherwise = []
-- # --------------------------------------------------
printMer :: Map.Map String (IO Handle) -> B.ByteString -> IO ()
printMer fileHandles mer = do
let bin = toString (B.take 5 mer)
let handle = Map.lookup bin fileHandles
case handle of
Nothing -> putStrLn $ "Missing " ++ bin ++ " handle"
Just ioh -> do h <- ioh
B.hPutStrLn h (B.drop 5 mer)