On Jul 25, 2015, at 10:43 AM, Nicholas Ingolia <n...@ingolia.org> wrote:
> 
> I would suggest opening all files at the beginning. Generate all k-mers,
> open a file for each, and make a HashMap from k-mer to file handle. When
> you're done, run through the HashMap to close all file handles. 
> 
> Also, this lets you keep your sequences/k-mers/hash keys as byte strings
> for iterating reads, and convert only during the initial file opening
> pass. 

I've tried to implement most of these suggestions in the code below except for 
the one to hash all the k-mers first because I'm working with multi-gigabyte 
files.  I can't fit that much data into memory.  Mostly this seems like a great 
approach, but, when I run it, I get this exception:

*** Exception: out/TGAAC: openFile: resource busy (file is locked)

Is this because the map parallelizes the file writing such that there's a race 
condition for the file handles?

Ken

import Bio.Core.Sequence
import Bio.Sequence.Fasta
import Control.Monad
import qualified Data.ByteString.Lazy.Char8 as B
import qualified Data.HashMap as Map
import System.IO
import System.Environment

main :: IO ()
main = do
  [f] <- getArgs
  reads <- readFasta f
  let kmers       = concatMap (findKmers 20 . unSD . seqdata) reads
  let allMers     = replicateM 5 "ACTG"
  let fileHandles = Map.fromList $
                    map (\x -> (x, openFile ("out/" ++ x) WriteMode)) allMers

  mapM_ (printMer fileHandles) kmers
  mapM_ (\ioh -> do { h <- ioh; hClose h }) $ Map.elems fileHandles
  putStrLn "Done."

-- # --------------------------------------------------
findKmers :: Integer -> B.ByteString -> [B.ByteString]
findKmers k xs = findKmers' n k xs
  where n = toInteger (B.length xs) - k + 1
        findKmers' n' k' xs'
          | n' > 0 = B.take (fromIntegral k') xs'
             : findKmers' (n' - 1) k' (B.tail xs')
          | otherwise = []

-- # --------------------------------------------------
printMer :: Map.Map String (IO Handle) -> B.ByteString -> IO ()
printMer fileHandles mer = do
  let bin    = toString (B.take 5 mer)
  let handle = Map.lookup bin fileHandles
  case handle of
    Nothing  -> putStrLn $ "Missing " ++ bin ++ " handle"
    Just ioh -> do h <- ioh
                   B.hPutStrLn h (B.drop 5 mer)

Reply via email to