Dan Fornika <dforn...@gmail.com> writes: > I've been working on using Bio.Sequence.Phd to read some .phd files.
Great! > When I load up the module in ghci and try to apply readPhd to a .phd > file, I keep getting an error that says: > > *** Exception: failed to parse quality value Not so much... :-( > I've noticed that in the Bio.Sequence.Phd source, there is a short note > about adding checking for "BEGIN_DNA and END_DNA." Yes, the PHD parser is quite rudimentary, and only tested on Phred's output. I then discovered that Phred can generate fasta and qual files directly, so I hardly use it any more. > Can anyone give me some hints on how I or someone else can patch up the > code, just so that it doesn't fail when those tags are present? > Eventually it would be nice if the tags could be read too but I think > that can wait for another email. If you'd like to try to fix this yourself, that's great - please send me a patch! This is what the code looks like (indented, my comments interspersed without indentation): mkPhd :: B.ByteString -> (Sequence Nuc) mkPhd inp = let (hd:fs) = filter (not . B.null) . B.lines $ inp (comment,sd) = break (==B.pack "BEGIN_DNA") fs (magic,label) = B.splitAt 15 hd more_magic = magic == B.pack "BEGIN_SEQUENCE " At this point we have the first line split into "magic" and "label", and a check for the right magic. The rest is split into a comment and the sequence contents (using the identifier "BEGIN_DNA"). fields = B.words . B.unlines . filter (not . isSubstr (B.pack "_COMMENT")) $ comment sdata = filter ((==3).length) . map B.words $ sd Oh. "sdata" is anything after coment that has three (white-space separated) values. Maybe "takeWhile" would be a better alternative to "filter" here? At least that way, you'd stop parsing when this fails. And, of course, something like "takeWhile (/=B.pack "END_DNA")" might be even more robust. Either would probably fix your issue. err = error "failed to parse quality value" qual = BB.fromChunks [BBB.pack . map (maybe err (fromIntegral . fst) . B.readInt . (!!1)) $ sdata] in if more_magic then qual `seq` (Seq (compact $ B.unwords (label:fields)) (compact $ B.concat $ map head sdata) (Just qual)) else error "Incorrectly formatted PHD file - missing BEGIN_SEQUENCE" -- Todo: also check that we have a BEGIN_DNA/END_DNA region there. Of course, this is a very hackish parser, slicing and dicing a ByteString input. It really should be rewritten using some nice parser combinators. The ACE parser (Bio.Alignment.ACE) uses Parsec on tokenized ByteString input, that could be one option. Or you could probably use attoparsec directly - but it'd add a dependency. -k -- If I haven't seen further, it is by standing in the footprints of giants _______________________________________________ Biohaskell mailing list Biohaskell@biohaskell.org http://malde.org/cgi-bin/mailman/listinfo/biohaskell