Re: [Biohaskell] Fwd: Trouble reading .phd files that have extra tags

Ketil Malde Wed, 01 Jun 2011 04:59:54 -0700

Dan Fornika <dforn...@gmail.com> writes:

> I've been working on using Bio.Sequence.Phd to read some .phd files.


Great!

> When I load up the module in ghci and try to apply readPhd to a .phd
> file, I keep getting an error that says:
>
> *** Exception: failed to parse quality value

Not so much... :-(

> I've noticed that in the Bio.Sequence.Phd source, there is a short note
> about adding checking for "BEGIN_DNA and END_DNA."

Yes, the PHD parser is quite rudimentary, and only tested on Phred's
output.  I then discovered that Phred can generate fasta and qual files
directly, so I hardly use it any more.

> Can anyone give me some hints on how I or someone else can patch up the
> code, just so that it doesn't fail when those tags are present? 
> Eventually it would be nice if the tags could be read too but I think
> that can wait for another email.

If you'd like to try to fix this yourself, that's great - please send me
a patch! This is what the code looks like (indented, my comments
interspersed without indentation): 

    mkPhd :: B.ByteString -> (Sequence Nuc)
    mkPhd inp = 
      let (hd:fs) = filter (not . B.null) . B.lines $ inp
          (comment,sd) = break (==B.pack "BEGIN_DNA") fs
          (magic,label) = B.splitAt 15 hd
          more_magic = magic == B.pack "BEGIN_SEQUENCE "

At this point we have the first line split into "magic" and "label", and
a check for the right magic.  The rest is split into a comment and the
sequence contents (using the identifier "BEGIN_DNA").

          fields = B.words . B.unlines
                   . filter (not . isSubstr (B.pack "_COMMENT")) $ comment
          sdata = filter ((==3).length) . map B.words $ sd

Oh.  "sdata" is anything after coment that has three (white-space
separated) values.  Maybe "takeWhile" would be a better alternative to
"filter" here?  At least that way, you'd stop parsing when this fails.

And, of course, something like "takeWhile (/=B.pack "END_DNA")" might be
even more robust.  Either would probably fix your issue.

          err = error "failed to parse quality value"
          qual = BB.fromChunks [BBB.pack . map (maybe err (fromIntegral . fst) 
. B.readInt . (!!1)) $ sdata]
      in if more_magic then qual `seq` (Seq (compact $ B.unwords 
(label:fields)) 
                                        (compact $ B.concat $ map head sdata)
                                        (Just qual))
         else error "Incorrectly formatted PHD file - missing BEGIN_SEQUENCE"
         --   Todo: also check that we have a BEGIN_DNA/END_DNA region there.

Of course, this is a very hackish parser, slicing and dicing a
ByteString input.  It really should be rewritten using some nice parser
combinators.  The ACE parser (Bio.Alignment.ACE) uses Parsec on
tokenized ByteString input, that could be one option.  Or you could
probably use attoparsec directly - but it'd add a dependency.

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants
_______________________________________________
Biohaskell mailing list
Biohaskell@biohaskell.org
http://malde.org/cgi-bin/mailman/listinfo/biohaskell

Re: [Biohaskell] Fwd: Trouble reading .phd files that have extra tags

Reply via email to