Re: [Biohaskell] Fwd: Trouble reading .phd files that have extra tags

Dan Fornika Wed, 01 Jun 2011 09:38:55 -0700

Thanks for the quick reply.

You may recall that I'm working on an hxt-based parser for polyphred XML
output.  My intention is to eventually be able to compare tags in the
.phd files with those in the polyphred output, so down the road I'd like
to do something useful with the full .phd files.  But for now, I just
want that readPhd function to return a useful value.


I'm going to try to do some version of that quick patch you suggested.
Rewriting the module based on another library is beyond my skill level
right now.

If it works, this will be my first patch to an open-source project, wish
me luck!

Dan

p.s. polyphred-xml (such as it is) is hosted at:

https://patch-tag.com/r/dfornika/polyphred-xml/home

Once it's a usable module, I'll throw it up on hackage.

On 11-06-01 04:59 AM, Ketil Malde wrote:
> Dan Fornika <dforn...@gmail.com> writes:
> 
>> I've been working on using Bio.Sequence.Phd to read some .phd files.
> 
> Great!
> 
>> When I load up the module in ghci and try to apply readPhd to a .phd
>> file, I keep getting an error that says:
>>
>> *** Exception: failed to parse quality value
> 
> Not so much... :-(
> 
>> I've noticed that in the Bio.Sequence.Phd source, there is a short note
>> about adding checking for "BEGIN_DNA and END_DNA."
> 
> Yes, the PHD parser is quite rudimentary, and only tested on Phred's
> output.  I then discovered that Phred can generate fasta and qual files
> directly, so I hardly use it any more.
> 
>> Can anyone give me some hints on how I or someone else can patch up the
>> code, just so that it doesn't fail when those tags are present? 
>> Eventually it would be nice if the tags could be read too but I think
>> that can wait for another email.
> 
> If you'd like to try to fix this yourself, that's great - please send me
> a patch! This is what the code looks like (indented, my comments
> interspersed without indentation): 
> 
>     mkPhd :: B.ByteString -> (Sequence Nuc)
>     mkPhd inp = 
>       let (hd:fs) = filter (not . B.null) . B.lines $ inp
>           (comment,sd) = break (==B.pack "BEGIN_DNA") fs
>           (magic,label) = B.splitAt 15 hd
>           more_magic = magic == B.pack "BEGIN_SEQUENCE "
> 
> At this point we have the first line split into "magic" and "label", and
> a check for the right magic.  The rest is split into a comment and the
> sequence contents (using the identifier "BEGIN_DNA").
> 
>           fields = B.words . B.unlines
>                    . filter (not . isSubstr (B.pack "_COMMENT")) $ comment
>           sdata = filter ((==3).length) . map B.words $ sd
> 
> Oh.  "sdata" is anything after coment that has three (white-space
> separated) values.  Maybe "takeWhile" would be a better alternative to
> "filter" here?  At least that way, you'd stop parsing when this fails.
> 
> And, of course, something like "takeWhile (/=B.pack "END_DNA")" might be
> even more robust.  Either would probably fix your issue.
> 
>           err = error "failed to parse quality value"
>           qual = BB.fromChunks [BBB.pack . map (maybe err (fromIntegral . 
> fst) . B.readInt . (!!1)) $ sdata]
>       in if more_magic then qual `seq` (Seq (compact $ B.unwords 
> (label:fields)) 
>                                         (compact $ B.concat $ map head sdata)
>                                         (Just qual))
>          else error "Incorrectly formatted PHD file - missing BEGIN_SEQUENCE"
>          --   Todo: also check that we have a BEGIN_DNA/END_DNA region there.
> 
> Of course, this is a very hackish parser, slicing and dicing a
> ByteString input.  It really should be rewritten using some nice parser
> combinators.  The ACE parser (Bio.Alignment.ACE) uses Parsec on
> tokenized ByteString input, that could be one option.  Or you could
> probably use attoparsec directly - but it'd add a dependency.
> 
> -k

_______________________________________________
Biohaskell mailing list
Biohaskell@biohaskell.org
http://malde.org/cgi-bin/mailman/listinfo/biohaskell

Re: [Biohaskell] Fwd: Trouble reading .phd files that have extra tags

Reply via email to