Re: [Biohaskell] Fwd: Trouble reading .phd files that have extra tags

Dan Fornika Wed, 01 Jun 2011 16:07:32 -0700

Hi,

I've tried simply dropping in 'takeWhile' in place of 'filter' in that
sdata assignment, and it does bypass the error.


BUT... When I run readPhd from ghci now, I get a IO (Sequence Nuc) with
the ID and Comment intact, but the sequence info empty.

I'm a bit surprised, because the type signatures of takeWhile and filter
are the same.  I've tried replacing (==3).length with something a little
more generous, like (/=0).length but there is no change in the output.

Would someone mind taking another look at the next couple of lines:

qual = BB.fromChunks [BBB.pack . map (maybe err (fromIntegral . fst) .
B.readInt . (!!1)) $ sdata]
in  if more_magic then qual `seq` (Seq (compact $ B.unwords
(label:fields) )
                                  (compact $ B.concat $ map head sdata)
                                  (Just qual))

(unmodified from the original source) and see if there is some reason
that takeWhile will cause the sequence info not to be passed in properly?

Thanks

On 11-06-01 04:59 AM, Ketil Malde wrote:
> Dan Fornika <dforn...@gmail.com> writes:
> 
>> I've been working on using Bio.Sequence.Phd to read some .phd files.
> 
> Great!
> 
>> When I load up the module in ghci and try to apply readPhd to a .phd
>> file, I keep getting an error that says:
>>
>> *** Exception: failed to parse quality value
> 
> Not so much... :-(
> 
>> I've noticed that in the Bio.Sequence.Phd source, there is a short note
>> about adding checking for "BEGIN_DNA and END_DNA."
> 
> Yes, the PHD parser is quite rudimentary, and only tested on Phred's
> output.  I then discovered that Phred can generate fasta and qual files
> directly, so I hardly use it any more.
> 
>> Can anyone give me some hints on how I or someone else can patch up the
>> code, just so that it doesn't fail when those tags are present? 
>> Eventually it would be nice if the tags could be read too but I think
>> that can wait for another email.
> 
> If you'd like to try to fix this yourself, that's great - please send me
> a patch! This is what the code looks like (indented, my comments
> interspersed without indentation): 
> 
>     mkPhd :: B.ByteString -> (Sequence Nuc)
>     mkPhd inp = 
>       let (hd:fs) = filter (not . B.null) . B.lines $ inp
>           (comment,sd) = break (==B.pack "BEGIN_DNA") fs
>           (magic,label) = B.splitAt 15 hd
>           more_magic = magic == B.pack "BEGIN_SEQUENCE "
> 
> At this point we have the first line split into "magic" and "label", and
> a check for the right magic.  The rest is split into a comment and the
> sequence contents (using the identifier "BEGIN_DNA").
> 
>           fields = B.words . B.unlines
>                    . filter (not . isSubstr (B.pack "_COMMENT")) $ comment
>           sdata = filter ((==3).length) . map B.words $ sd
> 
> Oh.  "sdata" is anything after coment that has three (white-space
> separated) values.  Maybe "takeWhile" would be a better alternative to
> "filter" here?  At least that way, you'd stop parsing when this fails.
> 
> And, of course, something like "takeWhile (/=B.pack "END_DNA")" might be
> even more robust.  Either would probably fix your issue.
> 
>           err = error "failed to parse quality value"
>           qual = BB.fromChunks [BBB.pack . map (maybe err (fromIntegral . 
> fst) . B.readInt . (!!1)) $ sdata]
>       in if more_magic then qual `seq` (Seq (compact $ B.unwords 
> (label:fields)) 
>                                         (compact $ B.concat $ map head sdata)
>                                         (Just qual))
>          else error "Incorrectly formatted PHD file - missing BEGIN_SEQUENCE"
>          --   Todo: also check that we have a BEGIN_DNA/END_DNA region there.
> 
> Of course, this is a very hackish parser, slicing and dicing a
> ByteString input.  It really should be rewritten using some nice parser
> combinators.  The ACE parser (Bio.Alignment.ACE) uses Parsec on
> tokenized ByteString input, that could be one option.  Or you could
> probably use attoparsec directly - but it'd add a dependency.
> 
> -k

_______________________________________________
Biohaskell mailing list
Biohaskell@biohaskell.org
http://malde.org/cgi-bin/mailman/listinfo/biohaskell

Re: [Biohaskell] Fwd: Trouble reading .phd files that have extra tags

Reply via email to