Re: [Biohaskell] Fwd: Trouble reading .phd files that have extra tags

Ketil Malde Thu, 02 Jun 2011 05:23:21 -0700

Dan Fornika <dforn...@gmail.com> writes:

> BUT... When I run readPhd from ghci now, I get an IO (Sequence Nuc) with
> the ID and Comment intact, but the sequence info empty.


Okay.

> I'm a bit surprised, because the type signatures of takeWhile and filter
> are the same.  I've tried replacing (==3).length with something a little
> more generous, like (/=0).length but there is no change in the output.

Hm, so could it be that there is an empty line immediately after
BEGIN_DNA?  Perhaps the safe alternative would be something like

   (dna,rest) = break (==B.pack "END_DNA") sd
   sdata = filter ((==3).length) . map B.words $ dna
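
To make the difference between the two approaches concrete, here is a
minimal sketch. It assumes that sd is the list of lines starting at the
BEGIN_DNA marker and that the sequence block is closed by an END_DNA line.
The names dnaLines, dnaLinesTakeWhile and sample are hypothetical, and the
sample input is made up for illustration.

```haskell
import qualified Data.ByteString.Lazy.Char8 as B

-- Hypothetical helper: cut the input at END_DNA first, then keep only the
-- three-column lines (base, quality, position) from the part before it.
dnaLines :: [B.ByteString] -> [[B.ByteString]]
dnaLines sd = filter ((== 3) . length) . map B.words $ dna
  where
    (dna, _rest) = break (== B.pack "END_DNA") sd

-- takeWhile stops at the first line that fails the test, so any
-- non-matching first line (an empty line, or the BEGIN_DNA marker itself)
-- leaves nothing at all.
dnaLinesTakeWhile :: [B.ByteString] -> [[B.ByteString]]
dnaLinesTakeWhile = takeWhile ((== 3) . length) . map B.words

-- Made-up sample: two bases followed by a tag block whose lines happen to
-- contain three words as well.
sample :: [B.ByteString]
sample = map B.pack
  [ "BEGIN_DNA"
  , "a 9 6"
  , "c 12 18"
  , "END_DNA"
  , "BEGIN_TAG"
  , "UNPADDED_READ_POS: 5 5"
  , "END_TAG"
  ]
```

With this sample, dnaLines keeps only the two base lines, a plain filter
over all of sd would also pick up the three-word tag line, and
dnaLinesTakeWhile returns an empty list.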

> Would someone mind taking another look at the next couple of lines:

Not at all.

> qual = BB.fromChunks [BBB.pack . map (maybe err (fromIntegral . fst) . 
> B.readInt . (!!1)) $ sdata]

So - this generates the quality values (which is a bytestring of Word8s)
from reading (readInt) the second word (!!1) on each line, and applying
fromIntegral to convert each of them to Word8.  Since readInt returns
Maybe (number,rest_of_bytestring), 'maybe err' is used to either extract
the number, or call 'err'.

It looks kind of noisy, but not too complicated, really.
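
Spelled out in smaller steps, the same idea might look like the sketch
below. The import aliases follow the ones used in the quoted line (B for
lazy Char8, BB for lazy, BBB for strict bytestrings), which is an
assumption about the module header; qualValue and qualities are
hypothetical names.

```haskell
import qualified Data.ByteString as BBB
import qualified Data.ByteString.Lazy as BB
import qualified Data.ByteString.Lazy.Char8 as B
import Data.Word (Word8)

-- Read the second word of one line as an Int and narrow it to a Word8.
-- readInt returns Maybe (number, rest), so 'maybe err' either extracts
-- the number or fails loudly.
qualValue :: [B.ByteString] -> Word8
qualValue ws = maybe err (fromIntegral . fst) (B.readInt (ws !! 1))
  where
    err = error "failed to parse quality value"

-- Pack all quality values into a single strict chunk and wrap that in a
-- lazy bytestring.
qualities :: [[B.ByteString]] -> BB.ByteString
qualities sdata = BB.fromChunks [BBB.pack (map qualValue sdata)]
```

Note that (!!1) is only safe because sdata has already been filtered down
to lines with exactly three words.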

> in  if more_magic then qual `seq` (Seq (compact $ B.unwords (label:fields) )

Checks the magic number, evaluates the quality values (don't want thunks
pointing into the file to hang around), and builds a Seq structure where
the header is the read label and additional fields from the PHD file.

>                                   (compact $ B.concat $ map head sdata)

Concatenates the sequence itself, which comes as the (single letter)
words that make up the first column (map head); compact makes sure this
is stored as a contiguous lazy bytestring, not a list of single-letter
chunks.
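
As a rough illustration of what that buys you, here is a sketch. compact'
is an assumed stand-in that copies all chunks into one, not necessarily how
the library's compact is written, and sequenceOf is a hypothetical name.

```haskell
import qualified Data.ByteString as BBB
import qualified Data.ByteString.Lazy as BB
import qualified Data.ByteString.Lazy.Char8 as B

-- Assumed stand-in for compact: copy every chunk into one strict
-- bytestring and wrap it as a single-chunk lazy bytestring.
compact' :: BB.ByteString -> BB.ByteString
compact' s = BB.fromChunks [BBB.concat (BB.toChunks s)]

-- Take the first word of every line (the base call) and glue them together.
sequenceOf :: [[B.ByteString]] -> BB.ByteString
sequenceOf = compact' . B.concat . map head
```

Without the compacting step, B.concat leaves one tiny chunk per base, each
of which can keep a reference to the buffer it was sliced from.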

>                                   (Just qual))

And, yes, don't forget the quality values.

> (unmodified from the original source) and see if there is some reason
> that takeWhile will cause the sequence info not to be passed in properly?

No, this stuff works as advertised; the problem is clipping out the
region that contains the sequence and quality data (and, IIRC, the exact
position in the chromatogram, which we ignore anyway :-)

Thanks for working on fixing this - let me know if there's anything else
I can do to help you along.

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants
