Aha! I think I've got it! Look at how the 'break' function works; here is a ghci session I just worked on (just shortened "Data.Bytestring.Lazy.Char8" to "B" for brevity and clarity):
-- first create a list of bytestrings ("lobs") to work with: let lobs = map B.pack $ ["HELLO", "My name is Monster Brains", "BEGIN_DNA", "HELLO_AGAIN", "Don't be afraid"] break (==B.pack "BEGIN_DNA") lobs -- evaluates to: ([Chunk "HELLO" Empty,Chunk "My name is Monster Brains" Empty],[Chunk "BEGIN_DNA" Empty,Chunk "HELLO_AGAIN" Empty,Chunk "Don't be afraid" Empty]) Note that the "BEGIN_DNA" bytestring is the first element of the second list. When this happens in the Bio.Sequence.Phd code, it creates a list of bytestrings that is like: (comment,sd) = ["BLAH BLAH COMMENT STUFF", "BEGIN_DNA", "a 20 100", "t 25 103", ...] And since "BEGIN_DNA" has length == 1, takeWhile ((==3).length) stops there. I fixed this by switching line 32 of Phd.hs from: (comment,sd) = break (==B.pack "BEGIN_DNA") fs to this: (comment,bdna:sd) = break (==B.pack "BEGIN_DNA") fs where "bdna" is just meant to capture that "BEGIN_DNA" bytestring. When I compile, ghc complains that bdna is defined and not used, but that is sort of what I intend for now. Other than that, it works well enough for my purposes. I know this is a really minor fix, but would you mind taking me through the normal process of submitting a patch? I just want to learn how this works. Do people normally just suggest things for you to change via email, and then you integrate it yourself? Dan On 11-06-02 05:23 AM, Ketil Malde wrote: > Dan Fornika <dforn...@gmail.com> writes: > >> BUT... When I run readPhd from ghci now, I get a IO (Sequence Nuc) with >> the ID and Comment intact, but the sequence info empty. > > Okay. > >> I'm a bit surprised, because the type signatures of takeWhile and filter >> are the same. I've tried replacing (==3).length with something a little >> more generous, like (/=0).length but there is no change in the output. > > Hm, so could there be that there is an empty line immediately after > BEGIN_DNA? Perhaps the safe alternative would be something like > > (dna,rest) = break (==B.pack "BEGIN_DNA") sd > sdata = filter ((==3).length) . map B.words $ dna > >> Would someone mind taking another look at the next couple of lines: > > Not at all. > >> qual = BB.fromChunks [BBB.pack . map (maybe err (fromIntegral . fst) . >> B.readInt . (!!1)) $ sdata] > > So - this generates the quality values (which is a bytestring of Word8s) > from reading (readInt) the second word (!!1) on each line, and applying > fromIntegral to convert each of them to Word8. Since readInt returns > Maybe (number,rest_of_bytestring), 'maybe err' is used to either extract > the number, or call 'err'. > > It looks kind of noisy, but not too complicated, really. > >> in if more_magic then qual `seq` (Seq (compact $ B.unwords (label:fields) ) > > Checks the magic number, evaluates the quality values (don't want thunks > pointing into the file to hang around), and builds a Seq structure where > the header is the read label and additional fields from the PHD file > >> (compact $ B.concat $ map head sdata) > > Concatenates the sequence itself, which comes as the (single letter) > words that make up the first column (map head), compact makes sure this > is stored as a contigous lazy bytestring, not a list of single-letter > chunks. > >> (Just qual)) > > And, yes, don't forget the quality values. > >> (unmodified from the original source) and see if there is some reason >> that takeWhile will cause the sequence info not to be passed in properly? > > No, this stuff works as advertised, the problem is clipping out the > region that contains the sequence and quality data (and, IIRC, exact > position in the chromatogram? Which we ignore anyway :-) > > Thanks for working on fixing this - let me know if there's anything else > I can do to help you along. > > -k _______________________________________________ Biohaskell mailing list Biohaskell@biohaskell.org http://malde.org/cgi-bin/mailman/listinfo/biohaskell