On Wednesday, 6 February 2013 at 15:40:39 UTC, bioinfornatics wrote:
It seem in any case is not easy to parse fastly a file in D

I don't think that's true. D provides the same "FILE" primitive you'd get in C, so there is no reason for it to be slower than C.

It is the "range" approach that, as convenient as it is, is not well adapted for certain things.

As I said, I tried writing my own program. In it, I devised a range that, instead of exposing the input to parse character by character, parses an entire "object" (a ... "genome" ... maybe? I called them "Q" in my program) at once. To read the input, I decided to use the very simple "byLine" primitive.

From there, you can query the object for its name/sequence/quality. The irony is that despite "parsing twice" (once to do the IO read, once to do the actual processing of the text), and despite allocating each object individually, it runs twice as fast as my original, already improved implementation. Not only is it faster, it is also more convenient, since you can extract an entire Q object at once and then operate on it as you please: separation of algorithm and parsing.

It correctly takes into account that a sequence can be multiple lines. It does not strip whitespace because according to http://maq.sourceforge.net/fastq.shtml whitespace is not a legal character.
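To make the idea concrete, here is a minimal sketch of such a range. It is not the code behind the dpaste link, just an illustration under a few assumptions: the names `Q`, `QRange` and `byQ` are made up for this example, records follow the usual layout ('@' header, sequence lines, '+' separator, quality lines), and the quality is read until it is as long as the sequence, which is how multi-line records are handled here.

```d
import std.exception : enforce;
import std.range.primitives : empty, front, popFront;
import std.stdio : File, writeln;

// Hypothetical record type; one freshly allocated string per field.
struct Q
{
    string name;
    string sequence;
    string quality;
}

// Input range that yields one whole Q per step from any range of lines.
struct QRange(Lines)
{
    private Lines lines;
    private Q current;
    private bool done;

    this(Lines lines)
    {
        this.lines = lines;
        popFront(); // prime the first record
    }

    @property bool empty() const { return done; }
    @property Q front() { return current; }

    void popFront()
    {
        if (lines.empty) { done = true; return; }

        auto header = lines.front;
        enforce(header.length > 0 && header[0] == '@', "expected '@' header");
        // byLine reuses its buffer, so copy what we keep.
        current = Q(header[1 .. $].idup, null, null);
        lines.popFront();

        // Sequence: every line up to the '+' separator.
        while (!lines.empty && !(lines.front.length > 0 && lines.front[0] == '+'))
        {
            current.sequence ~= lines.front;
            lines.popFront();
        }
        enforce(!lines.empty, "missing '+' separator");
        lines.popFront(); // drop the separator line

        // Quality: read until it is as long as the sequence. A quality
        // line may itself start with '@' or '+', so only length is tested.
        while (!lines.empty && current.quality.length < current.sequence.length)
        {
            current.quality ~= lines.front;
            lines.popFront();
        }
        enforce(current.quality.length == current.sequence.length,
                "quality and sequence lengths differ");
    }
}

// Convenience wrapper so that UFCS works: file.byLine().byQ()
auto byQ(Lines)(Lines lines) { return QRange!Lines(lines); }

void main(string[] args)
{
    if (args.length < 2) { writeln("usage: fastq_names <file.fastq>"); return; }
    foreach (q; File(args[1]).byLine().byQ())
        writeln(q.name, '\t', q.sequence.length);
}
```

Because `byQ` accepts any range of lines, the same parser can be fed an array of strings for testing, and the algorithm that consumes the Q objects never sees the IO.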

Now, keep in mind that this approach allocates three new strings for each Q. You could *try* an approach with a pre-allocated, reusable buffer. This would mean you can only operate on one Q at a time, but you'd probably iterate over them faster.
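A rough sketch of what that buffer-reusing variant could look like follows. I have not measured it, and the names `QView`, `QReuseRange` and `byQReuse` are invented for the example. The assumption is that resetting each buffer with `length = 0` followed by `assumeSafeAppend` lets `~=` append in place instead of allocating, so after the first few records there should be few or no new allocations.

```d
import std.exception : enforce;
import std.range.primitives : empty, front, popFront;
import std.stdio : File, writeln;

// Hypothetical view type: the slices point into buffers owned by the range
// and are overwritten by the next popFront, so copy them to keep them.
struct QView
{
    const(char)[] name;
    const(char)[] sequence;
    const(char)[] quality;
}

struct QReuseRange(Lines)
{
    private Lines lines;
    private char[] name, sequence, quality; // reused for every record
    private bool done;

    this(Lines lines)
    {
        this.lines = lines;
        popFront(); // prime the first record
    }

    @property bool empty() const { return done; }
    @property QView front() { return QView(name, sequence, quality); }

    void popFront()
    {
        if (lines.empty) { done = true; return; }
        reset(name);
        reset(sequence);
        reset(quality);

        auto header = lines.front;
        enforce(header.length > 0 && header[0] == '@', "expected '@' header");
        name ~= header[1 .. $];
        lines.popFront();

        // Sequence: every line up to the '+' separator.
        while (!lines.empty && !(lines.front.length > 0 && lines.front[0] == '+'))
        {
            sequence ~= lines.front;
            lines.popFront();
        }
        enforce(!lines.empty, "missing '+' separator");
        lines.popFront(); // drop the separator line

        // Quality: read until it is as long as the sequence.
        while (!lines.empty && quality.length < sequence.length)
        {
            quality ~= lines.front;
            lines.popFront();
        }
        enforce(quality.length == sequence.length,
                "quality and sequence lengths differ");
    }

    // Empty the buffer but keep its allocation available for appending.
    private static void reset(ref char[] buf)
    {
        buf.length = 0;
        buf.assumeSafeAppend();
    }
}

auto byQReuse(Lines)(Lines lines) { return QReuseRange!Lines(lines); }

void main(string[] args)
{
    if (args.length < 2) { writeln("usage: fastq_count <file.fastq>"); return; }
    size_t records, bases;
    foreach (q; File(args[1]).byLine().byQReuse())
    {
        ++records;
        bases += q.sequence.length;
    }
    writeln(records, " records, ", bases, " bases");
}
```

The trade-off is the one described above: `front` is only valid until the next `popFront`, so anything you want to keep must be copied (for example with `.idup`).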

In any case, you can try it out:
http://dpaste.dzfl.pl/8bdd0c84
