On Wednesday, 6 February 2013 at 19:19:52 UTC, FG wrote:
On 2013-02-04 15:04, bioinfornatics wrote:
I am looking to parse huge files efficiently, but I think D is lacking for this purpose. To parse 12 GB I need 11 minutes, whereas fastxtoolkit (written in C++) needs 2 min.

Haven't compared to fastxtoolkit, but I have some code for you.
I have processed the file SRR077487_1.filt.fastq from
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG00096/sequence_read/
and expect this syntax (no multiline sequences or whitespace).
The file takes up almost 6 GB; processing took 1m45s, twice as fast as the
fastest D solution so far.

Do you mean my solution above? I tried your solution with dmd, with -release -O -inline, and both gave about the same result (69s yours, 67s mine).

Data contains both sequence letter and associated quality information. Sequence ID and comment are slices of the buffer, so they have valid info
until you move to the next sequence (and the number increments).

Hum. Mine allocates new slices, so they are never invalidated :)
Mine also takes into account newlines and lowercase sequences.
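
To make the difference between the two approaches concrete, here is a minimal sketch (not the code either of us posted). The helper name headerId and the sample header lines are made up for illustration; it assumes a header line that starts with '@' and whose ID runs up to the first space.

```d
import std.stdio;

// Hypothetical helper: returns the ID field of a header line such as
// "@SEQ_1 first". Assumes the line starts with '@'. The result is a
// slice of `line`, so no memory is allocated.
const(char)[] headerId(const(char)[] line)
{
    size_t end = 1;
    while (end < line.length && line[end] != ' ')
        ++end;
    return line[1 .. end];
}

void main()
{
    // Stand-in for a reusable read buffer holding the current record.
    char[] buffer = "@SEQ_1 first".dup;

    const(char)[] view = headerId(buffer); // shares memory with buffer
    string copy = headerId(buffer).idup;   // independent immutable copy

    // Simulate reading the next record into the same buffer.
    buffer[] = "@SEQ_2 other";

    writeln(view); // SEQ_2 (the slice now shows the new record)
    writeln(copy); // SEQ_1 (the copy is unaffected)
}
```

The slice costs nothing but is only valid until the buffer is refilled; the copy costs one allocation per field but can be kept around.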

Still, it seems you and I took different approaches. I had mentioned using a reusable buffer. I'm going to try to borrow some of your code to see if I can improve my implementation.

@bioinfornatics

I'm getting really interested in the subject. I'm going to try to write an actual library/framework for working with fastq files in a D environment.

This means I'll try to write robust and usable code, with both stability and performance in mind, as opposed to the "proofs of concept" so far.

For now, I'd like to keep it simple: Would something that only knows how to parse/write Sanger FASTQ files be of help to you?
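
As a rough idea of what "Sanger only" would mean on the quality side: Sanger FASTQ stores each quality value as one ASCII character equal to the Phred score plus 33. A minimal sketch of the decoding step could look like this. The function name phredScores is hypothetical, and the sketch assumes well-formed input (every character is at least '!').

```d
import std.stdio;

// Hypothetical helper: decodes a Sanger quality string into Phred scores.
// Sanger FASTQ encodes each score as the ASCII character (score + 33).
// Assumes every input character is >= '!' (33); no validation is done.
ubyte[] phredScores(const(char)[] quality)
{
    auto scores = new ubyte[](quality.length);
    foreach (i, c; quality)
        scores[i] = cast(ubyte)(c - 33);
    return scores;
}

void main()
{
    // '!' is 33, '5' is 53, 'I' is 73 in ASCII.
    writeln(phredScores("!5I")); // [0, 20, 40]
}
```

Other FASTQ variants use a different offset, which is why I would rather start with one variant and get it right.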

If I write something, can I have you review it?
