On Wednesday, 6 February 2013 at 22:55:14 UTC, FG wrote:
On 2013-02-06 21:43, monarch_dodra wrote:
On Wednesday, 6 February 2013 at 19:19:52 UTC, FG wrote:
I have processed the file SRR077487_1.filt.fastq from
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG00096/sequence_read/
and expect this syntax (no multiline sequences or whitespace).
The file takes up almost 6 GB and processing took 1m45s, twice as fast as the
fastest D solution so far.

Do you mean my solution above? I tried your solution with dmd, with -release -O -inline, and both gave about the same result (69s yours, 67s mine).

Yes. Maybe CPU is the bottleneck on my end.
With DMD32 2.060 on win7-64 compiled with same flags I got:
MD: 4m30s / FG: 1m55s, both using 100% of one core.
Quite similar results with GDC64.

You have timed the same file SRR077487_1.filt.fastq at 67s?

Yes, that file exactly. That said, I'm working on an SSD, so maybe I'm less IO bound than you are?

My attempt was mostly to see how fast we could go using only high-level constructs (e.g., no fSomething calls).

Going lower level and parsing the text manually, scanning for magic characters (like what you did), could probably yield a better result.
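
To make the lower-level idea concrete, here is a minimal sketch in Python (chosen for brevity; it is not taken from either solution in this thread). It assumes the strict four-line record layout mentioned earlier in the thread (no multiline sequences, no extra whitespace), and the function name `iter_fastq_records` is hypothetical.

```python
def iter_fastq_records(buf: bytes):
    """Yield (name, sequence, quality) tuples from strict four-line FASTQ bytes.

    Assumption: every record is exactly four lines:
    '@name', sequence, '+...' separator, quality.
    """
    pos = 0
    end = len(buf)
    while pos < end:
        # Magic character: every record must start with '@'.
        if buf[pos:pos + 1] != b"@":
            raise ValueError(f"expected '@' at byte offset {pos}")
        fields = []
        for _ in range(4):
            # Scan for the next line feed instead of splitting the whole buffer.
            nl = buf.find(b"\n", pos)
            if nl == -1:
                nl = end  # last line without a trailing newline
            fields.append(buf[pos:nl])
            pos = nl + 1
        header, seq, plus, qual = fields
        # Magic character: the third line must start with '+'.
        if not plus.startswith(b"+"):
            raise ValueError("expected '+' separator line")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, qual
```

A caller would iterate over it, for example `for name, seq, qual in iter_fastq_records(data): ...`. Supporting multiline sequences or stray whitespace would need a different loop, since the four-line assumption no longer holds.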

I'm also going to try playing around with threads: just last week I wrote a program that did exactly this (asynchronous file reads).
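
For readers unfamiliar with the idea, asynchronous reads can be sketched roughly as follows, again in Python rather than D and not the program mentioned above. A background thread reads ahead into a bounded queue while the caller processes chunks already read. The name `read_chunks_async` and the `chunk_size` and `depth` defaults are illustrative assumptions.

```python
import queue
import threading


def read_chunks_async(path, chunk_size=1 << 20, depth=4):
    """Yield the file's bytes chunk by chunk while a thread reads ahead.

    Assumption: `depth` bounds how many chunks may be buffered, so memory
    use stays around depth * chunk_size bytes.
    """
    chunks = queue.Queue(maxsize=depth)

    def reader():
        try:
            with open(path, "rb") as f:
                while True:
                    chunk = f.read(chunk_size)
                    if not chunk:
                        break
                    chunks.put(chunk)  # blocks while the queue is full
            chunks.put(None)  # end-of-file marker
        except Exception as exc:
            chunks.put(exc)  # hand I/O errors over to the consumer

    # Daemon thread: it will not keep the process alive if the consumer
    # stops iterating early.
    worker = threading.Thread(target=reader, daemon=True)
    worker.start()

    while True:
        item = chunks.get()
        if item is None:
            break
        if isinstance(item, Exception):
            raise item
        yield item
    worker.join()
```

For instance, `sum(chunk.count(b"\n") for chunk in read_chunks_async(path))` counts lines while the next chunk is being read. Whether this helps depends on whether the run is IO bound or CPU bound, which is exactly the open question in this thread.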

That said, I'll be making this priority no. 2. I'd like to make the parser work perfectly first, and in a way that is easily upgradeable and usable. Mr. bio made it perfectly clear that he needed support for whitespace and line feeds ;)
