On 2013-02-06 21:43, monarch_dodra wrote:
On Wednesday, 6 February 2013 at 19:19:52 UTC, FG wrote:
>> I have processed the file SRR077487_1.filt.fastq from
>> ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG00096/sequence_read/
>> and expected this syntax (no multiline sequences or whitespace).
>> The file takes up almost 6 GB; processing took 1m45s, twice as fast as the
>> fastest D solution so far.

> Do you mean my solution above? I tried your solution with dmd, with -release -O
> -inline, and both gave about the same result (69s yours, 67s mine).

Yes. Maybe CPU is the bottleneck on my end.
With DMD32 2.060 on Win7-64, compiled with the same flags, I got:
MD: 4m30s / FG: 1m55s - both using 100% of one core.
Quite similar results with GDC64.

Did you time the same file, SRR077487_1.filt.fastq, at 67s?


> I'm getting really interested in the subject. I'm going to try to write an actual
> library/framework for working with fastq files in a D environment.

Those fastq are contagious. ;)

> This means I'll try to write robust and usable code, with both stability and
> performance in mind, as opposed to the "proofs of concept" so far.

Yeah, but the big deal was that D is 5.5x slower than C++.

You mentioned something about using byLine. Well, I would have gladly used
it instead of looking for line ends myself and pushing stuff around with memcpy.
But the thing is that while fgets(char *buf, int bufSize, FILE *f) in fastx
is fast at reading a file line by line, file.readln(buf) is unpredictable. :)
I mean that in DMD it's only a bit slower than file.rawRead(buf), but in GDC
it can be several times slower. For example, just reading in a loop:

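    import std.stdio;
    enum uint bufferSize = 4096 - 16;
    void main(string[] args) {
        // Guard against missing arguments before indexing args[1] and args[2].
        if (args.length < 3) {
            writeln("Use parameters: <filename> raw|readln");
            return;
        }
        char[] tmp, buf = new char[bufferSize];
        size_t cnt;
        auto f = File(args[1], "r");
        switch(args[2]) {
            case "raw":
                // Read fixed-size blocks until rawRead returns an empty slice (EOF).
                do tmp = f.rawRead(buf); while (tmp.length);
                break;

            case "readln":
                // Read one line per call; readln returns 0 at EOF.
                do cnt = f.readln(buf); while (cnt);
                break;

            default: writeln("Use parameters: <filename> raw|readln");
        }
    }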

Tested on a much smaller SRR077487.filt.fastq:
DMD32 -release -O -inline: raw 94ms / readln 450ms
GDC64 -O3:                 raw 94ms / readln 6.76s

Tested on SRR077487_1.filt.fastq:
DMD32 -release -O -inline: raw 1m44s / readln  1m55s
GDC64 -O3:                 raw 1m48s / readln 14m16s

Why is there such a big difference between DMD and GDC (on Windows)?
(Or have I missed some switch in GDC?)
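
For reference, "looking for line ends myself" boils down to something like the minimal sketch below. It only counts line ends in the blocks returned by rawRead, so it is not my actual solution, and the names countNewlines and countLinesRaw are made up for this illustration.

```d
import std.stdio : File;

/// Counts the '\n' bytes in one block, e.g. a slice returned by File.rawRead.
/// (Illustrative helper, name is hypothetical.)
size_t countNewlines(const(char)[] chunk)
{
    size_t n;
    foreach (c; chunk)
        if (c == '\n')
            ++n;
    return n;
}

/// Reads the file in fixed-size blocks and sums the line ends found in each.
/// (Illustrative helper; the default block size matches the test program above.)
size_t countLinesRaw(string path, size_t bufferSize = 4096 - 16)
{
    auto f = File(path, "rb");
    auto buf = new char[bufferSize];
    size_t lines;
    for (;;)
    {
        auto chunk = f.rawRead(buf);   // slice of buf that was actually filled
        if (chunk.length == 0)         // empty slice means end of file
            break;
        lines += countNewlines(chunk);
    }
    return lines;
}
```

Calling countLinesRaw(args[1]) from main and dividing the result by four should give the number of reads, assuming the strict four-lines-per-record syntax (no multiline sequences).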
