I decided to give the code a spin with `gdc -O3 -pg`. Turns out that the hotspot is in std.array.split, contrary to expectations. :-) Here are the first few lines of the gprof output:
-----snip----- Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls ms/call ms/call name 19.77 0.43 0.43 76368364 0.00 0.00 nothrow @safe int std.array.split!(char[]).split(char[]).__foreachbody2(ref ulong, ref dchar) 16.74 0.79 0.36 nothrow void gc.gc.Gcx.mark(void*, void*, int) 10.23 1.01 0.22 _aApplycd2 8.14 1.18 0.18 _d_arrayappendcTX 4.19 1.27 0.09 nothrow ulong gc.gc.Gcx.fullcollect() -----snip----- As you can see, std.array.split takes up almost 20% of the total running time. The surprising (or not-so-surprising?) second runner-up for hot spot is actually the GC's marking algorithm. I'm guessing this is likely because of the extensive use of small GC-allocated arrays (splitting into strings, and the like). The 3rd entry, _aApplycd2, appears to be the druntime implementation of the foreach loop, so I'm not sure what could be done about it. Anyway, just for kicks, I tried various ways of reducing the cost of std.array.split, but didn't get very far. Replacing it with std.regex.split didn't help. Looking at its implementation, while it does allocate a new array each time, it also slices over the input when constructing the substrings, so it didn't seem as inefficient as I first thought. Next I tried disabling the GC with core.memory.GC.disable(). Immediately, I got a 20% performance boost. Of course, running with a disabled GC will soon eat up all memory and crash, which isn't an option in real-world usage, so the next best solution is to manually schedule GC collection cycles, say call GC.collect() every n iterations of the parsing loop, for some value of n. I tried implementing a crude version of this (see code below), and found that manually calling GC.collect() even as frequently as once every 5000 loop iterations (for a 500,000 line test input file) still gives about 15% performance improvement over completely disabling the GC. Since most of the arrays involved here are pretty small, the frequency could be reduced to once every 50,000 iterations and you'd pretty much get the 20% performance boost for free, and still not run out of memory too quickly. Note that all measurements were done with `gdc -O3`. I did try the original code with `dmd -O`, and found that it was 20% slower than the gdc version, so I didn't look further. Anyway, here's the code with the manual GC collection schedule (I modified main() slightly so that I could easily test the code with different input files, so you have to specify the input filename as an argument to the program when running it): -------snip------- #!/usr/bin/env rdmd // Programmed in the D language // Fredrik Boulund 2015-09-09 // Modified by H. S. Teoh 2015-09-14 import core.memory; // for GC control import std.stdio; import std.array; import std.conv; import std.algorithm; struct Hit { string target; float pid; int matches, mismatches, gaps, qstart, qstop, tstart, tstop; double evalue, bitscore; } enum manualGcCount = 5_000; ulong ticksToCollect = manualGcCount; void gcTick() { if (--ticksToCollect == 0) { GC.collect(); ticksToCollect = manualGcCount; } } void custom_parse(string filename) { float min_identity = 90.0; int min_matches = 10; Hit[][string] hitlists; foreach (record; filename .File .byLine .map!split .filter!(l => l[2].to!float >= min_identity && l[3].to!int >= min_matches)) { hitlists[record[0].to!string] ~= Hit(record[1].to!string, record[2].to!float, record[3].to!int, record[4].to!int, record[5].to!int, record[6].to!int, record[7].to!int, record[8].to!int, record[9].to!int, record[10].to!double, record[11].to!double); gcTick(); } foreach (query; hitlists.byKey) { float max_pid = reduce!((a,b) => max(a, b.pid))(-double.max, hitlists[query]); float max_pid_diff = 5.00; hitlists[query] = hitlists[query].filter!(h => h.pid >= (max_pid - max_pid_diff)).array(); writeln(query, ": ", hitlists[query].length); gcTick(); } } void main(string[] args) { auto filename = args[1]; //auto filename = "blat.blast8"; //auto filename = "./QE_150828_46.blast8"; GC.disable(); custom_parse(filename); } -------snip------- T -- Stop staring at me like that! It's offens... no, you'll hurt your eyes!