On Sunday, 17 November 2013 at 04:14:22 UTC, Andrei Alexandrescu wrote:
Probably a good step forward would be to hook a sort benchmarking corpus to our unittests. What are the consecrated corpora?

I'll kick off the suggestions with this:

File to provide "real" data:
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655_uid57779/NC_000913.fna

From http://www.genome.jp/kegg-bin/show_organism?org=eco -> RefSeq -> NC_000913.fna

(despite this being real data, typically you don't actually sort genomes like this ... it's simply "more interesting" than a pseudo-randomized dataset)

Code:

---
import std.range, std.algorithm, std.stdio, std.array, std.conv, std.datetime;

enum strategy = SwapStrategy.stable;

void main() {
    auto arr = File("NC_000913.fna").byLine
.drop(1) // Trim the first line because it's just metadata
            /*.take(50)*/
            .join
            .chunks(50)
            .map!text
            .array;
    StopWatch sw = StopWatch(AutoStart.yes);
    arr.sort!("a < b", strategy)();
    sw.stop();
writeln(text(arr.length, " elements sorted in ", sw.peek.msecs, " msecs"));
}
---

On my system, this shows unstable sort to be ~2x faster than stable sort (which is to be expected of data with no discernible patterns).

The "key" thing this is testing is the effectiveness of the sort on data that often takes longer to compare (with such a small alphabet, many strings have a common prefix).

That said, it might also be reproduced "well enough" using a random generator to create similar strings to sort, but the basic idea is there. I just like using real genomes for performance testing things :)

Reply via email to