On Tuesday, 19 January 2016 at 00:11:40 UTC, Andrei Alexandrescu
It is not completely clear from this thread what the conclusion was
wrt getting the known topN performance issues addressed. From the pull
requests it appears the identified fixes are in the current release
versions of DMD/LDC. However, I hit significant issues on one
of the first large data sets I tried. Not artificial data, but
one with very skewed distributions of values (a Google ngram
So let me summarize what has happened:
1. topN was reportedly slow. It was using a random pivot. I
made it use getPivot (deterministic) instead of a random pivot.
getPivot is also what sort uses.
Details here: https://issues.dlang.org/show_bug.cgi?id=16517.
It includes a test program and the URL for the ngram file.
A brief summary: the data file is a TSV file with 3 numeric fields,
a bit over 10 million values each, with different distribution
properties. I used both topN and sort to get the median value. Since
this was the median, topN was asked for the mid-point value, not one
at either end. (This is a specific callout for some of the issues
Timing comparison of sort and topN, times in milliseconds:
              sort      topN
   Field 2:    289      1756
   Field 3:    285    148793
   Field 4:    273    668906
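
For readers who have not looked at the bug report, the median lookup being timed can be sketched in D roughly as follows. This is only a minimal sketch, not the test program attached to the issue: the helper names medianBySort and medianByTopN and the small sample array are made up for illustration, and both helpers assume a non-empty input. The only library calls used are sort and topN from std.algorithm.sorting.

```d
import std.algorithm.sorting : sort, topN;

// Hypothetical helper: median via a full sort, O(n log n).
double medianBySort(const(double)[] values)
{
    auto copy = values.dup;          // leave the caller's data untouched
    sort(copy);
    return copy[$ / 2];              // mid-point element (upper one for even lengths)
}

// Hypothetical helper: median via topN, which only partially orders the data.
double medianByTopN(const(double)[] values)
{
    auto copy = values.dup;
    immutable nth = copy.length / 2;
    topN(copy, nth);                 // copy[nth] is now what a full sort would put there
    return copy[nth];
}

void main()
{
    import std.stdio : writeln;

    // Made-up sample with repeated values; the real data has ~10 million per field.
    double[] sample = [7.0, 1.0, 5.0, 3.0, 9.0, 3.0, 3.0, 8.0, 2.0];

    // Both approaches must agree on the mid-point value.
    assert(medianBySort(sample) == medianByTopN(sample));
    writeln(medianByTopN(sample));
}
```

Both helpers return the same value; the timings above are about how long each takes to get there on skewed data, with nth at the mid-point rather than near either end.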
The above times are for LDC 1.1.0-beta2 (DMD 2.071.1). Similar
behavior is seen for DMD 2.071.2. This makes topN pretty much