On Tuesday, 19 January 2016 at 00:11:40 UTC, Andrei Alexandrescu wrote:

So let me summarize what has happened:

1. topN was reportedly slow. It was using a random pivot. I made it use getPivot (deterministic) instead of a random pivot in https://github.com/D-Programming-Language/phobos/pull/3921. getPivot is also what sort uses.

[snip]

Not completely clear from this thread what the conclusion was wrt getting known topN performance issues addressed. From pull requests it appears identified fixes are in current release versions of the DMD/LDC. However, I hit significant issues on one of the first large data sets I tried. Not an artificial data, but one with very skewed distributions of values (a google ngram file).

Details here: https://issues.dlang.org/show_bug.cgi?id=16517. Includes test program, url for the ngram file.

A brief summary - Data file is a TSV file with 3 numeric fields, a bit over 10 million values each with different distribution properties. Used both topN and sort get the median value. Since this was median, it topN for the mid-point value, not at one end or the other. (This is a specific callout for some of the issues identified.)

Timing comparison of sort and topN, times in milliseconds:

          sort      topN
Field 2:   289      1756
Field 3:   285    148793
Field 4:   273    668906

The above times are for LDC 1.1.0-beta2 (DMD 2.071.1). Similar behavior is seen for DMD 2.071.2. This makes topN pretty much unusable.

Reply via email to