Re: topN using a heap

Jon Degenhardt via Digitalmars-d Wed, 21 Sep 2016 01:22:01 -0700

On Tuesday, 19 January 2016 at 00:11:40 UTC, Andrei Alexandrescu wrote:

So let me summarize what has happened:
1. topN was reportedly slow. It was using a random pivot. I made it use getPivot (deterministic) instead of a random pivot in https://github.com/D-Programming-Language/phobos/pull/3921. getPivot is also what sort uses.
[snip]

It is not completely clear from this thread what the conclusion was wrt getting the known topN performance issues addressed. From the pull requests, it appears the identified fixes are in the current release versions of DMD/LDC. However, I hit significant issues on one of the first large data sets I tried. It is not artificial data, but a data set with very skewed distributions of values (a Google ngram file).

Details here: https://issues.dlang.org/show_bug.cgi?id=16517. It includes a test program and the URL for the ngram file.

A brief summary - The data file is a TSV file with 3 numeric fields, a bit over 10 million values each, with different distribution properties. I used both topN and sort to get the median value. Since this was the median, topN was called for the mid-point value, not for a position at one end or the other. (This is a specific callout for some of the issues identified.)
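
For readers who want to see the two approaches side by side, here is a minimal sketch of getting a median with sort versus topN. It is not the test program from the bug report; the function names medianBySort and medianByTopN and the sample values are hypothetical, and for an even number of values it simply takes the upper of the two middle elements.

```d
import std.algorithm.sorting : sort, topN;

/// Median via a full sort. Works on a copy so the input is left untouched.
/// (Hypothetical helper, not from the original test program.)
double medianBySort(const double[] values)
{
    assert(values.length > 0);
    auto copy = values.dup;
    sort(copy);
    return copy[copy.length / 2];
}

/// Median via topN: partitions the copy so that the element at index mid
/// is the one a full sort would place there, without ordering the rest.
/// (Hypothetical helper, not from the original test program.)
double medianByTopN(const double[] values)
{
    assert(values.length > 0);
    auto copy = values.dup;
    immutable mid = copy.length / 2;
    topN(copy, mid);
    return copy[mid];
}

unittest
{
    // Small made-up sample with repeated values; the real data had
    // a bit over 10 million values per field.
    const double[] sample = [7, 1, 5, 3, 9, 3, 3];
    assert(medianBySort(sample) == 3);
    assert(medianByTopN(sample) == 3);
}
```

Both functions should return the same value for any input; the difference reported in this post is only in how long the topN path takes on skewed data.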


Timing comparison of sort and topN, times in milliseconds:

          sort      topN
Field 2:   289      1756
Field 3:   285    148793
Field 4:   273    668906

The above times are for LDC 1.1.0-beta2 (DMD 2.071.1). Similar behavior is seen with DMD 2.071.2. This makes topN pretty much unusable for this kind of data.
