On Sunday, 17 November 2013 at 16:57:19 UTC, Andrei Alexandrescu wrote:
> Then, people may want to sort records by e.g. Lastname, Firstname, or index a text by words. For names we'd need some census data or phonebook.

Sure.

As a source of data, app_c.csv from http://www.census.gov/genealogy/www/data/2000surnames/ -> File B: Surnames...


Test case:
---
import std.stdio, std.algorithm, std.random, std.range, std.conv, std.datetime;

// Name generator constants
enum NumberOfPossibleNames = 1_000_000;
enum NumberOfRandomNames = 1_000_000;

enum SortStrategy = SwapStrategy.unstable;

void main() {
    auto data = File("app_c.csv").byLine
            .drop(1) // First line is just column names
            .take(NumberOfPossibleNames)
            .map!(e => e.text.split(","))
            .array;
    auto names = data.map!(e => e[0]).array;
    auto proportions = data.map!(e => e[2].to!size_t).array;

    auto rnd = Random(50);
    auto randomNames = fastDice(rnd, proportions)
            .take(NumberOfRandomNames)
            .map!(i => names[i])
            .array;

    auto sw = StopWatch(AutoStart.yes);
    randomNames.sort!("a < b", SortStrategy)();
    sw.stop();

    writeln(randomNames.length, " names sorted in ",
        sw.peek.msecs, " msecs using ", SortStrategy, " sort");
}

struct FastDice(Rng, SearchPolicy pol = SearchPolicy.gallop) {
    SortedRange!(size_t[]) _propCumSumSorted;
    size_t _sum;
    size_t _front;
    Rng* _rng;

    this(ref Rng rng, size_t[] proportions) {
        size_t[] _propCumSum = proportions.dup; // copy so the caller's array stays untouched
        _rng = &rng;

        // Turn the proportions into a cumulative sum, in place.
        size_t mass = 0;
        foreach (ref e; _propCumSum) {
            mass += e;
            e = mass;
        }

        _sum = _propCumSum.back;

        _propCumSumSorted = assumeSorted(_propCumSum);

        popFront();
    }

    void popFront() {
        immutable point = uniform(0, _sum, *_rng);
        assert(point < _sum);

        // Index i owns the points [cumSum[i-1], cumSum[i]), so the winning
        // index is the number of cumulative sums <= point. (lowerBound on
        // the raw point would miscount points that land exactly on a
        // cumulative-sum boundary.)
        _front = _propCumSumSorted.lowerBound!pol(point + 1).length;
    }

    enum empty = false; // infinite range: there is always another throw

    @property
    auto front() {
        return _front;
    }
}

auto fastDice(Rng)(ref Rng rng, size_t[] proportions) {
    return FastDice!(Rng)(rng, proportions);
}

auto fastDice(size_t[] proportions) {
    return fastDice(rndGen, proportions);
}
---

Results (using `-O -inline -release -noboundscheck` on my computer):
unstable sort: 738 msecs
stable sort: 1001 msecs


Also, to do this in reasonable time I had to create a new range, which I called "FastDice". It does the same job as std.random.dice, but is meant for cases where you'll be rolling repeatedly against the same data: it does a bit of precomputation (building a cumulative-sum array) and then uses binary or gallop search to cut the cost of each throw. I opted for gallop here since the data is sorted so that the most common names come first.
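The cumulative-sum lookup can be seen in isolation with a tiny example (the 5/3/2 proportions here are made up, not from the census data):

---
import std.range : assumeSorted;
import std.stdio : writeln;

void main() {
    // Hypothetical proportions: A=5, B=3, C=2 (total mass 10).
    size_t[] cumSum = [5, 8, 10]; // cumulative sums, precomputed once

    // A throw is a uniform point in [0, 10); the winning index is the
    // number of cumulative sums <= point, found by binary search.
    auto sorted = assumeSorted(cumSum);
    foreach (point; [0, 4, 5, 9]) {
        auto index = sorted.lowerBound(point + 1).length;
        writeln("point ", point, " -> index ", index);
    }
    // Points 0..4 map to index 0, 5..7 to index 1, 8..9 to index 2,
    // matching the 5/3/2 proportions.
}
---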

It needs a bit of work to be truly generic, but it's a good start, and I'll just say it's WTFPL code, so use it for whatever. It's especially handy for generating test cases like the one above.

FWIW, FastDice takes ~400 ms to generate all those throws. A back-of-the-envelope calculation on std.random.dice suggested that just caching the sum would save maybe 30 minutes per run (no, I didn't wait that long; I stopped after 5 minutes and wrote FastDice). The switch from linear to gallop search is significant as well: it's the difference between O(n) and O(log n) per throw.
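For concreteness, here's the rough arithmetic behind that claim (the ~150,000 distinct surnames is my assumption about the size of app_c.csv; the throw count matches the benchmark above):

---
import std.math : log2;
import std.stdio : writefln;

void main() {
    immutable entries = 150_000.0;   // assumed distinct surnames
    immutable throws = 1_000_000.0;  // throws in the benchmark

    // A linear scan over the proportions averages ~entries/2 steps per throw.
    writefln("linear: ~%.1e steps total", throws * entries / 2);

    // Binary/gallop search over the cumulative sums: ~log2(entries) per throw.
    writefln("gallop: ~%.1e steps total", throws * log2(entries));
}
---

Several thousand times fewer steps, which is roughly the gap between "stopped after 5 minutes" and 400 ms.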
