On Sunday, 17 November 2013 at 16:57:19 UTC, Andrei Alexandrescu wrote:
> Then, people may want to sort records by e.g. Lastname, Firstname, or index a text by words. For names we'd need some census data or phonebook.

Sure.

As a source of data, app_c.csv from http://www.census.gov/genealogy/www/data/2000surnames/ -> File B: Surnames...


Test case:
---
import std.stdio, std.algorithm, std.random, std.range, std.conv, std.datetime;

// Name generator constants
enum NumberOfPossibleNames = 1_000_000;
enum NumberOfRandomNames = 1_000_000;

enum SortStrategy = SwapStrategy.unstable;

void main() {
    auto data = File("app_c.csv").byLine
            .drop(1) // First line is just column names
            .take(NumberOfPossibleNames)
            .map!(e => e.text.split(","))
            .array;
    auto names = data.map!(e => e[0]).array;
    auto proportions = data.map!(e => e[2].to!size_t).array;

    auto rnd = Random(50);
    auto randomNames = fastDice(rnd, proportions)
            .take(NumberOfRandomNames)
            .map!(i => names[i])
            .array;

    auto sw = StopWatch(AutoStart.yes);
    randomNames.sort!("a < b", SortStrategy)();
    sw.stop();

    writeln(randomNames.length, " names sorted in ",
        sw.peek.msecs, " msecs using ", SortStrategy, " sort");
}

struct FastDice(Rng, SearchPolicy pol = SearchPolicy.gallop) {
    SortedRange!(size_t[]) _propCumSumSorted;
    size_t _sum;
    size_t _front;
    Rng* _rng;

    this(ref Rng rng, size_t[] proportions) {
        size_t[] _propCumSum = proportions.dup; // copy so the caller's array stays untouched
        _rng = &rng;

        // Turn the proportions into a cumulative sum, in place.
        size_t mass = 0;
        foreach (ref e; _propCumSum) {
            mass += e;
            e = mass;
        }

        _sum = _propCumSum.back;

        _propCumSumSorted = assumeSorted(_propCumSum);

        popFront();
    }

    void popFront() {
        immutable point = uniform(0, _sum, *_rng);
        assert(point < _sum);

        // Index i owns the points [cumSum[i-1], cumSum[i]), so the winning
        // index is the number of cumulative sums <= point. (lowerBound on
        // the raw point would miscount points that land exactly on a
        // cumulative-sum boundary.)
        _front = _propCumSumSorted.lowerBound!pol(point + 1).length;
    }

    enum empty = false; // infinite range: there is always another throw

    @property
    auto front() {
        return _front;
    }
}

auto fastDice(Rng)(ref Rng rng, size_t[] proportions) {
    return FastDice!(Rng)(rng, proportions);
}

auto fastDice(size_t[] proportions) {
    return fastDice(rndGen, proportions);
}
---

Results (using `-O -inline -release -noboundscheck` on my computer):
unstable sort: 738 msecs
stable sort: 1001 msecs


Also, to do this in reasonable time I had to create a new range, which I called "FastDice". It does the same job as std.random.dice, but is meant for cases where you'll be rolling repeatedly against the same data: it does a bit of precomputation (building a cumulative-sum array) and then uses binary or gallop search to cut the cost of each throw. I opted for gallop here since the data is sorted so that the most common names come first.
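The cumulative-sum lookup can be seen in isolation with a tiny example (the 5/3/2 proportions here are made up, not from the census data):

---
import std.range : assumeSorted;
import std.stdio : writeln;

void main() {
    // Hypothetical proportions: A=5, B=3, C=2 (total mass 10).
    size_t[] cumSum = [5, 8, 10]; // cumulative sums, precomputed once

    // A throw is a uniform point in [0, 10); the winning index is the
    // number of cumulative sums <= point, found by binary search.
    auto sorted = assumeSorted(cumSum);
    foreach (point; [0, 4, 5, 9]) {
        auto index = sorted.lowerBound(point + 1).length;
        writeln("point ", point, " -> index ", index);
    }
    // Points 0..4 map to index 0, 5..7 to index 1, 8..9 to index 2,
    // matching the 5/3/2 proportions.
}
---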

It needs a bit of work to be truly generic, but it's a good start, and I'll just say it's WTFPL code, so use it for whatever. It's especially handy for generating test cases like the one above.

FWIW, FastDice takes ~400 ms to generate all those throws. A back-of-the-envelope calculation on std.random.dice suggested that just caching the sum would save maybe 30 minutes per run (no, I didn't wait that long; I stopped after 5 minutes and wrote FastDice). The switch from linear to gallop search is significant as well: it's the difference between O(n) and O(log n) per throw.
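For concreteness, here's the rough arithmetic behind that claim (the ~150,000 distinct surnames is my assumption about the size of app_c.csv; the throw count matches the benchmark above):

---
import std.math : log2;
import std.stdio : writefln;

void main() {
    immutable entries = 150_000.0;   // assumed distinct surnames
    immutable throws = 1_000_000.0;  // throws in the benchmark

    // A linear scan over the proportions averages ~entries/2 steps per throw.
    writefln("linear: ~%.1e steps total", throws * entries / 2);

    // Binary/gallop search over the cumulative sums: ~log2(entries) per throw.
    writefln("gallop: ~%.1e steps total", throws * log2(entries));
}
---

Several thousand times fewer steps, which is roughly the gap between "stopped after 5 minutes" and 400 ms.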
