On 11/17/13 6:20 AM, Chris Cain wrote:
> I'm not suggesting my benchmark be the only one; if we're going to use
> pseudorandom data (I'm not certain we could actually get "realistic
> data" that would serve us that much better) we might as well have
> different test cases that stress the sort routine in different ways.
> Obviously, using the real genome would be preferable to generating some
> (since it's actually truly "real" data, just used in an unorthodox way)
> but there's a disadvantage to attaching a 4.6MB file to a benchmarking
> setup. Especially if more might come.

OK, since I see you have some interest...

You said nobody would care to actually sort genome data. I'm aiming for data that's likely to be a good proxy for tasks people are interested in doing.

For example, people may be interested in sorting floating-point numbers resulting from sales, measurements, frequencies, probabilities, and whatnot. Since many such quantities are roughly Gaussian-distributed, a corpus of Gaussian-distributed measurements would be a good test case.
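A minimal sketch of such a generator (the function name, fixed seed, and distribution parameters are my own choices, not anything agreed on):

```python
import random

def gaussian_corpus(n, mu=100.0, sigma=15.0, seed=42):
    """n floating-point values drawn from a Gaussian, as a stand-in
    for sales figures, measurements, frequencies, and the like."""
    rng = random.Random(seed)  # fixed seed keeps benchmark runs comparable
    return [rng.gauss(mu, sigma) for _ in range(n)]

data = gaussian_corpus(1_000_000)
data.sort()  # the sort routine under test goes here
```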

Then, people may want to sort things by date/time. Depending on the time scale the distribution differs: diurnal cycle, weekly cycle, seasonal cycle, secular ebbs and flows, etc. I'm unclear on what would make a good data set here. For sub-day time ranges a uniform distribution may be appropriate.
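A sketch of two timestamp generators, using seconds-since-midnight as the sort key. The uniform one follows the sub-day suggestion above; the diurnal one is only a crude invented model (two Gaussian peaks at made-up hours):

```python
import random

def subday_timestamps(n, seed=42):
    """Uniformly distributed seconds within one day (0..86400)."""
    rng = random.Random(seed)
    return [rng.uniform(0.0, 86_400.0) for _ in range(n)]

def diurnal_timestamps(n, seed=42):
    """A crude diurnal cycle: a mixture of two Gaussian peaks
    (the peak hours and spreads are invented for illustration)."""
    rng = random.Random(seed)
    peaks = [(9 * 3600, 2 * 3600), (19 * 3600, 3 * 3600)]  # (mean, sigma)
    out = []
    for _ in range(n):
        mu, sigma = rng.choice(peaks)
        out.append(rng.gauss(mu, sigma) % 86_400)  # wrap into a single day
    return out
```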

Then, people may want to sort records by, e.g., Lastname, Firstname, or index a text by words. For names we'd need some census data or a phonebook. For general text sorting we can use classic texts such as Alice in Wonderland or the King James Bible (see http://corpus.canterbury.ac.nz/descriptions/). Sorting by word length is another possibility (word lengths are probably Gaussian-distributed).
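As a sketch, sorting words from such a text by dictionary order and by length. I'm assuming the Canterbury corpus file alice29.txt (its copy of Alice in Wonderland) sits in the working directory, and the record example uses made-up names:

```python
import re

# Words from a classic text (alice29.txt assumed present locally).
with open("alice29.txt", encoding="latin-1") as f:
    words = re.findall(r"[A-Za-z']+", f.read())

by_word   = sorted(words, key=str.lower)  # plain dictionary order
by_length = sorted(words, key=len)        # sort by word length, as above

# Records sorted by Lastname, then Firstname (names are invented).
people = [("Liddell", "Alice"), ("Carroll", "Lewis"), ("Liddell", "Henry")]
people.sort(key=lambda p: (p[0], p[1]))
```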

Uniform random data is also a baseline: not terribly representative, but worth keeping an eye on. In fact, uniform data is unfairly rigged in quicksort's favor: any pivot is likely to be pretty good, and there are none of the sorted runs that often occur in real data.
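A sketch contrasting the two: the second generator rearranges uniform data into presorted runs of the kind real inputs tend to contain (the run length is an arbitrary choice of mine):

```python
import random

def uniform_corpus(n, seed=42):
    """The uniform baseline: pivot-friendly and free of presorted runs."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

def runs_corpus(n, run_len=1_000, seed=42):
    """Uniform data with each run_len-sized chunk presorted, to mimic
    the partially ordered inputs common in real data."""
    data = uniform_corpus(n, seed)
    for i in range(0, n, run_len):
        data[i:i + run_len] = sorted(data[i:i + run_len])
    return data
```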


Andrei
