On 11/17/13 6:20 AM, Chris Cain wrote:
> I'm not suggesting my benchmark be the only one; if we're going to use
> pseudorandom data (I'm not certain we could actually get "realistic
> data" that would serve us that much better) we might as well have
> different test cases that stress the sort routine in different ways.
> Obviously, using the real genome would be preferable to generating some
> (since it's actually truly "real" data, just used in an unorthodox way)
> but there's a disadvantage to attaching a 4.6MB file to a benchmarking
> setup. Especially if more might come.

OK, since I see you have some interest...

You said nobody would care to actually sort genome data. I'm aiming for data that's likely to be a good proxy for tasks people are interested in doing.

For example, people may be interested in sorting floating-point numbers resulting from sales, measurements, frequencies, probabilities, and whatnot. Since many such quantities are roughly Gaussian-distributed, a corpus of Gaussian-distributed measurements would be a good test case.
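A minimal sketch of such a generator (the function name, fixed seed, and distribution parameters are my own choices, not anything agreed on):

```python
import random

def gaussian_corpus(n, mu=100.0, sigma=15.0, seed=42):
    """n floating-point values drawn from a Gaussian, as a stand-in
    for sales figures, measurements, frequencies, and the like."""
    rng = random.Random(seed)  # fixed seed keeps benchmark runs comparable
    return [rng.gauss(mu, sigma) for _ in range(n)]

data = gaussian_corpus(1_000_000)
data.sort()  # the sort routine under test goes here
```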

Then, people may want to sort things by date/time. Depending on the time scale the distribution differs: diurnal cycle, weekly cycle, seasonal cycle, secular ebbs and flows, etc. I'm unclear on what would make a good data set here. For sub-day time ranges a uniform distribution may be appropriate.
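A sketch of two timestamp generators, using seconds-since-midnight as the sort key. The uniform one follows the sub-day suggestion above; the diurnal one is only a crude invented model (two Gaussian peaks at made-up hours):

```python
import random

def subday_timestamps(n, seed=42):
    """Uniformly distributed seconds within one day (0..86400)."""
    rng = random.Random(seed)
    return [rng.uniform(0.0, 86_400.0) for _ in range(n)]

def diurnal_timestamps(n, seed=42):
    """A crude diurnal cycle: a mixture of two Gaussian peaks
    (the peak hours and spreads are invented for illustration)."""
    rng = random.Random(seed)
    peaks = [(9 * 3600, 2 * 3600), (19 * 3600, 3 * 3600)]  # (mean, sigma)
    out = []
    for _ in range(n):
        mu, sigma = rng.choice(peaks)
        out.append(rng.gauss(mu, sigma) % 86_400)  # wrap into a single day
    return out
```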

Then, people may want to sort records by, e.g., Lastname, Firstname, or index a text by words. For names we'd need some census data or a phonebook. For general text sorting we can use classic texts such as Alice in Wonderland or the King James Bible (see http://corpus.canterbury.ac.nz/descriptions/). Sorting by word length is another possibility (word lengths are probably Gaussian-distributed).
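As a sketch, sorting words from such a text by dictionary order and by length. I'm assuming the Canterbury corpus file alice29.txt (its copy of Alice in Wonderland) sits in the working directory, and the record example uses made-up names:

```python
import re

# Words from a classic text (alice29.txt assumed present locally).
with open("alice29.txt", encoding="latin-1") as f:
    words = re.findall(r"[A-Za-z']+", f.read())

by_word   = sorted(words, key=str.lower)  # plain dictionary order
by_length = sorted(words, key=len)        # sort by word length, as above

# Records sorted by Lastname, then Firstname (names are invented).
people = [("Liddell", "Alice"), ("Carroll", "Lewis"), ("Liddell", "Henry")]
people.sort(key=lambda p: (p[0], p[1]))
```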

Uniform random data is also a baseline: not terribly representative, but worth keeping an eye on. In fact, uniform data is unfairly rigged in quicksort's favor: any pivot is likely to be pretty good, and there are none of the sorted runs that often occur in real data.
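A sketch contrasting the two: the second generator rearranges uniform data into presorted runs of the kind real inputs tend to contain (the run length is an arbitrary choice of mine):

```python
import random

def uniform_corpus(n, seed=42):
    """The uniform baseline: pivot-friendly and free of presorted runs."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

def runs_corpus(n, run_len=1_000, seed=42):
    """Uniform data with each run_len-sized chunk presorted, to mimic
    the partially ordered inputs common in real data."""
    data = uniform_corpus(n, seed)
    for i in range(0, n, run_len):
        data[i:i + run_len] = sorted(data[i:i + run_len])
    return data
```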


Andrei
