Re: Worst-case performance of quickSort / getPivot

Chris Cain Sun, 17 Nov 2013 06:26:07 -0800

On Sunday, 17 November 2013 at 07:19:26 UTC, Andrei Alexandrescu wrote:

On 11/16/13 9:21 PM, Chris Cain wrote:
That said, it might also be reproduced "well enough" using a random generator to create similar strings to sort, but the basic idea is there. I just like using real genomes for performance testing things :)
I am hoping for some more representative corpora, along the lines of http://sortbenchmark.org/. Some data that we can use as good proxies for typical application usage.
Andrei

I think I get what you're saying, but sortbenchmark.org uses completely pseudorandom (but reproducible) entries that I don't think are representative of real data either:


(using gensort -a minus the verification columns)
---
AsfAGHM5om
~sHd0jDv6X
uI^EYm8s=|
Q)JN)R9z-L
o4FoBkqERn
*}-Wz1;TD-
0fssx}~[oB
...
---

Most places use very fake data as proxies for real data. It's better to have something somewhat structured and choose data that, despite not being real data, stresses the benchmark in a unique way.

I'm not suggesting my benchmark be the only one; if we're going to use pseudorandom data (I'm not certain we could actually get "realistic data" that would serve us that much better), we might as well have different test cases that stress the sort routine in different ways. Obviously, using the real genome would be preferable to generating one (since it's actually truly "real" data, just used in an unorthodox way), but there's a disadvantage to attaching a 4.6MB file to a benchmarking setup, especially if more might come.
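As a rough illustration of the "generate it instead of attaching it" option, here is a minimal Python sketch. It assumes plain uniform draws over the four-letter DNA alphabet, which is certainly simpler than a real genome; the function name make_dna_strings and the sizes used are made up for the example.

```python
import random

def make_dna_strings(count, length, seed=0):
    """Return `count` pseudorandom strings of `length` letters over ACGT.

    Assumption: uniform, independent letters. A real genome has
    structure (repeats, skewed base frequencies) that this ignores.
    """
    rng = random.Random(seed)  # fixed seed keeps the test case reproducible
    return ["".join(rng.choice("ACGT") for _ in range(length))
            for _ in range(count)]

# Hypothetical sizes, chosen only to keep the example quick to run.
strings = make_dna_strings(1000, 50)
sorted_strings = sorted(strings)
```

Because the generator is seeded, the same input can be recreated anywhere without shipping a data file.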

Anyway, it's a reasonable representation of "data that has no discernible order that can occasionally take some time to compare." Think something like sorting a list of customer records by name. If they're ordered by ID, then the names would not likely have a discernible pattern, and the comparison between names might be "more expensive" because some names can be common.
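To make the "more expensive comparison" point concrete, the following Python sketch counts how many character positions a name comparison has to look at. The record layout (id, name) and the helper name sort_counting_chars are assumptions for illustration, and the counting comparator is only a stand-in for whatever string comparison a real sort routine would use.

```python
from functools import cmp_to_key

def sort_counting_chars(records):
    """Sort (id, name) records by name.

    Returns (sorted_records, examined), where `examined` is the total
    number of character positions inspected across all comparisons.
    """
    examined = 0

    def compare(a, b):
        nonlocal examined
        x, y = a[1], b[1]
        n = min(len(x), len(y))
        i = 0
        while i < n and x[i] == y[i]:
            i += 1  # skip over the shared prefix
        if i < n:
            examined += i + 1  # shared prefix plus the differing position
            return -1 if x[i] < y[i] else 1
        examined += i  # one name is a prefix of the other (or they are equal)
        return (len(x) > len(y)) - (len(x) < len(y))

    return sorted(records, key=cmp_to_key(compare)), examined

# Names that share a prefix cost more per comparison than names that do not.
_, cost_close = sort_counting_chars([(1, "Smyth"), (2, "Smith")])
_, cost_far = sort_counting_chars([(1, "Smith"), (2, "Adams")])
```

Here cost_close ends up larger than cost_far, since "Smyth" and "Smith" agree on their first two letters while "Smith" and "Adams" differ immediately.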

We can do "more realistic" for that type of scenario, if you'd like. I could look at a distribution for last names/first names and generate fake names to sort in a reasonable approximation of a distribution of real names. I'm not certain the outcome would change that much.
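A minimal sketch of what that could look like in Python, assuming weighted random draws from small name tables. The names and weights below are invented placeholders, not real frequency data, and make_fake_names is a hypothetical helper.

```python
import random

# Hypothetical relative frequencies for illustration only, not census data.
LAST_NAMES = {"Smith": 8, "Johnson": 6, "Williams": 5, "Nguyen": 2, "Zielinski": 1}
FIRST_NAMES = {"James": 5, "Mary": 5, "Wei": 2, "Priya": 2, "Oluwaseun": 1}

def make_fake_names(count, seed=0):
    """Return `count` names as "Last, First", drawn by weight."""
    rng = random.Random(seed)  # fixed seed keeps the data set reproducible
    lasts = rng.choices(list(LAST_NAMES),
                        weights=list(LAST_NAMES.values()), k=count)
    firsts = rng.choices(list(FIRST_NAMES),
                         weights=list(FIRST_NAMES.values()), k=count)
    return [f"{last}, {first}" for last, first in zip(lasts, firsts)]

names = make_fake_names(10_000)
names_sorted = sorted(names)
```

With weights like these, common surnames repeat often, so many comparisons have to get past the last name before they find a difference.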
