On Sunday, 17 November 2013 at 07:19:26 UTC, Andrei Alexandrescu wrote:
On 11/16/13 9:21 PM, Chris Cain wrote:
That said, it might also be reproduced "well enough" using a random generator to create similar strings to sort, but the basic idea is there. I just like using real genomes for performance testing things :)

I am hoping for some more representative corpora, along the lines of http://sortbenchmark.org/. Some data that we can use as good proxies for typical application usage.

Andrei

I think I get what you're saying, but sortbenchmark.org uses completely pseudorandom (but reproducable) entries that I don't think are representative of real data either:

(using gensort -a minus the verification columns)
---
AsfAGHM5om
~sHd0jDv6X
uI^EYm8s=|
Q)JN)R9z-L
o4FoBkqERn
*}-Wz1;TD-
0fssx}~[oB
...
---

Most places use very fake data as proxies for real data. It's better to have something somewhat structured and choose data that, despite not being real data, stresses the benchmark in a unique way.

I'm not suggesting my benchmark be the only one; if we're going to use pseudorandom data (I'm not certain we could actually get "realistic data" that would serve us that much better) we might as well have different test cases that stress the sort routine in different ways. Obviously, using the real genome would be preferable to generating some (since it's actually truly "real" data, just used in an unorthodox way) but there's a disadvantage to attaching a 4.6MB file to a benchmarking setup. Especially if more might come.

Anyway, it's a reasonable representation of "data that has no discernable order that can occasionally take some time to compare." Think something like sorting a list of customer records by name. If they're ordered by ID, then the names would not likely have a discernable pattern and the comparison between names might be "more expensive" because some names can be common.

We can do "more realistic" for that type of scenario, if you'd like. I could look at a distribution for last names/first names and generate fake names to sort in a reasonable approximation of a distribution of real names. I'm not certain the outcome would change that much.

Reply via email to