On Sunday, 17 November 2013 at 07:19:26 UTC, Andrei Alexandrescu wrote:
On 11/16/13 9:21 PM, Chris Cain wrote:
That said, it might also be reproduced "well enough" using a random
generator to create similar strings to sort, but the basic idea is
there. I just like using real genomes for performance testing things :)
I am hoping for some more representative corpora, along the
lines of http://sortbenchmark.org/. Some data that we can use
as good proxies for typical application usage.
Andrei
I think I get what you're saying, but sortbenchmark.org uses
completely pseudorandom (but reproducible) entries that I don't
think are representative of real data either:
(using gensort -a minus the verification columns)
---
AsfAGHM5om
~sHd0jDv6X
uI^EYm8s=|
Q)JN)R9z-L
o4FoBkqERn
*}-Wz1;TD-
0fssx}~[oB
...
---
Most places use very fake data as proxies for real data. It's
better to have something somewhat structured and choose data
that, despite not being real data, stresses the benchmark in a
unique way.
I'm not suggesting my benchmark be the only one; if we're going
to use pseudorandom data (I'm not certain we could actually get
"realistic data" that would serve us that much better) we might
as well have different test cases that stress the sort routine in
different ways. Obviously, using the real genome would be
preferable to generating one (since it's actually truly "real"
data, just used in an unorthodox way), but there's a disadvantage
to attaching a 4.6 MB file to a benchmarking setup, especially if
more files might follow.
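
To make the "generate it instead of attaching it" option concrete, here is a rough Python sketch of a seeded generator. It is only an illustration, not the benchmark I actually ran: the function name, the seed, and the string count and length are all made up for the example, and uniform random bases are an assumption that real genomes don't satisfy.

```python
import random

def make_genome_like_strings(count, length, seed=2013):
    """Return `count` pseudorandom strings over the DNA alphabet.

    Hypothetical helper: uniform A/C/G/T is an assumption, real genomes
    have skewed base frequencies and repeated regions.
    """
    rng = random.Random(seed)  # private RNG, so the corpus is reproducible
    return ["".join(rng.choices("ACGT", k=length)) for _ in range(count)]

# Build the same corpus on every run, then hand it to the sort under test.
corpus = make_genome_like_strings(count=1000, length=50)
sorted_corpus = sorted(corpus)
```

Since the seed fixes the output, only the few lines of generator code need to ship with the benchmark rather than the data file.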
Anyway, it's a reasonable representation of "data that has no
discernible order that can occasionally take some time to
compare." Think something like sorting a list of customer records
by name. If they're ordered by ID, then the names would likely
have no discernible pattern, and the comparison between names
might be "more expensive" because some names are common, so the
comparison has to look past identical leading characters.
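
As a crude way to see what "more expensive" means here, this Python sketch counts how many character positions a plain left-to-right comparison has to look at before it can decide. The helper name is hypothetical, and real string comparisons are usually optimized well beyond this model, so treat it as an illustration only.

```python
def chars_compared(a, b):
    """Count the character positions a naive lexicographic compare inspects.

    Hypothetical helper: it stops at the first differing position, or at
    the end of the shorter string if one is a prefix of the other.
    """
    inspected = 0
    for x, y in zip(a, b):
        inspected += 1
        if x != y:
            break
    return inspected

# Two customers sharing a common last name: the compare walks the shared prefix.
common = chars_compared("Smith, John", "Smith, Jane")
# Two gensort-style random keys: decided at the first character.
random_keys = chars_compared("AsfAGHM5om", "~sHd0jDv6X")
```

With these inputs the name pair needs 9 positions and the random pair needs 1, which is the gap that purely random keys hide.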
We can do "more realistic" for that type of scenario, if you'd
like. I could look at a distribution for last names/first names
and generate fake names to sort in a reasonable approximation of
a distribution of real names. I'm not certain the outcome would
change that much.
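
If it helps, the fake-name idea could look something like the Python sketch below. The name lists and weights are invented placeholders, not real frequency data, and the function and field names are hypothetical; a real version would load an actual name distribution.

```python
import random

# Placeholder weights, NOT real census figures (assumption for illustration).
LAST_NAMES = {"Smith": 10, "Johnson": 8, "Williams": 7, "Nguyen": 3, "Zielinski": 1}
FIRST_NAMES = {"James": 9, "Mary": 9, "Linda": 5, "Omar": 2, "Yuki": 1}

def make_customers(count, seed=42):
    """Return (id, last, first) records in ID order with weighted random names."""
    rng = random.Random(seed)  # seeded so every run sorts the same input
    lasts = rng.choices(list(LAST_NAMES), weights=list(LAST_NAMES.values()), k=count)
    firsts = rng.choices(list(FIRST_NAMES), weights=list(FIRST_NAMES.values()), k=count)
    return [(cid, last, first) for cid, (last, first) in enumerate(zip(lasts, firsts))]

customers = make_customers(1000)
# Records arrive ordered by ID; the benchmark sorts them by (last, first).
by_name = sorted(customers, key=lambda rec: (rec[1], rec[2]))
```

The weighting is what produces many duplicate and near-duplicate keys, which is the property the customer-record scenario depends on.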