On Thu, Aug 7, 2014 at 1:09 AM, Peter Geoghegan <p...@heroku.com> wrote:
> On Wed, Aug 6, 2014 at 10:36 PM, Peter Geoghegan <p...@heroku.com> wrote:
>> This *almost* applies to patched Postgres if you pick a benchmark that
>> is very sympathetic to my patch. To my surprise, work_mem = '10MB'
>> (which results in an external tape sort) is sometimes snapping at the
>> heels of a work_mem = '5GB' setting (which results in an in-memory
>> quicksort).
>
> Note that this was with a default temp_tablespaces setting that wrote
> temp files on my home partition/SSD. With a /dev/shm/ temp tablespace,
> tape sort edges ahead of quicksort by a couple of hundred milliseconds
> for this test case. It's actually faster.

I decided to do a benchmark of a very large, sympathetic sort. Again,
this was with 8 byte random text strings, but this time with a total
of 500 million tuples. This single-column table is about 21 GB. This
ran on a dedicated server with 32 GB of ram. I took all the usual
precautions (the cache was warmed, tuples were frozen, etc).

Master took 02:02:20.5561 to sort the data with a 10 MB work_mem
setting, without a ramdisk in temp_tablespaces. With a
temp_tablespaces /dev/shm ramdisk, there was only a very small
improvement that left total execution time at 02:00:58.51878 (while a
later repeat attempt took 02:00:41.385937) -- an improvement hardly
worth bothering with.

Patched Postgres took 00:16:13.228053, again with a work_mem of 10 MB,
but no ramdisk. When I changed temp_tablespaces to use the same
ramdisk, this went down to 00:11:58.77301, a significant improvement.
This is clearly because the data directory was on a filesystem on
spinning disks, and more I/O bandwidth (real or simulated) helps
external sorts. Since this machine only has 32GB of ram, and a
significant proportion of that must be used for shared_buffers (8GB)
and the OS cache, I think there is a fair chance that a more capable
I/O subsystem could be used to get appreciably better results using
otherwise identical hardware.

Comparing like with like, the ramdisk patched run was over 10 times
faster than master with the same ramdisk. That disparity is obviously
very significant in itself, but so is the fact that the ramdisk helped
the patched run substantially while barely helping master at all.
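For anyone who hasn't followed the patch closely, the general shape of
the abbreviated-keys technique can be sketched like this (a toy Python
sketch with illustrative names only, nothing like the actual C
implementation): sort on a fixed-size key prefix packed into an
integer, and fall back to a full comparison only when the abbreviated
keys tie.

```python
import functools

ABBREV_LEN = 8  # bytes of each key packed into the abbreviated form


def abbreviate(key: bytes) -> int:
    # Pack the first ABBREV_LEN bytes big-endian so that integer
    # comparison of abbreviations agrees with byte-wise (memcmp-style)
    # comparison of the prefixes. Short keys are padded with 0x00,
    # which sorts before any real byte, matching bytes ordering.
    prefix = key[:ABBREV_LEN].ljust(ABBREV_LEN, b"\x00")
    return int.from_bytes(prefix, "big")


def compare(a: bytes, b: bytes) -> int:
    aa, ab = abbreviate(a), abbreviate(b)
    if aa != ab:
        # The cheap integer comparison resolves the vast majority of
        # comparisons when keys are random in their first bytes.
        return -1 if aa < ab else 1
    # Abbreviated keys tied: fall back to comparing the full keys.
    return (a > b) - (a < b)


def abbrev_sort(keys):
    return sorted(keys, key=functools.cmp_to_key(compare))
```

With random keys, almost every comparison is resolved by the cheap
integer comparison; skewed inputs that share leading bytes force the
fallback path to fire far more often.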

I'm done with sympathetic cases now. I welcome unsympathetic ones, or
more broadly representative large tests. It's hard to come up with a
benchmark that isn't either sympathetic or very sympathetic, or a
pathologically bad case. There is actually a simple enough C program for
generating test input, "gensort", which is used by sortbenchmark.org:

http://www.ordinal.com/gensort.html

(I suggest building without support for "the SUMP Pump library", by
modifying the Makefile before building)

What's interesting about gensort is that there is a "skew" option.
Without it, it generates totally random ASCII keys. With it, keys tend
to share a certain amount of redundancy in their first few bytes. This
is intended to limit the
effectiveness of abbreviation-type optimizations for sortbenchmark.org
"Daytona Sort" entrants:

http://sortbenchmark.org/FAQ-2014.html ("Indy vs. Daytona")

Daytona sort is basically a benchmark that focuses on somewhat
commercially representative data (often ASCII data), with sneaky
data-aware tricks forbidden, as opposed to totally artificial
uniformly distributed random binary data, which is acceptable for
their "Indy sort" benchmark that gets to use every trick in the book.
They note here that Daytona entrants should "not be overly dependent
on the uniform and random distribution of key values in the sort
input". They are allowed to be somewhat dependent on it, though - for
one thing, keys are always exactly 10 bytes. They must merely "be able
to sort the alternate, skewed-keys input data set in an elapsed time
of no more than twice the elapsed time of the benchmark entry".

It might be useful for Robert or other reviewers to hold the
abbreviated keys patch to a similar standard, possibly by using
gensort, or their own modified version. I've shown a sympathetic case
that is over 10 times faster, with some other less sympathetic cases
that were still pretty good, since tie-breakers were generally able to
get away with a cheap memcmp(). There have also been some tests showing
pathologically bad cases for the optimization. The middle ground has now
become interesting, and gensort might offer a half-reasonable way to
generate tests that are balanced. I've looked at the gensort code, and
it seems easy enough to understand and modify for our purposes. You
might want to look at multiple cases with this constant modified, for
example:

#define SKEW_BYTES     6

This constant controls the number of leading bytes that come from a
table of skew bytes.
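To make the effect of SKEW_BYTES concrete, here is a toy sketch
(illustrative names and table contents, not gensort's actual
algorithm) of how drawing leading bytes from a small skew table
collapses the space of distinct key prefixes, which is exactly what
blunts an abbreviation-type optimization:

```python
import random

SKEW_BYTES = 6  # mirrors the gensort constant above
# Hypothetical small table of 6-byte skew prefixes; gensort's real
# table is different, but the principle is the same.
SKEW_TABLE = [b"SORTAA", b"SORTBB", b"SORTCC", b"SORTDD"]


def make_key(rng, skew: bool, length: int = 10) -> bytes:
    # gensort-style keys are exactly 10 bytes of printable ASCII.
    if skew:
        head = rng.choice(SKEW_TABLE)
        tail = bytes(rng.randrange(32, 127)
                     for _ in range(length - SKEW_BYTES))
        return head + tail
    return bytes(rng.randrange(32, 127) for _ in range(length))


def distinct_prefixes(keys, n):
    # Fewer distinct n-byte prefixes means more abbreviated-key ties.
    return len({k[:n] for k in keys})


rng = random.Random(42)
uniform = [make_key(rng, skew=False) for _ in range(10000)]
skewed = [make_key(rng, skew=True) for _ in range(10000)]
```

With uniform keys, essentially every 6-byte prefix is unique; with
skewed keys there are at most len(SKEW_TABLE) distinct prefixes, so
nearly every comparison has to fall through to the full-key tie-break.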
-- 
Peter Geoghegan

