On 10/31/2016 02:51 PM, Amit Kapila wrote:
On Mon, Oct 31, 2016 at 12:02 AM, Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:

On 10/27/2016 01:44 PM, Amit Kapila wrote:

I've read that analysis, but I'm not sure I see how it explains the "zig
zag" behavior. I do understand that shifting the contention to some other
(already busy) lock may negatively impact throughput, or that the
group_update may result in updating multiple clog pages, but I don't
understand two things:

(1) Why this should result in the fluctuations we observe in some of the
cases. For example, why should we see 150k tps on 72 clients, then drop to
92k with 108 clients, then back to 130k on 144 clients, then 84k on 180
clients etc. That seems fairly strange.

I don't think hitting multiple clog pages has much to do with
client-count.  However, we can wait to see your further detailed test

(2) Why this should affect all three patches, when only group_update has to
modify multiple clog pages.

No, all three patches can be affected due to multiple clog pages.
Read the second paragraph ("I think one of the probable reasons that could
happen for both the approaches") in the same e-mail [1].  It is basically
due to frequent release-and-reacquire of locks.

On logged tables it usually looks like this (i.e. modest increase for
client counts at the expense of significantly higher variability):


What variability are you referring to in those results?

Good question. What I mean by "variability" is how stable the tps is during
the benchmark (when measured on per-second granularity). For example, let's
run a 10-second benchmark, measuring the number of transactions committed each
second.

Then all those runs do 1000 tps on average:

  run 1: 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000
  run 2: 500, 1500, 500, 1500, 500, 1500, 500, 1500, 500, 1500
  run 3: 0, 2000, 0, 2000, 0, 2000, 0, 2000, 0, 2000
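
To make the distinction concrete, here is a minimal sketch (not part of the actual benchmark tooling; the `runs` dictionary and the `summarize` helper are made up for illustration) that computes the mean and the standard deviation of the per-second tps for the three hypothetical runs above. The coefficient of variation is just one possible way to put a number on "variability".

```python
from statistics import mean, pstdev

# Per-second tps samples for the three hypothetical 10-second runs above.
runs = {
    "run 1": [1000] * 10,
    "run 2": [500, 1500] * 5,
    "run 3": [0, 2000] * 5,
}

def summarize(tps):
    """Return (mean, population stddev, coefficient of variation)."""
    avg = mean(tps)
    sd = pstdev(tps)
    cv = sd / avg  # assumes avg > 0, which holds for all three runs
    return avg, sd, cv

for name, tps in runs.items():
    avg, sd, cv = summarize(tps)
    print(f"{name}: mean={avg:.0f} tps  stddev={sd:.0f}  cv={cv:.2f}")
```

All three runs have the same mean, but the standard deviation grows from 0 to 500 to 1000, which is the kind of difference the average tps alone hides.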

Generally, such behaviours are seen due to writes. Are WAL and DATA
on same disk in your tests?

Yes, there's one RAID device on 10 SSDs, with 4GB of the controller. I've done some tests and it easily handles > 1.5GB/s in sequential writes, and >500MB/s in sustained random writes.

Also, let me point out that most of the tests were done so that the whole data set fits into shared_buffers, and with no checkpoints during the runs (so no writes to data files should really happen).

For example these tests were done on scale 3000 (45GB data set) with 64GB shared buffers:

[a] http://tvondra.bitbucket.org/index2.html#pgbench-3000-unlogged-sync-noskip-64

[b] http://tvondra.bitbucket.org/index2.html#pgbench-3000-logged-async-noskip-64

and I could show similar cases with scale 300 on 16GB shared buffers.
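
As a back-of-the-envelope check of the "fits into shared_buffers" claim, here is a rough sketch. The per-scale-unit size is derived only from the figure quoted above (scale 3000 is about 45GB), and the helper names are made up for illustration. It ignores growth during the run, so treat it as an estimate.

```python
# Assumption: size scales linearly with the pgbench scale factor, using
# the figure quoted above (scale 3000 ~ 45GB), i.e. ~0.015GB per unit.
GB_PER_SCALE_UNIT = 45 / 3000

def dataset_size_gb(scale):
    """Estimated initial pgbench data set size in GB."""
    return scale * GB_PER_SCALE_UNIT

def fits_in_shared_buffers(scale, shared_buffers_gb):
    """True if the estimated data set is no larger than shared_buffers."""
    return dataset_size_gb(scale) <= shared_buffers_gb

# The two configurations mentioned above: (scale, shared_buffers in GB).
for scale, shared_buffers_gb in [(3000, 64), (300, 16)]:
    size = dataset_size_gb(scale)
    ok = fits_in_shared_buffers(scale, shared_buffers_gb)
    print(f"scale {scale}: ~{size:.1f}GB vs {shared_buffers_gb}GB -> fits: {ok}")
```

By this estimate both configurations leave plenty of headroom in shared_buffers.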

In those cases, there's very little contention between WAL and the rest of the database (in terms of I/O).

And moreover, this setup (single device for the whole cluster) is very common, so we can't just neglect it.

But my main point here really is that the trade-off in those cases may not really be all that great, because you get the best performance at 36/72 clients, and then the tps drops and variability increases. At least not right now, before tackling contention on the WAL lock (or whatever lock becomes the bottleneck).


Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
