On Thu, Oct 6, 2016 at 11:38 AM, Robert Haas <robertmh...@gmail.com> wrote:

> Next, I tried lowering the scale factor to something that fits in
> shared buffers.  Here are the results at scale factor 300:
>      14  Lock            | tuple
>      22  LWLockTranche   | lock_manager
>      39  LWLockNamed     | WALBufMappingLock
>     331  LWLockNamed     | CLogControlLock
>     461  LWLockNamed     | ProcArrayLock
>     536  Lock            | transactionid
>     552  Lock            | extend
>     716  LWLockTranche   | buffer_content
>     763  LWLockNamed     | XidGenLock
>    2113  LWLockNamed     | WALWriteLock
>    6190  LWLockTranche   | wal_insert
>   25002  Client          | ClientRead
>   78466                  |
> tps = 27651.562835 (including connections establishing)
> Obviously, there's a vast increase in TPS, and the backends seem to
> spend most of their time actually doing work.  ClientRead is now the
> overwhelmingly dominant wait event, although wal_insert and
> WALWriteLock contention is clearly still a significant problem.
> Contention on other locks is apparently quite rare.  Notice that
> client reads are really significant here - more than 20% of the time
> we sample what a backend is doing, it's waiting for the next query.
> It seems unlikely that performance on this workload can be improved
> very much by optimizing anything other than WAL writing, because no
> other wait event appears in more than 1% of the samples.  It's not
> clear how much of the WAL-related stuff is artificial lock contention
> and how much of it is the limited speed at which the disk can rotate.

What happens if you turn fsync off?  Once an xlog file is fully written, it
is immediately fsynced, even if the backend is holding WALWriteLock or
wal_insert (or both) at the time, and even if synchronous_commit is off.
Assuming this machine has a BBU so that it doesn't have to wait for disk
rotation, fsyncs are still expensive, because the kernel has to find all the
dirty data and get it sent over to the BBU, all while the locks are held.
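As a sketch of how I'd run that experiment (the data directory path and
database name are placeholders; fsync=off is strictly a benchmarking
setting, since it risks corruption on a crash):

```shell
# Restart the server with fsync disabled for this benchmark run only.
pg_ctl -D /path/to/pgdata restart -o "-c fsync=off"

# Re-run the same TPC-B-like workload and compare the wait-event profile.
pgbench -c 32 -j 32 -T 300 -P 10 bench
```

If WALWriteLock and wal_insert waits largely disappear, that points at the
fsyncs themselves rather than pure lock contention.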


> Second, ClientRead becomes a bigger and bigger issue as the number of
> clients increases; by 192 clients, it appears in 45% of the samples.
> That basically means that pgbench is increasingly unable to keep up
> with the server; for whatever reason, it suffers more than the server
> does from the increasing lack of CPU resources.

I would be careful about that interpretation.  If you asked pgbench, it
would probably have the opposite opinion.

The backend tosses its response at the kernel (which will never block,
because the pgbench responses are all small and the kernel will buffer
them) and then goes into ClientRead.  After the backend goes into ClientRead,
the kernel needs to find and wake up pgbench and deliver the response, and
pgbench has to receive and process it.  Only then does pgbench construct
a new query and pass it back to the kernel.  (I've toyed before with having
pgbench construct the next query while it is waiting for the response to the
previous one, but that didn't seem promising, and much of pgbench has been
rewritten since then.)  Then the kernel has to find and wake up the
backend and deliver the new query.  So for a reasonable chunk of the time
that the server thinks it is waiting for the client, the client also thinks
it is waiting for the server.

I think we need to come up with some benchmarking queries which get more
work done per round-trip to the database.  And build them into the binary,
because otherwise people won't use them as much as they should if they have
to pass "-f" files around on mailing lists and blog posts.  For example,
we could wrap the five statements of the TPC-B-like transaction into a
single function which takes aid, bid, tid, and delta as arguments.  And
presumably we could drop the other two statements (BEGIN and COMMIT) as
well and rely on autocommit to get that job done.  So we could go from
seven statements per transaction to one.
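Something like this, as a sketch (the function name and signature are my
own invention; the bodies are the standard TPC-B-like statements that
pgbench issues):

```sql
-- Hypothetical single-round-trip version of the TPC-B-like transaction.
CREATE FUNCTION pgbench_tpcb(p_aid int, p_bid int, p_tid int, p_delta int)
RETURNS int AS $$
DECLARE
    v_balance int;
BEGIN
    UPDATE pgbench_accounts SET abalance = abalance + p_delta WHERE aid = p_aid;
    SELECT abalance INTO v_balance FROM pgbench_accounts WHERE aid = p_aid;
    UPDATE pgbench_tellers  SET tbalance = tbalance + p_delta WHERE tid = p_tid;
    UPDATE pgbench_branches SET bbalance = bbalance + p_delta WHERE bid = p_bid;
    INSERT INTO pgbench_history (tid, bid, aid, delta, mtime)
        VALUES (p_tid, p_bid, p_aid, p_delta, now());
    RETURN v_balance;
END;
$$ LANGUAGE plpgsql;
```

A custom script would then be the single line
"SELECT pgbench_tpcb(:aid, :bid, :tid, :delta);", with autocommit
supplying the transaction boundaries.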

> Third,
> Lock/transactionid and Lock/tuple become more and more common as the
> number of clients increases; these happen when two different pgbench
> threads decide to hit the same row at the same time.  Due to the
> birthday paradox this increases pretty quickly as the number of
> clients ramps up, but it's not really a server issue: there's nothing
> the server can do about the possibility that two or more clients pick
> the same random number at the same time.
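The birthday-paradox effect is easy to quantify.  A back-of-the-envelope
sketch (my own illustration, not pgbench code; at scale factor 300 there
are 300 pgbench_branches rows, and each client's branch UPDATE picks one
uniformly at random):

```python
def collision_probability(rows, clients):
    """Probability that at least two of `clients` concurrent transactions
    pick the same row out of `rows` equally likely choices."""
    p_all_distinct = 1.0
    for i in range(clients):
        p_all_distinct *= (rows - i) / rows
    return 1.0 - p_all_distinct

if __name__ == "__main__":
    # Collision odds on 300 branch rows as client count ramps up.
    for clients in (8, 32, 96, 192):
        print(clients, round(collision_probability(300, clients), 4))
```

Even well before 192 clients the probability of some overlap approaches 1,
which matches the growth in Lock/transactionid and Lock/tuple waits.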

What I have done in the past is chop a zero off from:

#define naccounts   100000

and recompile pgbench.  Then you can increase the scale factor so that you
have less contention on pgbench_branches while still fitting the data in
shared_buffers, or in RAM.
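Roughly like this (a sketch; the file lives under src/bin/pgbench/ in
recent source trees, but under contrib/pgbench/ in older ones):

```shell
# Shrink each branch from 100,000 accounts to 10,000 and rebuild pgbench.
sed -i 's/\(#define naccounts[[:space:]]*\)100000/\110000/' \
    src/bin/pgbench/pgbench.c
make -C src/bin/pgbench install
```

A given scale factor then produces ten times as many branches per gigabyte
of account data, diluting the branch-row contention.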
