I'm still getting the same sort of pauses waiting for input with your v11. This is a pretty frustrating problem; I've spent about two days so far trying to narrow down how it happens. I've attached the test program I'm using. It seems related to my picking a throttled rate that's close to (but below) the maximum possible on your system. I'm using 10,000 on a system that can do about 16,000 TPS when running pgbench in debug mode.

This problem is 100% reproducible here; happens every time. This is a laptop running Mac OS X. It's possible the problem is specific to that platform. I'm doing all my tests with the database itself setup for development, with debug and assertions on. The lag spikes seem smaller without assertions on, but they are still there.

Here's a sample:

transaction type: SELECT only
scaling factor: 10
query mode: simple
number of clients: 25
number of threads: 1
duration: 30 s
number of transactions actually processed: 301921
average transaction lag: 1.133 ms (max 137.683 ms)
tps = 10011.527543 (including connections establishing)
tps = 10027.834189 (excluding connections establishing)

And those slow ones are all at the end of the latency log file, as shown in column 3 here:

22 11953 3369 0 1371578126 954881
23 11926 3370 0 1371578126 954918
3 12238 30310 0 1371578126 984634
7 12205 30350 0 1371578126 984742
8 12207 30359 0 1371578126 984792
11 12176 30325 0 1371578126 984837
13 12074 30292 0 1371578126 984882
0 12288 175452 0 1371578127 126340
9 12194 171948 0 1371578127 126421
12 12139 171915 0 1371578127 126466
24 11876 175657 0 1371578127 126507

Note that there are two long pauses here, which happens maybe half of the time. There's a 27 ms pause and then a 145 ms one.

The exact sequence for when the pauses show up is:

-All remaining clients have sent their SELECT statement and are waiting for a response. They are not sleeping, they're waiting for the server to return a query result. -A client receives part of the data it wants, but there is still data left. This is the path through pgbench.c where the "if (PQisBusy(st->con))" test is true after receiving some information. I hacked up some logging that distinguishes those as a "client %d partial receive" to make this visible. -A select() call is made with no timeout, so it can only be satisfied by more data being returned. -Around ~100ms (can be less, can be more) goes by before that select() returns more data to the client, where normally it would happen in ~2ms.

You were concerned about a possible bug in the timeout code. If you hack up the select statement to show some state information, the setup for the pauses at the end always looks like this:

calling select min_usec=9223372036854775807 sleeping=0

When no one is sleeping, the timeout becomes infinite, so only returning data will break it. This is intended behavior though. This exact same sequence and select() call parameters happen in pgbench code every time at the end of a run. But clients without the throttling patch applied never seem to get into the state where they pause for so long, waiting for one of the active clients to receive the next bit of result.

I don't think the st->listen related code has anything to do with this either. That flag is only used to track when clients have completed sending their first query over to the server. Once reaching that point once, afterward they should be "listening" for results each time they exit the doCustom() code. st->listen goes to 1 very soon after startup and then it stays there, and that logic seems fine too.

--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
#!/bin/bash
make
make install
rm pgbench_log* nohup.out
#nohup pgbench -c 25 -T 30 -l -d -f select.sql -s 10 pgbench
nohup pgbench -c 25 -T 30 -l -d -S -R 10000 pgbench
tail -n 50 pgbench_log*

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to