On 2017-06-21 18:07:21 -0700, Andres Freund wrote:
> On 2017-06-22 09:03:05 +0800, Craig Ringer wrote:
> > On 22 June 2017 at 08:29, Andres Freund <and...@anarazel.de> wrote:
> > 
> > > I.e. we're doing tiny write send() syscalls (they should be coalesced)
> > 
> > That's likely worth doing, but can probably wait for a separate patch.
> 
> I don't think so, we should get this right, it could have API influence.
> 
> 
> > The kernel will usually do some packet aggregation unless we use
> > TCP_NODELAY (which we don't and shouldn't), and the syscall overhead
> > is IMO not worth worrying about just yet.
> 
> 1)
> 			/*
> 			 * Select socket options: no delay of outgoing data for
> 			 * TCP sockets, nonblock mode, close-on-exec. Fail if any
> 			 * of this fails.
> 			 */
> 			if (!IS_AF_UNIX(addr_cur->ai_family))
> 			{
> 				if (!connectNoDelay(conn))
> 				{
> 					pqDropConnection(conn, true);
> 					conn->addr_cur = addr_cur->ai_next;
> 					continue;
> 				}
> 			}
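
(For reference, connectNoDelay() is essentially just a setsockopt() call;
roughly, from fe-connect.c:

	static int
	connectNoDelay(PGconn *conn)
	{
		int		on = 1;

		if (setsockopt(conn->sock, IPPROTO_TCP, TCP_NODELAY,
					   (char *) &on, sizeof(on)) < 0)
		{
			/* error message appended to conn->errorMessage, elided here */
			return 0;
		}
		return 1;
	}

so with Nagle disabled on every TCP connection, the kernel won't merge our
tiny send()s for us; we'd have to coalesce them ourselves.)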
> 
> 2) Even if nodelay weren't set, this can still lead to smaller packets
>    being sent, because you end up sending normal-sized TCP packets
>    rather than jumbo frames, even where jumbo frames are configured
>    (which is pretty common these days).
> 
> 3) Syscall overhead is actually quite significant.
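
To illustrate 3), here's a toy comparison (not part of the patch, and
absolute numbers will vary wildly by machine) of many tiny write()s
against one coalesced write() of the same bytes:

	#include <stdio.h>
	#include <string.h>
	#include <time.h>
	#include <fcntl.h>
	#include <unistd.h>

	static long
	elapsed_ns(struct timespec *a, struct timespec *b)
	{
		return (b->tv_sec - a->tv_sec) * 1000000000L +
			(b->tv_nsec - a->tv_nsec);
	}

	int
	main(void)
	{
		char		buf[64 * 1024];
		int			fd = open("/dev/null", O_WRONLY);
		struct timespec t0, t1;
		int			i;

		memset(buf, 'x', sizeof(buf));

		/* 1024 tiny 64-byte writes: one syscall each */
		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (i = 0; i < 1024; i++)
			write(fd, buf + i * 64, 64);
		clock_gettime(CLOCK_MONOTONIC, &t1);
		printf("1024 tiny writes: %ld ns\n", elapsed_ns(&t0, &t1));

		/* the same 64kB in a single syscall */
		clock_gettime(CLOCK_MONOTONIC, &t0);
		write(fd, buf, sizeof(buf));
		clock_gettime(CLOCK_MONOTONIC, &t1);
		printf("one big write:    %ld ns\n", elapsed_ns(&t0, &t1));

		close(fd);
		return 0;
	}

Even against /dev/null, i.e. with no networking involved at all, the
per-call overhead adds up.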

Proof of the pudding:

pgbench of 10 pgbench select statements in a batch:
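
(The script file itself isn't attached; going from memory of Daniel's
pgbench patch it should be roughly ten copies of the standard select-only
query wrapped in the batch meta-commands, i.e. something like

	\set aid random(1, 100000 * :scale)
	\beginbatch
	SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
	... nine more of the same ...
	\endbatch

but check the actual file; the \beginbatch/\endbatch names are quoted
from memory.)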

as submitted by Daniel:

pgbench -h localhost -M prepared -S -n -c 16 -j 16 -T 10000 -P 1 -f ~/tmp/pgbench-select-only-batch.sq
progress: 1.0 s, 24175.5 tps, lat 0.647 ms stddev 0.782
progress: 2.0 s, 27737.6 tps, lat 0.577 ms stddev 0.625
progress: 3.0 s, 28853.3 tps, lat 0.554 ms stddev 0.619
progress: 4.0 s, 26660.8 tps, lat 0.600 ms stddev 0.776
progress: 5.0 s, 30023.8 tps, lat 0.533 ms stddev 0.484
progress: 6.0 s, 29959.3 tps, lat 0.534 ms stddev 0.450
progress: 7.0 s, 29944.9 tps, lat 0.534 ms stddev 0.536
progress: 8.0 s, 30137.7 tps, lat 0.531 ms stddev 0.533
progress: 9.0 s, 30285.2 tps, lat 0.528 ms stddev 0.479
progress: 10.0 s, 30228.7 tps, lat 0.529 ms stddev 0.460
progress: 11.0 s, 29921.4 tps, lat 0.534 ms stddev 0.613
progress: 12.0 s, 29982.4 tps, lat 0.533 ms stddev 0.510
progress: 13.0 s, 29247.4 tps, lat 0.547 ms stddev 0.526
progress: 14.0 s, 28757.3 tps, lat 0.556 ms stddev 0.635
progress: 15.0 s, 29035.3 tps, lat 0.551 ms stddev 0.523
^C

sample vmstat:
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
19  0      0 488992 787332 23558676    0    0     0     0 9720 455099 65 35  0  0  0

(i.e. ~450k context switches)

hackily patched:
pgbench -h localhost -M prepared -S -n -c 16 -j 16 -T 10000 -P 1 -f ~/tmp/pgbench-select-only-batch.sq
progress: 1.0 s, 40545.2 tps, lat 0.386 ms stddev 0.625
progress: 2.0 s, 48158.0 tps, lat 0.332 ms stddev 0.277
progress: 3.0 s, 50125.7 tps, lat 0.319 ms stddev 0.204
progress: 4.0 s, 50740.6 tps, lat 0.315 ms stddev 0.250
progress: 5.0 s, 50795.6 tps, lat 0.315 ms stddev 0.246
progress: 6.0 s, 51195.6 tps, lat 0.312 ms stddev 0.207
progress: 7.0 s, 50746.7 tps, lat 0.315 ms stddev 0.264
progress: 8.0 s, 50619.1 tps, lat 0.316 ms stddev 0.250
progress: 9.0 s, 50619.4 tps, lat 0.316 ms stddev 0.228
progress: 10.0 s, 46967.8 tps, lat 0.340 ms stddev 0.499
progress: 11.0 s, 50480.1 tps, lat 0.317 ms stddev 0.239
progress: 12.0 s, 50242.5 tps, lat 0.318 ms stddev 0.286
progress: 13.0 s, 49912.7 tps, lat 0.320 ms stddev 0.266
progress: 14.0 s, 49841.7 tps, lat 0.321 ms stddev 0.271
progress: 15.0 s, 49807.1 tps, lat 0.321 ms stddev 0.248
^C

sample vmstat:
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
23  0      0 482008 787312 23558996    0    0     0     0 8219 105097 87 14  0  0  0

(i.e. ~100k context switches)

That's *localhost*.


It's entirely possible that I've screwed something up here; I didn't test
it beyond running pgbench. But the amount of data sent and received looks
about the same, just in fewer syscalls.
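
(If someone wants to double-check the syscall counts more directly than
via vmstat's cs column, something like

	strace -cf -e trace=network \
		pgbench -h localhost -M prepared -S -n -c 1 -T 10 \
		-f ~/tmp/pgbench-select-only-batch.sq

should show the per-syscall breakdown before and after the patch.)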

Greetings,

Andres Freund
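
The hack below does two things: in pgbench, only call PQconsumeInput()
when PQisBusy() says we don't already have a complete result buffered
(the old code issued a recv() on every pass through the state machine,
even when one big read had already buffered several batch results); and
in libpq, skip the pqFlush() at the end of PQsendQueryGuts() while in
batch mode, so queued queries accumulate in the output buffer rather
than each triggering its own send(), leaving the flush to happen later.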
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index e498ad61e5..aeed1649ce 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -2352,14 +2352,17 @@ doCustom(TState *thread, CState *st, StatsData *agg)
 				if (debug)
 					fprintf(stderr, "client %d receiving\n", st->id);
 
-				if (!PQconsumeInput(st->con))
-				{				/* there's something wrong */
-					commandFailed(st, "perhaps the backend died while processing");
-					st->state = CSTATE_ABORTED;
-					break;
-				}
 				if (PQisBusy(st->con))
-					return;		/* don't have the whole result yet */
+				{
+					if (!PQconsumeInput(st->con))
+					{				/* there's something wrong */
+						commandFailed(st, "perhaps the backend died while processing");
+						st->state = CSTATE_ABORTED;
+						break;
+					}
+					if (PQisBusy(st->con))
+						return;		/* don't have the whole result yet */
+				}
 
 				if (PQbatchStatus(st->con) == PQBATCH_MODE_ON &&
 					!PQbatchProcessQueue(st->con))
diff --git a/src/interfaces/libpq/fe-exec.c b/src/interfaces/libpq/fe-exec.c
index 4cb87a4393..210410a92c 100644
--- a/src/interfaces/libpq/fe-exec.c
+++ b/src/interfaces/libpq/fe-exec.c
@@ -1725,12 +1725,15 @@ PQsendQueryGuts(PGconn *conn,
 	else
 		*last_query = NULL;
 
-	/*
-	 * Give the data a push.  In nonblock mode, don't complain if we're unable
-	 * to send it all; PQgetResult() will do any additional flushing needed.
-	 */
-	if (pqFlush(conn) < 0)
-		goto sendFailed;
+	if (conn->batch_status == PQBATCH_MODE_OFF)
+	{
+		/*
+		 * Give the data a push.  In nonblock mode, don't complain if we're unable
+		 * to send it all; PQgetResult() will do any additional flushing needed.
+		 */
+		if (pqFlush(conn) < 0)
+			goto sendFailed;
+	}
 
 	/* OK, it's launched! */
 	if (conn->batch_status != PQBATCH_MODE_OFF)
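
To see the net effect from the caller's side: with the fe-exec.c change,
each queued query just appends to libpq's output buffer, and one explicit
flush pushes the whole batch out in a single send(). A minimal sketch
(this assumes batch mode is already active via the batch API from this
patch series; entering it is elided since the entry point's name differs
between patch versions):

	int		i;

	/* queue ten queries; with the change above none of these calls send() */
	for (i = 0; i < 10; i++)
	{
		if (!PQsendQueryParams(conn, "SELECT 1",
							   0, NULL, NULL, NULL, NULL, 0))
			break;				/* handle error */
	}

	/* one coalesced send() for the whole batch */
	if (PQflush(conn) < 0)
		/* handle error */;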