On 2017-06-21 18:07:21 -0700, Andres Freund wrote:
> On 2017-06-22 09:03:05 +0800, Craig Ringer wrote:
> > On 22 June 2017 at 08:29, Andres Freund <and...@anarazel.de> wrote:
> > > I.e. we're doing tiny write send() syscalls (they should be coalesced)
> >
> > That's likely worth doing, but can probably wait for a separate patch.
>
> I don't think so, we should get this right, it could have API influence.
>
> > The kernel will usually do some packet aggregation unless we use
> > TCP_NODELAY (which we don't and shouldn't), and the syscall overhead
> > is IMO not worth worrying about just yet.
>
> 1)
> 	/*
> 	 * Select socket options: no delay of outgoing data for
> 	 * TCP sockets, nonblock mode, close-on-exec. Fail if any
> 	 * of this fails.
> 	 */
> 	if (!IS_AF_UNIX(addr_cur->ai_family))
> 	{
> 		if (!connectNoDelay(conn))
> 		{
> 			pqDropConnection(conn, true);
> 			conn->addr_cur = addr_cur->ai_next;
> 			continue;
> 		}
> 	}
>
> 2) Even if nodelay weren't set, this can still lead to smaller packets
>    being sent, because you start sending normal sized tcp packets,
>    rather than jumbo ones, even if configured (pretty common these
>    days).
>
> 3) Syscall overhead is actually quite significant.
Proof of the pudding: pgbench of 10 pgbench select statements in a batch, as submitted by Daniel:

pgbench -h localhost -M prepared -S -n -c 16 -j 16 -T 10000 -P 1 -f ~/tmp/pgbench-select-only-batch.sq
progress: 1.0 s, 24175.5 tps, lat 0.647 ms stddev 0.782
progress: 2.0 s, 27737.6 tps, lat 0.577 ms stddev 0.625
progress: 3.0 s, 28853.3 tps, lat 0.554 ms stddev 0.619
progress: 4.0 s, 26660.8 tps, lat 0.600 ms stddev 0.776
progress: 5.0 s, 30023.8 tps, lat 0.533 ms stddev 0.484
progress: 6.0 s, 29959.3 tps, lat 0.534 ms stddev 0.450
progress: 7.0 s, 29944.9 tps, lat 0.534 ms stddev 0.536
progress: 8.0 s, 30137.7 tps, lat 0.531 ms stddev 0.533
progress: 9.0 s, 30285.2 tps, lat 0.528 ms stddev 0.479
progress: 10.0 s, 30228.7 tps, lat 0.529 ms stddev 0.460
progress: 11.0 s, 29921.4 tps, lat 0.534 ms stddev 0.613
progress: 12.0 s, 29982.4 tps, lat 0.533 ms stddev 0.510
progress: 13.0 s, 29247.4 tps, lat 0.547 ms stddev 0.526
progress: 14.0 s, 28757.3 tps, lat 0.556 ms stddev 0.635
progress: 15.0 s, 29035.3 tps, lat 0.551 ms stddev 0.523
^C

sample vmstat:
 r  b   swpd   free   buff    cache si so bi bo   in     cs us sy id wa st
19  0      0 488992 787332 23558676  0  0  0  0 9720 455099 65 35  0  0  0

(i.e. ~450k context switches)

hackily patched:

pgbench -h localhost -M prepared -S -n -c 16 -j 16 -T 10000 -P 1 -f ~/tmp/pgbench-select-only-batch.sq
progress: 1.0 s, 40545.2 tps, lat 0.386 ms stddev 0.625
progress: 2.0 s, 48158.0 tps, lat 0.332 ms stddev 0.277
progress: 3.0 s, 50125.7 tps, lat 0.319 ms stddev 0.204
progress: 4.0 s, 50740.6 tps, lat 0.315 ms stddev 0.250
progress: 5.0 s, 50795.6 tps, lat 0.315 ms stddev 0.246
progress: 6.0 s, 51195.6 tps, lat 0.312 ms stddev 0.207
progress: 7.0 s, 50746.7 tps, lat 0.315 ms stddev 0.264
progress: 8.0 s, 50619.1 tps, lat 0.316 ms stddev 0.250
progress: 9.0 s, 50619.4 tps, lat 0.316 ms stddev 0.228
progress: 10.0 s, 46967.8 tps, lat 0.340 ms stddev 0.499
progress: 11.0 s, 50480.1 tps, lat 0.317 ms stddev 0.239
progress: 12.0 s, 50242.5 tps, lat 0.318 ms stddev 0.286
progress: 13.0 s, 49912.7 tps, lat 0.320 ms stddev 0.266
progress: 14.0 s, 49841.7 tps, lat 0.321 ms stddev 0.271
progress: 15.0 s, 49807.1 tps, lat 0.321 ms stddev 0.248
^C

sample vmstat:
 r  b   swpd   free   buff    cache si so bi bo   in     cs us sy id wa st
23  0      0 482008 787312 23558996  0  0  0  0 8219 105097 87 14  0  0  0

(i.e. ~100k context switches)

That's *localhost*.

It's entirely possible that I've screwed something up here; I didn't test it beyond running pgbench. But the data sent and received looks like similar amounts, just in fewer syscalls.

Greetings,

Andres Freund
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index e498ad61e5..aeed1649ce 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -2352,14 +2352,17 @@ doCustom(TState *thread, CState *st, StatsData *agg)
 			if (debug)
 				fprintf(stderr, "client %d receiving\n", st->id);
-			if (!PQconsumeInput(st->con))
-			{				/* there's something wrong */
-				commandFailed(st, "perhaps the backend died while processing");
-				st->state = CSTATE_ABORTED;
-				break;
-			}
 			if (PQisBusy(st->con))
-				return;			/* don't have the whole result yet */
+			{
+				if (!PQconsumeInput(st->con))
+				{				/* there's something wrong */
+					commandFailed(st, "perhaps the backend died while processing");
+					st->state = CSTATE_ABORTED;
+					break;
+				}
+				if (PQisBusy(st->con))
+					return;		/* don't have the whole result yet */
+			}
 
 			if (PQbatchStatus(st->con) == PQBATCH_MODE_ON &&
 				!PQbatchProcessQueue(st->con))
diff --git a/src/interfaces/libpq/fe-exec.c b/src/interfaces/libpq/fe-exec.c
index 4cb87a4393..210410a92c 100644
--- a/src/interfaces/libpq/fe-exec.c
+++ b/src/interfaces/libpq/fe-exec.c
@@ -1725,12 +1725,15 @@ PQsendQueryGuts(PGconn *conn,
 	else
 		*last_query = NULL;
 
-	/*
-	 * Give the data a push.  In nonblock mode, don't complain if we're unable
-	 * to send it all; PQgetResult() will do any additional flushing needed.
-	 */
-	if (pqFlush(conn) < 0)
-		goto sendFailed;
+	if (conn->batch_status == PQBATCH_MODE_OFF)
+	{
+		/*
+		 * Give the data a push.  In nonblock mode, don't complain if we're
+		 * unable to send it all; PQgetResult() will do any additional
+		 * flushing needed.
+		 */
+		if (pqFlush(conn) < 0)
+			goto sendFailed;
+	}
 
 	/* OK, it's launched! */
 	if (conn->batch_status != PQBATCH_MODE_OFF)
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers