On Sun, Jan 8, 2017 at 11:36 AM, Greg Stark <st...@mit.edu> wrote:
> On 8 January 2017 at 17:26, Greg Stark <st...@mit.edu> wrote:
> > On 5 January 2017 at 19:01, Andres Freund <and...@anarazel.de> wrote:
> >> That's a bit odd - shouldn't the OS network stack take care of this in
> >> both cases? I mean either is too big for TCP packets (including jumbo
> >> frames). What type of OS and network is involved here?
> > 2x may be plausible. The first 128k goes out, then the rest queues up
> > until the first ack comes back. Then the next 128kB goes out again
> > without waiting... I think this is what Nagle is supposed to actually
> > address but either it may be off by default these days or our usage
> > pattern may be defeating it in some way.
> Hm. That wasn't very clear. And the more I think about it, it's not right.
> The first block of data -- one byte in the worst case, 128kB in our
> case -- gets put in the output buffers and since there's nothing
> stopping it it immediately gets sent out. Then all the subsequent data
> gets put in output buffers but buffers up due to Nagle. Until there's
> a full packet of data buffered, the ack arrives, or the timeout
> expires, at which point the buffered data drains efficiently in full
> packets. Eventually it all drains away and the next 128kB arrives and
> is sent out immediately.
> So most packets are full size with the occasional 128kB packet thrown
> in whenever the buffer empties. And I think even when the 128kB packet
> is pending Nagle only stops small packets, not full packets, and the
> window should allow more than one packet of data to be pending.
> So, uh, forget what I said. Nagle should be our friend here.
[I have not done a rigorous analysis here, but...]
I *think* libpq is the culprit here.
walsender says "Hey, libpq - please send (up to) 128KB of data!" and
doesn't "return" until that data has been handed off. Only then does it
send more. Regardless of the underlying cause (Nagle, TCP congestion
control algorithms, umpteen different combos of hardware and settings,
etc.), in almost every test I saw improvement (usually quite a bit). This
was most easily observable with high bandwidth-delay product links, but my
time in the lab was somewhat limited.
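To see why a send-and-wait pattern hurts most on high bandwidth-delay
product links, here is a rough back-of-the-envelope sketch (the 128KB
block size comes from the discussion above; the RTT values are made-up
examples, not measurements):

```python
# Stop-and-wait ceiling: if the sender ships one 128KB block and then
# stalls until the far end acknowledges it, throughput can never exceed
# block_size / round_trip_time, no matter how fat the pipe is.

BLOCK_SIZE = 128 * 1024  # bytes, per the walsender/libpq discussion

def stop_and_wait_ceiling(rtt_seconds):
    """Best-case throughput (bytes/sec) when each block waits a full RTT."""
    return BLOCK_SIZE / rtt_seconds

# Assumed example RTTs, just for illustration:
for rtt_ms in (1, 10, 100):
    bps = stop_and_wait_ceiling(rtt_ms / 1000.0)
    print(f"RTT {rtt_ms:3d} ms -> at most {bps / (1024 * 1024):.2f} MB/s")
```

At a 100 ms RTT the ceiling is 1.25 MB/s regardless of link capacity,
which matches the intuition that keeping more data in flight should help
most on long fat pipes.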
I calculated "performance" with the simplest measurement possible: how
long it took for Y volume of data to get transferred, performed over a
long-enough interval (typically 1800 seconds) for TCP windows to open up.
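That measurement reduces to simple arithmetic; a minimal sketch of the
computation (the volume below is a placeholder, not a result from the
tests described above):

```python
def throughput_mb_per_sec(bytes_transferred, elapsed_seconds):
    """Volume of data divided by wall-clock time, in MB/s."""
    return bytes_transferred / (1024 * 1024) / elapsed_seconds

# Hypothetical run: 90 GB moved over the 1800-second window mentioned above.
volume = 90 * 1024 ** 3
print(f"{throughput_mb_per_sec(volume, 1800):.1f} MB/s")  # prints "51.2 MB/s"
```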
Dyn / Principal Software Engineer