So I put in the patch I'd proposed to reduce sleep delays in the stats regression test, and I see that frogmouth has now failed that test twice, with symptoms suggesting that it's dropping the last stats report --- but not all of the stats reports --- from the test's first session. I considered reverting the test patch, but on closer inspection that seems like it would be shooting the messenger, because this is indicating a real and reproducible loss of stats data.
I put in some debug elog's to see how much data is getting shoved at the stats collector during this test, and it seems that it can be as much as about 12K, between what the first session can send at exit and what the second session will send immediately after startup. (I think this value should be pretty platform-independent, but be it noted that I'm measuring on a 64-bit Linux system while frogmouth is 32-bit Windows.)

Now, what's significant about that is that frogmouth is a pretty old Windows version, and what I read on the net is that Windows versions before 2012 have only 8KB socket receive buffer size by default. So this behavior is plausibly explained by the theory that the stats collector's receive buffer is overflowing, causing loss of stats messages. This could well explain the stats-test failures we've seen in the past too, which if memory serves were mostly on Windows.

Also, it's clear that a session could easily shove much more than 8KB at a time out to the stats collector, because what we're doing in the stats test does not involve touching any very large number of tables. So I think this is not just a test failure but is telling us about a plausible mechanism for real-world statistics drops.

I observe a default receive buffer size around 124K on my Linux box, which seems like it'd be far less prone to overflow than 8K. I propose that it'd be a good idea to try to set the stats socket's receive buffer size to be a minimum of, say, 100K on all platforms. Code for this would be analogous to what we already have in pqcomm.c (circa line 760) for forcing up the send buffer size, but SO_RCVBUF not SO_SNDBUF.

A further idea is that maybe backends should be tweaked to avoid blasting large amounts of data at the stats collector in one go. That would require more thought to design, though.

Thoughts?
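For concreteness, here is a minimal sketch of the receive-buffer idea. It is not the pqcomm.c code and not a proposed patch: the names ensure_min_rcvbuf and STAT_MIN_RCVBUF are made up for illustration, the 100K figure is just the value floated above, and the includes assume a POSIX system (Windows would need the Winsock headers instead).

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical constant: the "say, 100K" minimum suggested above. */
#define STAT_MIN_RCVBUF (100 * 1024)

/*
 * Hypothetical helper: raise a socket's receive buffer to at least
 * STAT_MIN_RCVBUF, leaving it alone if the platform default is
 * already that large.  Returns 0 on success, -1 if getsockopt() or
 * setsockopt() fails.
 */
static int
ensure_min_rcvbuf(int sock)
{
    int         old_rcvbuf;
    int         new_rcvbuf;
    socklen_t   rcvbufsize = sizeof(old_rcvbuf);

    /* Ask the kernel what the current receive buffer size is. */
    if (getsockopt(sock, SOL_SOCKET, SO_RCVBUF,
                   (char *) &old_rcvbuf, &rcvbufsize) < 0)
        return -1;

    /* Never shrink a buffer that is already big enough. */
    if (old_rcvbuf >= STAT_MIN_RCVBUF)
        return 0;

    /* Request the larger size; the kernel may still clamp it. */
    new_rcvbuf = STAT_MIN_RCVBUF;
    if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF,
                   (char *) &new_rcvbuf, sizeof(new_rcvbuf)) < 0)
        return -1;

    return 0;
}
```

A failure here probably ought to be logged and otherwise ignored rather than treated as fatal, since the socket still works with the smaller buffer. Note also that a successful setsockopt() does not guarantee the requested size: on Linux, for instance, the request is silently capped at net.core.rmem_max.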
regards, tom lane