Ok, fixed it! :D Posting here for future me (and others like me). It would seem (having not read the kernel source) that increasing the kernel receive buffer sizes (rmem_default / rmem_max) does *not* take effect for any process that is *already* bound or listening on a port/socket. I had previously assumed this was a global kernel buffer; apparently not.
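For the record, this is roughly what I'd been doing (values purely illustrative, not a recommendation):

    # bump the default and maximum socket receive buffers
    sysctl -w net.core.rmem_default=8388608
    sysctl -w net.core.rmem_max=16777216

    # the part I'd been missing: restart so the collector
    # re-creates its socket ($PGDATA here is just a placeholder)
    pg_ctl restart -D $PGDATA

As I understand it, a socket's receive buffer is sized from rmem_default at the moment the socket is created (a process can ask for more via setsockopt(SO_RCVBUF), capped at rmem_max), so a collector that is already running never sees the new values. You can sanity-check the buffer actually attached to the socket with something like "ss -u -m -p" and looking at the rb value in the skmem output.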
I had been trying to manage these buffer sizes to resolve the UDP drop issues, but at no point had I restarted the statistics collector process. I restarted the cluster as a last resort (something I had tried numerous times *before* playing with the buffer sizes) and lo and behold!! No more buffer drops! Problem solved.

The pgss_query_texts.stat file still wants to live in the default *pg_stat_tmp* directory, whether by design or not.. but that's a non-issue for me now.

Thanks for listening :)

On Wed, Apr 19, 2017 at 7:36 PM Tim Kane <tim.k...@gmail.com> wrote:

> Well, this is frustrating..
> The buffer drops are still occurring - so I thought it worth trying to use
> a ramdisk and setting *stats_temp_directory* accordingly.
>
> I've reloaded the instance, and can see that the stats directory is now
> being populated in the new location. *Except* - there is one last file
> (pgss_query_texts.stat) that continues to be updated in the *old*
> pg_stat_tmp path.. Is that supposed to happen?
>
> Fairly similar to this guy (but not quite the same):
>
> https://www.postgresql.org/message-id/d6e71befad7beb4fbcd8ae74fadb1265011bb40fc...@win-8-eml-ex1.eml.local
>
> I can see the packets arriving and being consumed by the collector.. and
> the collector is indeed updating in the new stats_temp_directory.. just
> not for that one file.
>
> It also failed to resolve the buffer drops.. At this point, I'm not sure I
> expected it to. They tend to occur semi-regularly (every 8-13 minutes) but
> I can't correlate them with any kind of activity (and if I'm honest, it's
> possibly starting to drive me a little bit mad).
>
> On Tue, Apr 18, 2017 at 2:53 PM Tim Kane <tim.k...@gmail.com> wrote:
>
>> Okay, so I've run an strace on the collector process during a buffer
>> drop event. I can see evidence of a recvfrom loop pulling in a *maximum*
>> of 142 KB.
>>
>> While I'd already increased rmem_max, it would appear this is not being
>> observed by the kernel. rmem_default is set to 124 KB, which would
>> explain the above read maxing out just slightly beyond this (presuming a
>> ring buffer filling up behind the read).
>>
>> I'm going to try increasing rmem_default and see if it has any positive
>> effect.. (and then investigate why the kernel doesn't want to consider
>> rmem_max)..
>>
>> On Tue, Apr 18, 2017 at 8:05 AM Tim Kane <tim.k...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I'm seeing sporadic (but frequent) UDP buffer drops on a host that so
>>> far I've not been able to resolve.
>>>
>>> The drops are originating from postgres processes, and from what I know
>>> the only UDP traffic generated by postgres should be consumed by the
>>> statistics collector - but for whatever reason, it's failing to read
>>> the packets quickly enough.
>>>
>>> Interestingly, I'm seeing these drops occur even when the system is
>>> idle.. but every 15 minutes or so (not consistently enough to isolate
>>> any particular activity) we'll see on the order of ~90 packets dropped
>>> at a time.
>>>
>>> I'm running 9.6.2, but the issue was previously occurring on 9.2.4 (on
>>> the same hardware).
>>>
>>> If it's relevant.. there are two instances of postgres running (and
>>> consequently, two instances of the stats collector process), though one
>>> of those instances is most definitely idle for most of the day.
>>>
>>> In an effort to resolve the problem, I've doubled the UDP receive
>>> buffer sizes on the host - but it seems to have had no effect.
>>>
>>> cat /proc/sys/net/core/rmem_max
>>> 1677216
>>>
>>> The following parameters are configured:
>>>
>>> track_activities   on
>>> track_counts       on
>>> track_functions    none
>>> track_io_timing    off
>>>
>>> There are approximately 80-100 connections at any given time.
>>>
>>> It seems that the issue started a few weeks ago, around the time of a
>>> reboot on the given host... but it's difficult to know what (if
>>> anything) has changed, or why :-/
>>>
>>> Incidentally... the documentation doesn't seem to have any mention of
>>> UDP whatsoever. I'm going to use this as an opportunity to dive into
>>> the source - but perhaps it's worth improving the documentation around
>>> this?
>>>
>>> My next step is to try disabling track_activities and track_counts to
>>> see if they improve matters any, but I wouldn't expect these to
>>> generate enough data to flood the UDP buffers :-/
>>>
>>> Any ideas?
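PS: for anyone else chasing something similar, this is roughly how I'd been watching the drops happen (commands from memory, so treat as a sketch):

    # system-wide UDP counters; watch "packet receive errors" /
    # "receive buffer errors" climb in step with the drops
    netstat -su

    # per-socket view; the last column of /proc/net/udp is a drops
    # counter, so you can see which socket is losing packets
    cat /proc/net/udp

Matching the inode column of /proc/net/udp against the socket:[inode] entries in ls -l /proc/<pid>/fd is one way to tie a climbing drops counter to a specific process (in my case, the collector).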