While fooling with the NetBSD-vs-libpython issue noted in a nearby thread, I observed that the core regression tests sometimes hang up in the "stats" test on this platform (NetBSD 8.1/amd64). Investigation found that the stats collector process was sometimes exiting like this:
2019-10-29 19:38:14.563 EDT [7018] FATAL: could not read statistics message: No buffer space available 2019-10-29 19:38:14.563 EDT [7932] LOG: statistics collector process (PID 7018) exited with exit code 1 The postmaster then restarts the collector, but possibly with a time delay (see the PGSTAT_RESTART_INTERVAL respawn throttling logic). This seems to interact badly with the wait-for-stats-collector logic in backend_read_statsfile, so that each cycle of the wait_for_stats() test function takes a long time ... and we will do 300 of those unconditionally. (Possibly wait_for_stats ought to be modified so that it pays attention to elapsed wall-clock time rather than iterating for a fixed number of times?) NetBSD's recv() man page glosses ENOBUFS as "A message was not delivered because it would have overflowed the buffer", but I don't believe that's actually what's happening. (Just to be sure, I added an Assert on the sending side that no message exceeds sizeof(PgStat_Msg). I wonder why we didn't have one already.) Trawling the NetBSD kernel code, it seems like ENOBUFS could get returned as a result of transient shortages of kernel working memory --- most of the uses of that error code seem to be on the sending side, but I found some that seem to be in receiving code. In short: it's evidently possible to get ENOBUFS as a transient failure condition on this platform, and having the stats collector die seems like an overreaction. I'm inclined to have it log the error and press on, instead. Looking at the POSIX spec for recv() suggests that ENOMEM is another plausible transient failure, so maybe we should do the same with that. Roughly: if (len < 0) { + /* silently ignore these cases (no data available) */ if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR) break; /* out of inner loop */ + /* noisily ignore these cases (soft errors) */ + if (errno == ENOBUFS || errno == ENOMEM) + { + ereport(LOG, + (errcode_for_socket_access(), + errmsg("could not read statistics message: %m"))); + break; /* out of inner loop */ + } + /* hard failure, but maybe hara-kiri will fix it */ ereport(ERROR, (errcode_for_socket_access(), errmsg("could not read statistics message: %m"))); } A variant idea is to treat all unexpected errnos as LOG-and-push-on, but maybe we ought to have a limit on how many times we'll do that before erroring out. Thoughts? regards, tom lane