Re: [HACKERS] Tracing down buildfarm "postmaster does not shut down" failures

Noah Misch Tue, 09 Feb 2016 22:59:52 -0800

On Mon, Feb 08, 2016 at 10:55:24PM -0500, Tom Lane wrote:
> Noah Misch <n...@leadboat.com> writes:
> > On Mon, Feb 08, 2016 at 02:15:48PM -0500, Tom Lane wrote:
> >> We've seen variants
> >> on this theme on half a dozen machines just in the past week --- and it
> >> seems to mostly happen in 9.5 and HEAD, which is fishy.
> 
> > It has been affecting only the four AIX animals, which do share hardware.
> > (Back in 2015 and once in 2016-01, it did affect axolotl and shearwater.)
> 
> Certainly your AIX critters have shown this a bunch, but here's another
> current example:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2016-02-08%2014%3A49%3A23

Oops; I did not consider Monday's results before asserting that.

> > That's reasonable.  If you would like higher-fidelity data, I can run
> > loops of "pg_ctl -w start; make installcheck; pg_ctl -t900 -w stop", and I
> > could run that for HEAD and 9.2 simultaneously.  A day of logs from that
> > should show clearly if HEAD is systematically worse than 9.2.
> 
> That sounds like a fine plan, please do it.
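
For anyone wanting to repeat the loop described above, here is a minimal
sketch of one way to drive it and record how long each shutdown takes.  It
assumes pg_ctl and make are on PATH, PGDATA points at the test cluster, and the
working directory is a configured source tree; the function names and the
injectable "run" parameter are illustrative, not part of any existing script.

```python
import subprocess
import time


def run_cycle(run=subprocess.run, stop_timeout=900):
    """Run one start / installcheck / stop cycle; return the stop duration in seconds."""
    run(["pg_ctl", "-w", "start"], check=True)
    # check=False: a failing regression test should not skip the shutdown measurement.
    run(["make", "installcheck"], check=False)
    began = time.monotonic()
    run(["pg_ctl", "-t", str(stop_timeout), "-w", "stop"], check=True)
    return time.monotonic() - began


def run_loop(cycles, run=subprocess.run):
    """Repeat the cycle and collect the shutdown duration of each iteration."""
    return [run_cycle(run) for _ in range(cycles)]
```

Calling run_loop(n) against a HEAD cluster and a 9.2 cluster at the same time
would give two lists of shutdown durations to compare.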

Log files:

HEAD: https://drive.google.com/uc?export=download&id=0B9IURs2-_2ZMakl2TjFHUlpvc1k
9.2:  https://drive.google.com/uc?export=download&id=0B9IURs2-_2ZMYVZtY3VqcjBFX1k

While I didn't study those logs in detail, a few things jumped out.  Since
9.2, we've raised the default shared_buffers from 32MB to 128MB, and we've
replaced checkpoint_segments=3 with max_wal_size=1GB.  Both changes encourage
bulkier checkpoints.  The 9.2 test runs get one xlog-driven checkpoint before
the shutdown checkpoint, while HEAD gets one time-driven checkpoint.  Also,
the HEAD suite just tests more things.  Here's pg_stat_bgwriter afterward:

HEAD:
checkpoints_timed     | 156
checkpoints_req       | 799
checkpoint_write_time | 16035847
checkpoint_sync_time  | 6555396
buffers_checkpoint    | 744131
buffers_clean         | 0
maxwritten_clean      | 0
buffers_backend       | 3023444
buffers_backend_fsync | 0
buffers_alloc         | 1777010
stats_reset           | 2016-02-08 21:04:24.499607-08

9.2:
checkpoints_timed     | 39
checkpoints_req       | 1369
checkpoint_write_time | 14875776
checkpoint_sync_time  | 8397536
buffers_checkpoint    | 396272
buffers_clean         | 466392
maxwritten_clean      | 1336
buffers_backend       | 1961531
buffers_backend_fsync | 0
buffers_alloc         | 1681324
stats_reset           | 2016-02-08 21:09:21.925487-08
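
The counters above can be reduced to two derived figures that make the
comparison easier to read: the share of buffer writes done by the
checkpointer, the bgwriter, and the backends, and the average number of
buffers written per checkpoint.  This is a rough sketch that only uses the
numbers quoted above; the helper names are made up for illustration.

```python
HEAD = {
    "checkpoints_timed": 156, "checkpoints_req": 799,
    "buffers_checkpoint": 744131, "buffers_clean": 0,
    "buffers_backend": 3023444,
}
V92 = {
    "checkpoints_timed": 39, "checkpoints_req": 1369,
    "buffers_checkpoint": 396272, "buffers_clean": 466392,
    "buffers_backend": 1961531,
}

WRITERS = ("buffers_checkpoint", "buffers_clean", "buffers_backend")


def write_shares(stats):
    """Fraction of all buffer writes attributed to each writer."""
    total = sum(stats[name] for name in WRITERS)
    return {name: stats[name] / total for name in WRITERS}


def buffers_per_checkpoint(stats):
    """Average buffers written per checkpoint, timed and requested combined."""
    checkpoints = stats["checkpoints_timed"] + stats["checkpoints_req"]
    return stats["buffers_checkpoint"] / checkpoints


for label, stats in (("HEAD", HEAD), ("9.2", V92)):
    shares = ", ".join(f"{k}={v:.1%}" for k, v in write_shares(stats).items())
    print(f"{label}: {buffers_per_checkpoint(stats):.0f} buffers/checkpoint; {shares}")
```

On these figures HEAD averages roughly 779 buffers per checkpoint against
roughly 281 for 9.2, and the bgwriter accounts for none of HEAD's writes versus
about 17% in 9.2.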

Most notable there is the lack of bgwriter help in HEAD.  The clusters had
initdb-default configuration apart from these additions:

listen_addresses=''
log_line_prefix = '%p %m '
logging_collector = on
log_autovacuum_min_duration = 0
log_checkpoints = on
log_lock_waits = on
log_temp_files = 128kB

I should have added fsync=off, too.  Notice how the AIX animals failed left
and right today, likely thanks to contention from these runs.

On Tue, Feb 09, 2016 at 02:10:50PM -0500, Tom Lane wrote:
> (1) Slow file system, specifically slow unlink, is the core of the
> problem.  (I wonder if the AIX critters are using an NFS filesystem?)

The buildfarm files lie on 600 GiB SAS disks.  I suspect metadata operations,
like unlink(), bottleneck on the jfs2 journal.
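
One way to test that suspicion would be a small probe that times file creation
and unlink() separately on the filesystem in question.  The sketch below is
only an illustration; the function name, file names, and default count are
arbitrary, and it says nothing about jfs2 internals.

```python
import os
import time


def time_unlinks(directory, count=1000):
    """Create `count` empty files in `directory`, then unlink them.

    Returns (create_seconds, unlink_seconds).
    """
    paths = [os.path.join(directory, f"unlink_probe_{i}") for i in range(count)]

    began = time.perf_counter()
    for path in paths:
        with open(path, "wb"):
            pass
    create_seconds = time.perf_counter() - began

    began = time.perf_counter()
    for path in paths:
        os.unlink(path)
    unlink_seconds = time.perf_counter() - began

    return create_seconds, unlink_seconds
```

Running it against a directory on the buildfarm disks and against a directory
on another filesystem would show whether unlink() is unusually slow there.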

On Tue, Feb 09, 2016 at 03:05:01PM -0500, Tom Lane wrote:
> I'm now in favor of applying the PGCTLTIMEOUT patch Noah proposed, and
> *removing* the two existing hacks in run_build.pl that try to force -t 120.
> 
> The only real argument I can see against that approach is that we'd have
> to back-patch the PGCTLTIMEOUT patch to all active branches if we want
> to stop the buildfarm failures.  We don't usually back-patch feature
> additions.  On the other hand, this wouldn't be the first time we've
> back-patched something on grounds of helping the buildfarm, so I find
> that argument pretty weak.

If I were a purist against back-patching features, I might name the variable
PGINTERNAL_TEST_PGCTLTIMEOUT and not document it.  Meh to that.  I'll plan to
commit the original tomorrow.
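
As a sketch of how a test driver could use the variable, assuming the patch
makes pg_ctl read PGCTLTIMEOUT as its default wait timeout in seconds, the
environment for a pg_ctl call might be built like this.  The helper name is
hypothetical.

```python
import os


def pg_ctl_env(timeout_seconds, base=None):
    """Return a copy of the environment with PGCTLTIMEOUT set.

    Assumption: pg_ctl treats PGCTLTIMEOUT as the default for -t, in seconds.
    """
    env = dict(os.environ if base is None else base)
    env["PGCTLTIMEOUT"] = str(timeout_seconds)
    return env
```

A caller would then pass env=pg_ctl_env(900) to subprocess.run(["pg_ctl",
"-w", "stop"], ...) instead of adding -t to every command line.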

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
