On Thu, Oct 30, 2014 at 10:46 AM, Robert Haas <robertmh...@gmail.com> wrote:
> (9.3.5 problem report)

I think I saw a similar issue on a 9.3.5 instance that was affected by the "in pg_upgrade, remove pg_multixact files left behind by initdb" issue (I ran the remediation recommended in the 9.3.5 release notes). Multiple anti-wraparound vacuums were stuck following a PITR. I resolved this (as far as I can tell) by killing the autovacuum workers and manually running VACUUM FREEZE. I have yet to do any root cause analysis, but I think I could reproduce the problem.

> The fundamental structure of that function is an infinite loop. We
> break out of that loop when BTEntrySame(item, &stack->bts_btentry) or
> P_RIGHTMOST(opaque) and I'm sure that it's correct to think that, in
> theory, one of those things will eventually happen.

Not in theory - only in practice. L&Y specifically state:

"We wish to point out here that our algorithms do not prevent the possibility of livelock (where one process runs indefinitely). This can happen if a process never terminates because it keeps having to follow link pointers created by other processes. This might happen in the case of a process being run on a (relatively) very slow processor in a multiprocessor system".

> But the index
> could be corrupted, most obviously by having a page where
> opaque->btpo_next points back to the current block number. If that
> happens, you need an immediate shutdown (or some clever gdb hackery)
> to terminate the VACUUM. That's unfortunate and unnecessary.

Merlin reported a bug that looked exactly like this. Hardware failure may now explain the problem.

> It also looks like something we can fix, at a minimum by adding a
> CHECK_FOR_INTERRUPTS() at the top of that loop, or in some function
> that it calls, like _bt_getbuf(), so that if it goes into an infinite
> loop, it can at least be killed.

I think it might be a good idea to detect circular _bt_moveright() moves (the direct offender in Merlin's case, which has very similar logic to your _bt_getstackbuf() problem case).
I'm pretty sure that it's exceptional for there to be more than 2 or 3 retries in _bt_moveright(). It would probably be fine to consider the possibility that we'll never finish once we get past 5 retries or something like that. We'd then start keeping track of blocks visited, and raise an error when a page was visited a second time.

--
Peter Geoghegan
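
To make the detection idea above concrete, here is a rough standalone sketch in plain C. It is not nbtree code and not a proposed patch: the right_link array merely stands in for following btpo_next, and walk_right(), WalkResult, INVALID_BLOCK and RETRY_THRESHOLD are all hypothetical names invented for illustration. The threshold of 5 is just the "5 retries or something like that" figure, not a tuned value.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define INVALID_BLOCK   UINT32_MAX  /* hypothetical "no right sibling" marker */
#define RETRY_THRESHOLD 5           /* moves allowed before tracking starts */

typedef enum
{
    WALK_FOUND,       /* reached the target block */
    WALK_RIGHTMOST,   /* ran off the right end without finding it */
    WALK_CORRUPT,     /* revisited a block, or followed an out-of-range link */
    WALK_OOM          /* could not allocate the visited map */
} WalkResult;

/*
 * Follow right-links from 'start' until 'target' is reached or the
 * rightmost block is hit.  right_link[b] is the right sibling of block b.
 * Caller must ensure start < nblocks.
 *
 * The first RETRY_THRESHOLD moves cost nothing extra.  Only after that do
 * we record every block we land on, and report corruption as soon as one
 * is seen a second time.
 */
static WalkResult
walk_right(const uint32_t *right_link, uint32_t nblocks,
           uint32_t start, uint32_t target)
{
    bool       *visited = NULL;
    uint32_t    blkno = start;
    uint32_t    moves = 0;
    uint32_t    next;
    WalkResult  result;

    for (;;)
    {
        /* In a backend, a CHECK_FOR_INTERRUPTS() would belong here too. */
        if (blkno == target)
        {
            result = WALK_FOUND;
            break;
        }

        if (moves >= RETRY_THRESHOLD)
        {
            if (visited == NULL)
            {
                /* Lazily allocate: the common case never gets this far. */
                visited = calloc(nblocks, sizeof(bool));
                if (visited == NULL)
                {
                    result = WALK_OOM;
                    break;
                }
            }
            if (visited[blkno])
            {
                result = WALK_CORRUPT;  /* same block seen twice: a cycle */
                break;
            }
            visited[blkno] = true;
        }

        next = right_link[blkno];
        if (next == INVALID_BLOCK)
        {
            result = WALK_RIGHTMOST;
            break;
        }
        if (next >= nblocks)
        {
            result = WALK_CORRUPT;      /* link points outside the relation */
            break;
        }
        blkno = next;
        moves++;
    }

    free(visited);
    return result;
}
```

With a right_link array of {0, INVALID_BLOCK}, where block 0 points back at itself, walk_right() returns WALK_CORRUPT after a handful of moves instead of spinning forever. Starting the tracking late still catches every cycle, because once recording begins the walk must eventually return to a recorded block. A flat bool array is only the simplest thing for a sketch; a real implementation would presumably want something cheaper than one entry per block.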