A colleague at EnterpriseDB today ran into a situation on PostgreSQL 9.3.5 where the server went into an infinite loop while attempting a VACUUM FREEZE; it couldn't escape _bt_getstackbuf(), and it couldn't be killed with ^C. I think we should add a check for interrupts into that loop somewhere; and possibly make some attempt to notice if we've been iterating for longer than, say, the lifetime of the universe until now.
The fundamental structure of that function is an infinite loop. We break out of that loop when BTEntrySame(item, &stack->bts_btentry) or P_RIGHTMOST(opaque), and I'm sure it's correct to think that, in theory, one of those things will eventually happen. But the index could be corrupted, most obviously by having a page where opaque->btpo_next points back to the current block number. If that happens, you need an immediate shutdown (or some clever gdb hackery) to terminate the VACUUM. That's unfortunate and unnecessary.

It also looks like something we can fix, at a minimum by adding a CHECK_FOR_INTERRUPTS() at the top of that loop, or in some function that it calls, like _bt_getbuf(), so that if it goes into an infinite loop, it can at least be killed.

We could also consider adding a check at the bottom of the loop, just before setting blkno = opaque->btpo_next, that those values are unequal; if they are equal, elog(). Clearly it's possible to have a cycle of length >1, and such a check wouldn't catch that, but it might still be worth checking for the trivial case. Or, we could try to put an upper bound on the number of iterations that are reasonable and error out if we exceed that value. That might be tricky, though; it's not obvious to me that there's any comfortably small upper bound.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company