[HACKERS] Hard limit on WAL space used (because PANIC sucks)

Heikki Linnakangas Thu, 06 Jun 2013 07:03:13 -0700

In the "Redesigning checkpoint_segments" thread, many people opined that there should be a hard limit on the amount of disk space used for WAL: http://www.postgresql.org/message-id/CA+TgmoaOkgZb5YsmQeMg8ZVqWMtR=6s4-ppd+6jiy4oq78i...@mail.gmail.com. I'm starting a new thread on that, because that's mostly orthogonal to redesigning checkpoint_segments.

The current situation is that if you run out of disk space while writing WAL, you get a PANIC, and the server shuts down. That's awful. We can try to avoid that by checkpointing early enough, so that we can remove old WAL segments to make room for new ones before you run out, but unless we somehow throttle or stop new WAL insertions, it's always going to be possible to use up all disk space. A typical scenario where that happens is when archive_command fails for some reason; even a checkpoint can't remove old, unarchived segments in that case. But it can happen even without WAL archiving.

I've seen a case where it was even worse than a PANIC and shutdown. pg_xlog was on a separate partition that had nothing else on it. The partition filled up, and the system shut down with a PANIC. Because there was no space left, it could not even write the checkpoint after recovery, and thus refused to start up again. There was nothing else on the partition that you could delete to make space. The only recourse would've been to add more disk space to the partition (impossible), or manually delete an old WAL file that was not needed to recover from the latest checkpoint (scary). Fortunately this was a test system, so we just deleted everything.

So we need to somehow stop new WAL insertions from happening, before it's too late.

Peter Geoghegan suggested one method here: http://www.postgresql.org/message-id/flat/cam3swzqcynxvpaskr-pxm8deqh7_qevw7uqbhpcsg1fpsxk...@mail.gmail.com. I don't think that exact proposal is going to work very well; throttling WAL flushing by holding WALWriteLock in WAL writer can have knock-on effects on the whole system, as Robert Haas mentioned. Also, it'd still be possible to run out of space, just more difficult.

To make sure there is enough room for the checkpoint to finish, other WAL insertions have to stop some time before you completely run out of disk space. The question is how to do that.

A naive idea is to check if there's enough preallocated WAL space, just before inserting the WAL record. However, it's too late to check that in XLogInsert; once you get there, you're already holding exclusive locks on data pages, and you are in a critical section so you can't back out. At that point, you have to write the WAL record quickly, or the whole system will suffer. So we need to act earlier.

A more workable idea is to sprinkle checks in higher-level code, before you hold any critical locks, to check that there is enough preallocated WAL. Like, at the beginning of heap_insert, heap_update, etc., and all similar indexam entry points. I propose that we maintain a WAL reservation system in shared memory. First of all, keep track of how much preallocated WAL there is left (and try to create more if needed). Also keep track of a different number: the amount of WAL pre-reserved for future insertions. Before entering the critical section, increase the reserved number with a conservative estimate (i.e. high enough) of how much WAL space you need, and check that there is still enough preallocated WAL to satisfy all the reservations. If not, throw an error or sleep until there is. After you're done with the insertion, release the reservation by decreasing the number again.
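
To make the bookkeeping above concrete, here is a rough sketch of the two counters. It is a toy model in Python, not backend code: the class, method names, and the lock are all hypothetical, and the choice to subtract the bytes actually written from the preallocated total on release is an assumption of this sketch rather than something spelled out above. It throws an error when space is short; sleeping until space appears would be the other option.

```python
import threading


class WalReservations:
    """Toy model of the proposed shared-memory counters (hypothetical names)."""

    def __init__(self, preallocated_bytes):
        self._lock = threading.Lock()           # stands in for a shared-memory lock
        self.preallocated = preallocated_bytes  # preallocated WAL space still unused
        self.reserved = 0                       # sum of all outstanding reservations

    def reserve(self, estimate):
        """Call before entering the critical section, with a conservative estimate."""
        with self._lock:
            if self.reserved + estimate > self.preallocated:
                # No locks on data pages are held yet, so failing here is safe.
                raise RuntimeError("not enough preallocated WAL space")
            self.reserved += estimate

    def release(self, estimate, written):
        """Call after the insertion is done.

        Drops the reservation and (assumption of this sketch) charges the
        bytes actually written against the preallocated space.
        """
        if written > estimate:
            raise ValueError("estimate was not conservative enough")
        with self._lock:
            self.reserved -= estimate
            self.preallocated -= written

    def add_preallocated(self, nbytes):
        """Call when a new segment is preallocated or an old one is recycled."""
        with self._lock:
            self.preallocated += nbytes
```

A caller such as heap_insert would, in this model, call reserve() first, do its work in the critical section, and then call release() with the same estimate and the number of bytes it really wrote.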

A shared reservation counter like that could become a point of contention. One optimization is to keep a constant reservation of, say, 32 KB for each backend. That's enough for most operations. Change the logic so that you check if you've exceeded the reserved amount of space *after* writing the WAL record, while you're holding WALInsertLock anyway. If you do go over the limit, set a flag in backend-private memory indicating that the *next* time you're about to enter a critical section where you will write a WAL record, you check again if more space has been made available.
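
The deferred check might look roughly like this. Again this is only a Python toy model with hypothetical names. It takes one possible reading of "exceeded the reserved amount", namely that a single record was larger than the backend's standing 32 KB reservation, and it assumes some callback exists that reports whether enough WAL space is available again.

```python
BACKEND_RESERVATION = 32 * 1024  # the constant per-backend reservation, 32 KB


class Backend:
    """Toy model of one backend's private state (hypothetical names)."""

    def __init__(self, space_available):
        # space_available: callable returning True once enough WAL space
        # has been made available again (an assumption of this sketch).
        self._space_available = space_available
        self.must_recheck = False  # the flag kept in backend-private memory

    def before_critical_section(self):
        """Runs before any critical locks are held, so erroring out is safe."""
        if self.must_recheck:
            if not self._space_available():
                raise RuntimeError("WAL space exhausted, try again later")
            self.must_recheck = False

    def after_wal_write(self, record_size):
        """Runs after the record is written, while the insert lock is held."""
        if record_size > BACKEND_RESERVATION:
            # Too late to back out of this record; make the next one check first.
            self.must_recheck = True
```

The point of the design is that the common path touches only backend-private state: the shared counters are consulted only after a backend has gone over its standing reservation.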


- Heikki


