In the "Redesigning checkpoint_segments" thread, many people opined that there should be a hard limit on the amount of disk space used for WAL: http://www.postgresql.org/message-id/CA+TgmoaOkgZb5YsmQeMg8ZVqWMtR=6s4-ppd+6jiy4oq78i...@mail.gmail.com. I'm starting a new thread on that, because that's mostly orthogonal to redesigning checkpoint_segments.

The current situation is that if you run out of disk space while writing WAL, you get a PANIC, and the server shuts down. That's awful. We can try to avoid that by checkpointing early enough, so that we can remove old WAL segments to make room for new ones before you run out, but unless we somehow throttle or stop new WAL insertions, it's always going to be possible to use up all disk space. A typical scenario where that happens is when archive_command fails for some reason; even a checkpoint can't remove old, unarchived segments in that case. But it can happen even without WAL archiving.

I've seen a case where it was even worse than a PANIC and shutdown. pg_xlog was on a separate partition that had nothing else on it. The partition filled up, and the system shut down with a PANIC. Because there was no space left, it could not even write the checkpoint after recovery, and thus refused to start up again. There was nothing else on the partition that you could delete to make space. The only recourse would've been to add more disk space to the partition (impossible), or to manually delete an old WAL file that was not needed to recover from the latest checkpoint (scary). Fortunately this was a test system, so we just deleted everything.

So we need to somehow stop new WAL insertions from happening, before it's too late.

Peter Geoghegan suggested one method here: http://www.postgresql.org/message-id/flat/cam3swzqcynxvpaskr-pxm8deqh7_qevw7uqbhpcsg1fpsxk...@mail.gmail.com. I don't think that exact proposal is going to work very well; throttling WAL flushing by holding WALWriteLock in WAL writer can have knock-on effects on the whole system, as Robert Haas mentioned. Also, it'd still be possible to run out of space, just more difficult.

To make sure there is enough room for the checkpoint to finish, other WAL insertions have to stop some time before you completely run out of disk space. The question is how to do that.

A naive idea is to check if there's enough preallocated WAL space, just before inserting the WAL record. However, it's too late to check that in XLogInsert; once you get there, you're already holding exclusive locks on data pages, and you are in a critical section so you can't back out. At that point, you have to write the WAL record quickly, or the whole system will suffer. So we need to act earlier.

A more workable idea is to sprinkle checks in higher-level code, before you hold any critical locks, to check that there is enough preallocated WAL. For example, at the beginning of heap_insert, heap_update, etc., and all similar indexam entry points. I propose that we maintain a WAL reservation system in shared memory. First of all, keep track of how much preallocated WAL there is left (and try to create more if needed). Also keep track of a different number: the amount of WAL pre-reserved for future insertions. Before entering the critical section, increase the reserved number by a conservative estimate (i.e. high enough) of how much WAL space you need, and check that there is still enough preallocated WAL to satisfy all the reservations. If not, throw an error or sleep until there is. After you're done with the insertion, release the reservation by decreasing the number again.
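
To make the reservation idea concrete, here is a minimal sketch in C11. It is not a patch: the struct, the function names (wal_reserve, wal_release) and the use of <stdatomic.h> are all hypothetical stand-ins for whatever shared-memory machinery a real implementation would use. It also assumes the caller knows how much WAL it actually wrote, so that the preallocated figure can be reduced by that amount on release.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical shared-memory state; these names are illustrative only. */
typedef struct WalReservationState
{
    _Atomic uint64_t prealloc_bytes;  /* preallocated WAL space not yet used */
    _Atomic uint64_t reserved_bytes;  /* sum of all outstanding reservations */
} WalReservationState;

/*
 * Call before entering the critical section, with a conservative (high)
 * estimate of the WAL needed. Returns false if the reservation cannot be
 * satisfied; the caller would then throw an error or sleep and retry.
 */
static bool
wal_reserve(WalReservationState *st, uint64_t estimate)
{
    uint64_t reserved = atomic_fetch_add(&st->reserved_bytes, estimate) + estimate;

    if (reserved > atomic_load(&st->prealloc_bytes))
    {
        /* Not enough preallocated WAL for everyone: back out. */
        atomic_fetch_sub(&st->reserved_bytes, estimate);
        return false;
    }
    return true;
}

/*
 * Call after the insertion is done. 'used' is the WAL actually written,
 * which must not exceed the estimate that was reserved.
 */
static void
wal_release(WalReservationState *st, uint64_t estimate, uint64_t used)
{
    assert(used <= estimate);
    /* Shrink the free space first, so a concurrent check errs on the safe side. */
    atomic_fetch_sub(&st->prealloc_bytes, used);
    atomic_fetch_sub(&st->reserved_bytes, estimate);
}
```

A caller such as heap_insert would call wal_reserve() before taking buffer locks, and wal_release() once the record has been written. Creating more preallocated segments, and waking up sleepers when space appears, are left out of the sketch.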

A shared reservation counter like that could become a point of contention. One optimization is to keep a constant reservation of, say, 32 KB for each backend. That's enough for most operations. Change the logic so that you check whether you've exceeded the reserved amount of space *after* writing the WAL record, while you're holding WALInsertLock anyway. If you do go over the limit, set a flag in backend-private memory indicating that the *next* time you're about to enter a critical section where you will write a WAL record, you check again whether more space has been made available.
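
Roughly, the per-backend variant could look like the sketch below. Again, all names are hypothetical. The space_available argument stands in for whatever shared-memory check would really be done, and resetting the backend's usage counter after a successful recheck is my assumption about how the standing reservation gets replenished.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BACKEND_WAL_RESERVATION (32 * 1024)  /* the 32 KB figure from above */

/* Hypothetical backend-private state; no locking needed to touch it. */
typedef struct BackendWalState
{
    uint64_t used_bytes;    /* WAL written against the standing reservation */
    bool     must_recheck;  /* set once the reservation has been exceeded */
} BackendWalState;

/*
 * Call after writing a WAL record, while the insert lock is held anyway.
 * Only notes that the limit was exceeded; never blocks or fails here.
 */
static void
backend_wal_account(BackendWalState *bs, uint64_t record_bytes)
{
    assert(record_bytes > 0);
    bs->used_bytes += record_bytes;
    if (bs->used_bytes > BACKEND_WAL_RESERVATION)
        bs->must_recheck = true;
}

/*
 * Call before entering a critical section that will write WAL.
 * Returns true if the backend may proceed. 'space_available' stands in
 * for the real check against shared state (an assumption of this sketch).
 */
static bool
backend_wal_precheck(BackendWalState *bs, bool space_available)
{
    if (!bs->must_recheck)
        return true;            /* fast path: no shared memory touched */
    if (!space_available)
        return false;           /* caller would error out or sleep */

    /* Assumed behaviour: a successful recheck renews the reservation. */
    bs->used_bytes = 0;
    bs->must_recheck = false;
    return true;
}
```

The point of the design is that the common case only reads a backend-local flag; the shared state is consulted only after a backend has actually gone over its 32 KB.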

- Heikki

