Heikki Linnakangas wrote:
Tom Lane wrote:
I had an idea this morning that might be useful: back off the strength
of what we try to guarantee. Specifically, does it matter if we leak a
file on crash, as long as it isn't occupying a lot of disk space?
(I suppose if you had enough crashes to accumulate many thousands of
leaked files, the directory entries would start to be a performance drag,
but if your DB crashes that much you have other problems.) This leads
to the idea that we don't really need to protect the open(O_CREAT) per
se. Rather, we can emit a WAL entry *after* successful creation of a
file, while it's still empty. This eliminates all the issues about
logging an action that might fail. The WAL entry would need to include
the relfilenode and the creating XID. Crash recovery would track these
until it saw the commit or abort or prepare record for the XID, and if
it didn't find any, would remove the file.
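
A minimal sketch of the recovery-side bookkeeping described above, not PostgreSQL code. The record layout (`WalRecord`) and the function name `files_to_remove` are hypothetical, made up for illustration. The only behaviour taken from the proposal is that a file-creation record carries the relfilenode and the creating XID, and that a file is removed if no commit, abort or prepare record for that XID is seen.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WalRecord:
    # Hypothetical record shape; real WAL records look nothing like this.
    kind: str                      # "create_file", "commit", "abort" or "prepare"
    xid: int
    relfilenode: Optional[int] = None


def files_to_remove(wal):
    """Replay the records and return files whose creating XID never resolved."""
    pending = {}  # xid -> set of relfilenodes created by that xid
    for rec in wal:
        if rec.kind == "create_file":
            pending.setdefault(rec.xid, set()).add(rec.relfilenode)
        elif rec.kind in ("commit", "abort", "prepare"):
            # The transaction's fate is known, so stop tracking its files.
            pending.pop(rec.xid, None)
    return sorted(f for files in pending.values() for f in files)


wal = [
    WalRecord("create_file", 100, 16384),
    WalRecord("create_file", 101, 16385),
    WalRecord("commit", 100),
]
leaked = files_to_remove(wal)
```

Here XID 101 never reaches a commit, abort or prepare record, so only its file is reported as removable.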
That idea, like all other approaches based on tracking WAL records, fails
if there's a checkpoint after the WAL record (and that's quite likely to
happen if the file is large). WAL replay wouldn't see the file creation
WAL entry, and wouldn't know to track the xid. We'd need a way to carry
the information over checkpoints.
Yes, checkpoints would need to include a list of created-but-yet-uncommitted
files. I think the hardest part is figuring out a way to get that information
to the backend doing the checkpoint - my idea was to track them in shared
memory, but that would impose a hard limit on the number of concurrent
file creations. Not nice :-(
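
To make the checkpoint problem concrete, here is a small sketch under the same toy model. The tuple layout and the `carried` argument are assumptions for illustration only. Replay starts after the last checkpoint, so a creation record written before it is never seen unless the checkpoint itself carries the list of created-but-uncommitted files.

```python
def replay(wal, start, carried=None):
    """Replay wal[start:]; `carried` maps xid -> relfilenodes saved at the checkpoint."""
    pending = {xid: set(files) for xid, files in (carried or {}).items()}
    for kind, xid, relfilenode in wal[start:]:
        if kind == "create_file":
            pending.setdefault(xid, set()).add(relfilenode)
        elif kind in ("commit", "abort", "prepare"):
            pending.pop(xid, None)
        # Any other record kind (e.g. "checkpoint") is ignored here.
    return sorted(f for files in pending.values() for f in files)


wal = [
    ("create_file", 101, 16385),   # written before the checkpoint
    ("checkpoint", 0, None),
    ("create_file", 102, 16386),   # written after the checkpoint
]
redo_start = 2  # recovery begins just after the checkpoint record

without_list = replay(wal, redo_start)
with_list = replay(wal, redo_start, carried={101: {16385}})
```

Without the carried list, file 16385 is silently leaked; with it, both uncommitted files are found.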
But wait... I just had an idea.
We already have such a central list of created-but-uncommitted
files - pg_class itself. There is a small window between file creation
and inserting the name into pg_class - but as Tom says, if we leak it then,
it won't use up much space anyway.
So maybe we should just scan pg_class on VACUUM, and obtain a list of files
that are referenced only from DEAD tuples. Those files we can then safely delete.
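
Roughly, the scan would compute something like the following. This is a toy model, not real catalog access: each pg_class tuple is reduced to a hypothetical `(relfilenode, state)` pair, and anything that is not "dead" (live, or inserted by a still-running transaction) protects its file.

```python
def removable_files(pg_class_tuples):
    """Return relfilenodes that are referenced only from dead tuples."""
    dead, protected = set(), set()
    for relfilenode, state in pg_class_tuples:
        if state == "dead":
            dead.add(relfilenode)
        else:
            # Live and in-progress tuples both keep the file.
            protected.add(relfilenode)
    return sorted(dead - protected)


tuples = [
    (16384, "live"),
    (16384, "dead"),         # old row version of a live table
    (16385, "dead"),         # table created by an aborted transaction
    (16386, "in_progress"),  # creating transaction still running
]
candidates = removable_files(tuples)
```

Only 16385 qualifies: 16384 still has a live row version, and 16386 belongs to a transaction that may yet commit.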
If we *do* want a strict no-leakage guarantee, then we'd have to update pg_class
before creating the file, and flush the WAL. If we take Alvaro's idea of storing
temporary relations in a separate directory, we could skip the flush for those,
because we can just clean out that directory after recovery. Having to flush
the WAL when creating non-temporary relations doesn't sound too bad - those
operations won't occur very often, I'd say.
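
The ordering argument can be checked with a tiny model. Everything here is an assumption made for illustration: the step names are invented, and the pg_class update is treated as durable only once the WAL has been flushed (the worst case). A file counts as leaked if it exists after a crash with no durable pg_class entry, unless it is a temporary relation whose directory is wiped after recovery.

```python
def creation_steps(is_temporary):
    """Ordered steps for creating a relation under the strict scheme."""
    steps = ["log_pg_class_update"]
    if not is_temporary:
        steps.append("flush_wal")   # skipped for temporary relations
    steps.append("create_file")
    return steps


def leaks_after_crash(is_temporary, steps_completed):
    """True if crashing after `steps_completed` steps leaves an untracked file."""
    done = creation_steps(is_temporary)[:steps_completed]
    catalog_durable = "flush_wal" in done
    file_exists = "create_file" in done
    if is_temporary:
        # The separate temp directory is cleaned out after recovery.
        return False
    return file_exists and not catalog_durable


any_leak = any(
    leaks_after_crash(temp, n)
    for temp in (False, True)
    for n in range(len(creation_steps(temp)) + 1)
)
```

In this model no crash point produces a leak, because the file is created only after the catalog entry is durable or in a directory that is cleaned anyway.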
greetings, Florian Pflug