Heikki Linnakangas wrote:
Couldn't we fix this by forcing a checkpoint before we commit the transaction
that created the new pg_class entry for the clustered table? Or rather, more
generally, before committing a transaction that created a new non-temporary
relfilenode but didn't WAL-log any subsequent inserts.
Unfortunately I don't see any easy way to fix it. One approach would be
to avoid reusing the relfilenodes until next checkpoint, but I don't see
any nice place to keep track of OIDs that have been dropped since last
checkpoint.
Ok, here's one idea:
Instead of deleting the file immediately on commit of DROP TABLE, the
file is truncated to release the space, but not unlink()ed, to avoid
reusing that relfilenode. The truncated file can be deleted after next
checkpoint.
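
To make the truncate-but-don't-unlink idea concrete, here is a minimal
sketch using plain POSIX calls. It is not PostgreSQL code: the helper names
drop_keep_placeholder(), remove_placeholder() and file_size() are made up
for illustration, and the real storage manager would go through its own
file-access layer rather than calling truncate() on a path directly.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for truncate() under strict -std modes */
#endif
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: release the disk space of a dropped relation's file
 * but leave the empty file in place, so the file name stays occupied.
 * Returns 0 on success, -1 on error (errno is set by truncate()).
 */
static int
drop_keep_placeholder(const char *path)
{
    assert(path != NULL);
    return truncate(path, 0);
}

/* Hypothetical helper: remove the empty placeholder once that is safe. */
static int
remove_placeholder(const char *path)
{
    assert(path != NULL);
    return unlink(path);
}

/* Returns the file size in bytes, or -1 if the file cannot be stat()ed. */
static long
file_size(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    return (long) st.st_size;
}
```

The point of the split is only that the space is given back at commit time,
while the name is given back later, in a separate step.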
Now, how does checkpoint know what to delete? We can use the fsync
request mechanism for that. When a file is truncated, a new kind of
fsync request, a "deletion request", is sent to the bgwriter, which
collects all such requests to a list. Before checkpoint calculates new
RedoRecPtr, the list is swapped with an empty one, and after writing the
new checkpoint record, all the files that were in the list are deleted.
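
The list handling described above could look roughly like the following
sketch. It is a standalone illustration, not bgwriter code: PendingList,
request_deletion(), swap_pending(), delete_snapshot() and the MAX_PENDING
cap are all hypothetical names, the list lives in ordinary process memory
rather than being fed by the fsync request queue, and files are removed
with the standard remove() call.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_PENDING 64          /* arbitrary cap for this sketch */

typedef struct PendingList
{
    char *paths[MAX_PENDING];
    int   count;
} PendingList;

/* Deletion requests collected since the last swap. */
static PendingList pending;

/* A file was truncated: remember it. Returns 0, or -1 if full / out of memory. */
static int
request_deletion(const char *path)
{
    char *copy;

    assert(path != NULL);
    if (pending.count >= MAX_PENDING)
        return -1;
    copy = malloc(strlen(path) + 1);
    if (copy == NULL)
        return -1;
    strcpy(copy, path);
    pending.paths[pending.count++] = copy;
    return 0;
}

/*
 * Checkpoint, step 1 (before the new redo pointer is calculated):
 * swap the list with an empty one and hand back the old contents.
 */
static PendingList
swap_pending(void)
{
    PendingList snapshot = pending;

    memset(&pending, 0, sizeof(pending));
    return snapshot;
}

/*
 * Checkpoint, step 2 (after the checkpoint record has been written):
 * delete every file in the snapshot. Returns how many were removed.
 */
static int
delete_snapshot(PendingList *snap)
{
    int removed = 0;

    for (int i = 0; i < snap->count; i++)
    {
        if (remove(snap->paths[i]) == 0)
            removed++;
        free(snap->paths[i]);
        snap->paths[i] = NULL;
    }
    snap->count = 0;
    return removed;
}
```

Requests that arrive between the swap and the deletion land in the fresh
list, so they are only acted on by the following checkpoint.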
We would leak empty files on crashes, but we leak files on crashes
anyway, so that shouldn't be an issue. This scheme wouldn't require
catalog changes, so it would be suitable for backpatching.
Any better ideas?
That's of course a rather sledgehammer-like approach to this problem - but at
least for the back branches the fix would be less intrusive...
regards, Florian Pflug