Andres Freund wrote:
> On 2014-06-20 17:38:16 -0400, Alvaro Herrera wrote:

> > It seems to me that we need to keep the offsets files around until a
> > checkpoint has written the "oldest" number to WAL.  In other words we
> > need additional state in shared memory: (a) what we currently store
> > which is the oldest number as computed by vacuum (not safe to delete,
> > but it's the number that the next checkpoint must write), and (b) the
> > oldest number that the last checkpoint wrote (the safe deletion point).
>
> Why not just WAL log truncations? If we'd emit the WAL record after
> determining the offsets page we should be safe I think? That seems like
> easier and more robust fix? And it's what e.g. the clog does.

Yes, I think this whole thing would be simpler if we just WAL-logged the
truncations, like pg_clog does.  But I would like to avoid doing that
for now, and leave it as a future change for 9.5 only.  As a fix that
can be backpatched to 9.4/9.3, I propose we do the following:

1. Have vacuum update MultiXactState->oldestMultiXactId based on the
minimum value of pg_database->datminmxid.  Since this value is saved in
pg_control, it is restored from checkpoint replay during recovery.

2. Keep track of a new value, MultiXactState->lastCheckpointedOldest.
This value is updated by CreateCheckPoint in a primary server after the
checkpoint record has been flushed, and by xlog_redo in a hot standby, to
be the MultiXactState->oldestMultiXactId value that was last flushed.

3. TruncateMultiXact() no longer receives a parameter.  Files are
removed based on MultiXactState->lastCheckpointedOldest instead.  

4. Call TruncateMultiXact() at checkpoint time, after the checkpoint WAL
record has been flushed, and at restartpoint time (just like today).
This means we only remove files that a prior checkpoint has already
registered as being no longer necessary.  Also, if a recovery is
interrupted before end of WAL (recovery target), the files are still
present.  So we no longer truncate during vacuum.
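
To make the interaction between the four steps concrete, here is a rough
standalone sketch of the state handling.  It is not a patch: ToyMultiXactState,
the toy_* functions and the truncatedUpTo field (standing in for the files on
disk) are all made up for illustration, MultiXactId wraparound is ignored, and
the order of "update lastCheckpointedOldest, then truncate" within a single
checkpoint is my assumption.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t MultiXactId;

/* Hypothetical stand-in for the shared-memory MultiXactState. */
typedef struct
{
    MultiXactId oldestMultiXactId;      /* (1) set by vacuum; not yet safe to delete */
    MultiXactId lastCheckpointedOldest; /* (2) value the last checkpoint flushed */
    MultiXactId truncatedUpTo;          /* models disk: files below this are gone */
} ToyMultiXactState;

/*
 * Step 1: vacuum only records the new minimum datminmxid.  It removes
 * nothing.  Plain comparison is used here; wraparound is ignored.
 */
void
toy_vacuum_advance(ToyMultiXactState *s, MultiXactId min_datminmxid)
{
    if (min_datminmxid > s->oldestMultiXactId)
        s->oldestMultiXactId = min_datminmxid;
}

/*
 * Step 3: truncation takes no parameter and removes files based only on
 * the value a checkpoint has already flushed.
 */
void
toy_truncate(ToyMultiXactState *s)
{
    assert(s->lastCheckpointedOldest <= s->oldestMultiXactId);
    if (s->lastCheckpointedOldest > s->truncatedUpTo)
        s->truncatedUpTo = s->lastCheckpointedOldest;
}

/*
 * Steps 2 and 4: once the checkpoint record has been flushed, remember
 * the value it carried, and only then truncate.
 */
void
toy_checkpoint(ToyMultiXactState *s)
{
    /* ... the checkpoint record is written and flushed before this point ... */
    s->lastCheckpointedOldest = s->oldestMultiXactId;
    toy_truncate(s);
}
```

With this arrangement, calling toy_vacuum_advance() followed directly by
toy_truncate() leaves truncatedUpTo unchanged; the files only go away once
toy_checkpoint() has run.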

Another consideration for (4) is that right now we're only invoking
multixact truncation in a primary when we're able to advance
pg_database.datminmxid (see vac_update_datfrozenxid).  The problem is
that after a crash and subsequent recovery, pg_database might be updated
without removing pg_multixact files; this would mean that the next
opportunity to remove files would be far in the future, when the minimum
datminmxid is advanced again.  One way to fix that would be to have
every single call to vac_update_datfrozenxid() attempt multixact
truncation, but that seems wasteful since I expect vacuuming is more
frequent than checkpointing.

Álvaro Herrera      
PostgreSQL Development, 24x7 Support, Training & Services
