PostgreSQL tables and indexes not infrequently become bloated, and
this bloat is difficult to reverse once it has occurred.  When a row
is updated, a new row version is created and the space occupied by the
old version can be eventually be recycled.  Since there is always a
short delay and sometimes a very long delay (hours) between the time
when the new row version is created and the time the old row version
can be removed, tables grow to accommodate storing both the old and
new row versions.  When the old versions are reclaimed, there is no
easy way to shrink the table, because the free space is spread through
the table's data blocks rather than concentrated at the end of the
file.  Moreover, since the space used by old row versions are only
reclaimed lazily -- either by HOT pruning or by VACUUM -- the table
may be further extended even space for the new row version could have
been made available by more timely cleanup.  Indexes have similar
issues: even though in theory there is more latitude to consolidate
free space, we don't always do so, and lazy reclamation of row
versions may cause relation extension.  One factor contributing to
index bloat is that any non-HOT update to a heap tuple requires an
update to every index even if the indexed columns are not updated.

How can we get ahead of this problem?  One approach is to make it
easier to reorganize the table; for example, a variant of CLUSTER that
doesn't block concurrent DML would be useful.  However, that's really
only a stopgap, because it only makes it easier to salvage a table
where internal free space has become badly fragmented. What we want to
do is prevent free space from becoming fragmented in the first place.
Most other relational database systems do this by placing old row
versions in a separate data structure from the table itself where they
can easily be removed when no longer needed.  A row version becomes
"old" when it is updated or deleted; the "new" version enters the main
heap while the "old" version moves elsewhere.

To implement this in PostgreSQL, I believe we would need support for
UNDO.  Most database systems have both REDO -- what PostgreSQL
typically calls the write-ahead log (WAL) -- and UNDO.  REDO is
responsible for restoring physical consistency after a database crash,
while UNDO is responsible for reversing the effects of aborted
transactions.  When a transaction performs an operation (DO), it also
writes it to the write-ahead log (REDO) and records the information
needed to reverse it (UNDO).  If the transaction aborts, UNDO is used
to reverse all changes made by the transaction.  Both the actions
performed (DO) and the UNDO created for those actions are protected by
REDO.  Thus, UNDO must be written to disk at checkpoint time; UNDO
since the most recent checkpoint prior to a crash is regenerated by
WAL replay (REDO).  So, after a crash, we first replay WAL (REDO) and
then reverse out any transactions implicitly aborted by the crash
(UNDO). This basic DO-UNDO-REDO protocol has been well-understood for
decades.  It would likely be worthwhile to implement DO-UNDO-REDO in
PostgreSQL independently of any design for reducing bloat, because it
would provide a systematic framework for handling cleanup actions that
must be performed at the end of recovery. For example, if a
transaction creates a table and, while that transaction is still in
progress, there is an operating system or PostgreSQL crash, the
storage used by the table is permanently leaked.  This could be fixed
by having the operation that creates the table also emit UNDO which
would unlink the backing storage in the event of an abort.

We could have a single shared undo log to which all transactions
write, but the insertion point would almost certainly become a point
of contention. Moreover, it's not very desirable to interleave UNDO
from different transactions, because it makes cleanup difficult.
Suppose one transaction aborts while another continues to run; if
their UNDO records are interleaved, it will be difficult to recover
any space if one set of UNDO records can be discarded but the other
must be kept.  Instead, it seems best to have a separate UNDO log for
each transaction.  Each of these UNDO logs individually will be
written sequentially and will consist of a series of variable-length
records.  When a particular UNDO log is no longer needed, we can
remove the whole thing at once.  Unless the UNDO log spilled to disk
because it was large or because of a checkpoint during the
transaction, this is just a matter of deallocating or recycling
whatever shared memory we're using to store it; otherwise, we'll need
to remove or recycle the file, or the portion of a file, that was used
to persist it on disk.

Once we have UNDO, we could create a new type of heap where we emit
UNDO for each INSERT, UPDATE, or DELETE.  For an INSERT, we will
insert the new write and emit UNDO which will remove it.  For a
DELETE, we will immediately remove the old row version but emit UNDO
which will put it back.  For an UPDATE, we can proceed in two ways,
depending on the situation.  First, we can handle an UPDATE much as if
it were a DELETE combined with an INSERT, removing the old version,
adding the new one, and emitting UNDO to reverse both changes.
Second, in some cases, we can instead perform an in-place update,
modifying the existing record in place and emitting UNDO to restore
the old version.  It will be desirable to use the second strategy as
often as possible, because an in-place update does not bloat the heap.

Of course, there a few problems to solve.  Since the only way to find
an old row version is to look at UNDO, we will need to retain UNDO for
as long as it might contain row versions that some overlapping
transaction needs to see.  This means that we must retain UNDO for (1)
all transactions which are in progress, (2) for aborted transactions
until such time as all of the UNDO actions have been performed, and
(3) for committed transactions until they are all-visible.  Note that
(1) and (2) would be required no matter what; UNDO is fundamentally a
system for tracking actions to performed during transaction abort.
(3) is only required because of our desire to keep old row versions
out of the heap. We could limit the length of time for which UNDO in
category (3) is kept in order to implement "snapshot too old" for this
type of heap.  (Or we could keep it for an extra-long time to
implement time travel.)

Our current heap format requires each tuple to carry a very wide
header: 24-byte header -- or wider if the null bitmap is large. 18 of
those bytes plus assorted flag bits are only there for MVCC purposes.
For static data, all of that space is wasted.  Moreover, we can easily
end up dirtying the same page multiple times to set hint bits and,
later, to freeze XMIN and XMAX, resulting in repeated page writes.
With UNDO, we can do better.  If all UNDO for a given page have been
removed, then all aborted transactions have been undone and all
committed transactions are all-visible; thus, every tuple on that page
is all-visible.  Therefore, the page itself does not need to contain
any per-tuple visibility information; it only needs to be possible to
find any UNDO pertaining to that page.  Since the UNDO will be removed
at exactly the same time that the visibility information is no longer
needed, so we can store any required visibility information in the
UNDO itself.

Just as a particular WAL record is referenced using a byte position, a
particular UNDO record can be referenced by uniquely identifying the
transaction and the byte position within that transaction's UNDO log.
To uniquely identify a particular transaction, it seems we will need 8
bytes: the 4-byte epoch and the 4-byte XID.  To uniquely identify a
position within that transaction's UNDO log, we will perhaps need
another 8 bytes.  Thus, in total, an UNDO pointer will be 16 bytes.
We could perhaps reduce the size to 12 bytes by stipulating that a
transaction can only write 4GB of UNDO using a particular XID; if it
needed to generate more UNDO than that, it would need to allocate 1
additional XID for every 4GB of UNDO generated.  Pages in this new
type of heap will have space to store an UNDO pointer (epoch + XID +
byte offset) in the page header.  UNDO pointers where the XID is 0, 1,
or 2 cannot refer to a real UNDO log, because those XIDs are reserved
by PostgreSQL. Accordingly, we can reserve the all-zeroes UNDO pointer
as an invalid UNDO pointer.  Also, let's reserve all UNDO pointers
where the XID is 1 as temporary page data (TPD) pointers.  When a page
has been recently modified by only one transaction, the page's UNDO
pointer points to the most recent UNDO record generated by that
transaction for that page.  There's no need to reset the pointer when
the UNDO log is removed; instead, we interpret a pointer to an UNDO
log which no longer exists as a sure sign that the page is all-frozen.
When a page has been recently modified by more than one transaction,
the page's UNDO pointer is a TPD pointer.

A TPD pointer is a 64-bit reference to a TPD record.  An UNDO pointer
will be either 12 or 16 bytes, so there's definitely room to store 64
additional bits there in addition to the XID of 1.  TPD records are
stored in the TPD log, which is managed much like the write-ahead log:
records are accessed by byte offset, the log begins at byte position 0
and grows upward forever, and the older portions of the log can
discarded once they are no longer needed. Unlike WAL, however, an
already-written TPD record can be modified in place. New TPD records
and changes to TPD records must be protected by WAL.  A TPD will
contain one UNDO pointer per recent transaction (in progress, aborted
but not yet cleaned up, committed but not yet all-visible), indicating
the most recent change to the page by that transaction. Therefore,
it's always possible to find all of the recent transactions which have
modified a page and the most recent change by each one.  Each UNDO
record for a particular page should contain a back-pointer to the
prior change to that same page by the same transaction, if any, so
once we've found the most recent change to the page by each recent
transaction we can use these back-pointers to find all of the other
ones as well.  And that lets us find any old tuple versions we need.

If a transaction modifies a page whose existing UNDO pointer is
invalid, or one whose existing UNDO pointer points into its own UNDO
log, it can just overwrite the existing UNDO pointer with a pointer to
the new change. Likewise, if the UNDO pointer is a TPD pointer and the
modifying transaction is already present in the TPD record, it can
just update the TPD record with the UNDO pointer for the new
modification.  Alternatively, if the UNDO pointer is a TPD pointer and
one of the transactions in the TPD is no longer recent, it can grab
that slot in the existing TPD for its own UNDO pointer.  However, if
the page points to some other transaction's UNDO log and that
transaction is still recent, or if the page points to a TPD all of
whose members are still recent, then the modifying transaction must
create a new and larger TPD for the page.  The old TPD can then be
marked as garbage and reclaimed. Otherwise, a TPD entry can be
reclaimed when every UNDO pointer it contains points to an UNDO log
which has already been discarded.  We can treat a page with a pointer
to a TPD entry which has been discarded just like a page with a
pointer to an UNDO log that no longer exists: the page is all-visible.

In-place updates are fairly obviously safe and possible when no index
columns have been modified and the new tuple is no larger than the old
one.  The indexes don't need to know anything about the change, and
the new tuple will definitely fit.  In other cases, a bit more thought
is needed.  If the new tuple is larger than the old one, it may still
be possible to do the update in place.  The simplest case is when
there happens to be unused space following the existing tuple, and the
new and larger tuple will fit into that space.  In that case, we do
not need to do anything special. Alternatively, it may be that there's
enough free space elsewhere on the page to accommodate the new tuple.
In that case, we could take a cleanup lock and reorganize the page.
However, if we do, we must be careful to leave enough space to permit
a successful UNDO.  For example, imagine a completely full page where
we try to shrink one tuple in one transaction and enlarge a different
tuple in some other transaction.  If we allow both updates to be in
place, we will have a serious problem if the shrinking transaction
aborts and the enlarging transaction commits.

If any indexed columns are changed, an in-place update might confuse
the index AM.  Index AMs like BRIN which do not store TIDs don't care,
but those that do will get confused, because there will now be entries
for multiple index values pointing to the same TID.  That could
possibly be handled by forbidding index-only scans and requiring
rechecks in all cases, but that would be quite inefficient.  Instead,
we would likely wish to adopt the solution used by other database
systems which work on principles similar to those described here:
whenever we update or delete a tuple, use the old tuple to re-find any
index entries that might no longer be valid, and apply a delete-mark
to them.  When scanning an index, perform rechecks for all
delete-marked index items.  Claudio's proposal to keep otherwise-equal
btreek ys sorted by TID is probably essential for delete-marking to
perform well.  Note that delete-marking isn't strictly required: we
could simply choose not to do in-place updates any time an indexed
column is modified.  However, that leaves a lot of performance on the
table, because HOT updates are already fairly efficient; we want to
also reduce the bloat associated with updates that DO touch indexed
columns, and the write amplification that comes from touching every

So, what are the pros and cons of this system?

- Performing updates in place where possible prevents bloat from being
created. It does not prevent all bloat: aside from any updates that
are not dead in place, a transaction that inserts many rows and aborts
or deletes many rows and commits will still leave the table larger
than the optimum size.  However, many workloads have more updates than
they do inserts or deletes, so this is probably a significant win.

- Old tuple versions are removed from the heap eagerly (by UNDO
cleanup at the end of a transaction) rather than lazily (by HOT
pruning and VACUUM).  This increases the chance of being able to reuse
internal free space rather than extending the relation.  Also, it's
probably better to incur the overhead of cleaning up after an abort
immediately rather than hours, days, weeks, or months later when
VACUUM kicks in; VACUUM behavior often surprises users.  Better yet,
in the case where most transactions commit, we really don't really
need to do any after-the-fact cleanup at all; the only thing that we
have to do in that case is throw away the UNDO logs for transactions
that have become all-visible, which is hopefully a cheap operation.

- Reading a page that has been recently modified gets significantly
more expensive; it is necessary to read the associated UNDO entries
and do a bunch of calculation that is significantly more complex than
what is required today. The problem is more severe when TPD entries
are required.  There are probably some steps that can be taken to
mitigate this problem, such as setting aside a bit per tuple to mean
"this tuple has not been recently modified", or caching additional
details in the TPD when one is used, but it's unlikely to be possible
to eliminate it completely.

- Most things that could cause a page to be rewritten multiple times
are eliminated.  Tuples no longer need to be frozen; instead, pages
are implicitly frozen by the removal of associated UNDO.  Similarly,
hint bits are gone.  The page-level all-visible bit needs some
thought; we'll still need a visibility map to support index-only
scans, but we probably don't want the page level bit any more, because
rewriting the whole heap just to set that one bit would be terrible.

- Delete-marking makes deletes more expensive, because they have to
delete-mark all of the index entries.  Updates are more expensive when
all of the index columns change, because now each index needs to be
updated in two ways (new index entries and delete-marking of old ones)
rather than one.  However, if some indexed columns are updated but
only a small minority of the indexes on the table include those
columns, updates are substantially cheaper, because new index entries
and delete-marking of old ones are required only for those indexes
where some column has been updated.  On the whole, it seems likely
that we'd come out ahead here on many workloads, because updates that
change only one or a few indexed columns are probably much more common
than updates which change them all.

- Delete-marking assumes that we’re able to re-find index tuples.  The
system today does not require such an assumption.

- Tuple headers become much smaller.  The page header becomes bigger,
but there's only one page header per page, and there are many tuple
headers per page, so on average we should come out ahead.

- Transaction abort can be lengthy if much UNDO is required.  This
problem can be mitigated by pushing the UNDO steps into the


Robert Haas
The Enterprise PostgreSQL Company

Sent via pgsql-hackers mailing list (
To make changes to your subscription:

Reply via email to