It seems possible to reduce overall WAL volume by roughly 25% on common
workloads by optimising the way UPDATE statements generate WAL. I've got
a practical proposal that looks like it can be completed in a few more
days work and fits very well with HOT UPDATEs.

This is likely to have beneficial performance effects for many
workloads, as well as having clear benefits for archive/backup by
reducing overall WAL volume.

A recent profile of the sources of WAL on a major OLTP benchmark:
- records is the count of WAL records of that type
- % records is the overall distribution by record type
- avg space is per record
- % space is the overall distribution by WAL volume

          tag           | records | % records | avg space | % space 
 HEAP_UPDATE            |  347822 |     25.91 |    298.72 |   53.30
 BTREE_INSERT_LEAF      |  586577 |     43.70 |     68.01 |   20.47
 HEAP_INSERT            |  184712 |     13.76 |    122.91 |   11.65
 BTREE_SPLIT_R          |    1435 |      0.11 |   7086.18 |    5.22
 HEAP_LOCK              |  159959 |     11.92 |     58.77 |    4.82
 HEAP_UPDATE+INIT       |   12157 |      0.91 |    414.92 |    2.59
 XACT_COMMIT            |   29531 |      2.20 |     40.00 |    0.61
 BTREE_DELETE           |    2806 |      0.21 |    379.45 |    0.55
 HEAP_DELETE            |   13276 |      0.99 |     46.16 |    0.31
 BTREE_SPLIT_L          |      75 |      0.01 |   7234.32 |    0.28
 HEAP_INSERT+INIT       |    2305 |      0.17 |    126.69 |    0.15
 BTREE_INSERT_UPPER     |    1497 |      0.11 |     73.96 |    0.06
 BTREE_SPLIT_L_ROOT     |       1 |      0.00 |   6882.00 |    0.00
 CLOG_ZEROPAGE          |       1 |      0.00 |     32.00 |    0.00
 XLOG_CHECKPOINT_ONLINE |       6 |      0.00 |     72.00 |    0.00
 BTREE_NEWROOT          |       1 |      0.00 |     76.00 |    0.00
 XACT_ABORT             |     155 |      0.01 |     40.00 |    0.00
(17 rows)
[Thanks to Greg]

UPDATEs clearly generate most of the WAL. In this benchmark and in many
other cases, around 10-25% of the data on a row changes at each UPDATE,
which would be sufficient to reduce the overall WAL volume by 30-40%.
Not all of that potential is practically realisable, see below.

[Note also that this profile was performed *before* Heikki's recent
optimisation of WAL generated during b-tree splits.]


Previously, we've discussed the possibility that WAL can be reduced when
performing UPDATEs. Currently, when we UPDATE a row we send the whole
new row to WAL, whether or not the columns have changed.

If we send only the changed data to WAL then we would need to be able to
reconstruct the new tuple during recovery. The changed data is referred
to as a "tuple diff" or just diff.

Reconstruction would read the old tuple and apply the diff to construct
a new tuple, which would then be applied to the data block. This is only
practical when the old and new tuple are on the same block, since
otherwise this would be likely to increase the I/O required to perform
recovery - which is already I/O bound. So this fits nicely with HOT and
builds upon the knowledge that with HOT most UPDATEs occur on the same

Reconstruction is complicated by the absence of a relcache that
accurately describes the TupleDesc at the exact point of the UPDATE. 
However, the tupledesc is available when we *create* the diff, so we can
take advantage of that to make sure we optimise for NULL <-> not NULL
transitions during UPDATE. So we use the tupledesc to construct a
tupledesc-agnostic format which is a set of byte-by-byte instructions
rather than being specified in terms of attributes or custom types.

The diff instructions allow us to express operations very simply.

CREATE TABLE foo (col1 integer, col2 integer, 
                col3 varchar(50), col4 varhcar(50));
INSERT INTO foo (1, 1, repeat('abc',50));

UPDATE foo SET col2 = 100 WHERE col1 = 1;

will generate diff instructions (assuming 4 byte alignment for now)

COPY    4       (bytes from old to new tuple)
IGNORE  4       (bytes on old tuple)
ADD     4       (bytes from new tuple)
COPY    156     (bytes from old to new tuple)

These diff instructions are based upon the concept of delta instructions
from the VCDIFF standard.
This is tailored to this specific implementation, so although we use
delta instructions we don't use the VCDIFF format since it caters for
longer data which in our case will already have been TOASTed away.

With a terse instruction set the diff format can encode the diff
instructions in a small number of bytes, considerably reducing the WAL
volume yet without requiring complex diffing algorithms. 

The simplicity of the diff algorithm is important because this
introduces additional CPU and potentially contention also, since the
diff is calculated while the block is locked. As a result, it is
proposed that the diff would only be calculated when the new tuple
length is in excess of a hard-coded limit, probably 128 bytes - a very
simple heuristic. That seems high, but the current WAL volume is
disproportionately generated by very wide tables and this would give a
balance between overhead of generating the diff and the overhead from
the WAL volume itself.

Discussion Points:
- diff starts at 128 bytes? - (no additional parameters suggested)
- diff still taken whether or not we do full_page_writes


Implement a new XLOG_HEAP2_UPDATE_DIFF xlog record and appropriate
handling, rather than attempting to augment the existing
XLOG_HEAP_UPDATE code since that also handles MOVE.

I'm expecting there to be about a 100 suggestions for how to optimise
the encoding, so please remember the balance between encoding
time/effect and the WAL compression ratio. The good thing about the use
of the diff format is that it neatly separates creating the diff from
decoding it.

Initially, my suggestion is to do column-by-column byte comparison to
establish which bytes have changed. Its possible to enhance that to do
sub-attribute diffs, but I'm not suggesting that in this initial

Alignment is not an issue for creating or using the diff.

When there is no BEFORE trigger, we can optimise that further by
examining only the columns mentioned in the SET clause. That would
further speed up the diff, but its fairly fiddly and I haven't spent
time yet looking into that possibility. (Suggestions welcome!).

No performance testing has been performed as yet since I haven't written
much of the code so far.

Diff Format

The diff algorithm requirements seem to be:
- accuracy (must be non-lossy)
- speed
- space saving
- cope well with PostgreSQL var length changing values and NULLs

Looking at VCDIFF format and thinking some, I need the following
primitives in the diff storage format:
- COPY - copy the next N bytes from old to new record
- ADD - add the Record byte at the current new record location
- IGNORE - ignore the next N bytes of the old record, nothing copied

The format is similar, but not exactly same as HRL encoded bitmap data
using a header and a body array.

A header has length, number of instructions and a set of indicator bits.
The indicator bit specifies whether the instruction is an ADD or not.

ADD     1 set in header bit

        uint8   num bytes to insert into new tuple
        data bytes

COPY    0 set in header bit

        uint8   lo 7 bits number of bytes to COPY from old to new
                hi bit = 1 to indicate COPY

IGNORE  0 set in header bit
        uint8   lo 7 bits number of bytes to IGNORE in old tuple
                hi bit = 0 to indicate IGNORE

In the above example shown earlier we would store only 4 instruction
bytes, plus 4 changed data bytes, plus 8 byte diff header = 16 bytes,
rather than the 156 bytes.

Note that we are diffing the tuple body, though storing the tuple header
in full. We can tweak that some more in the future, maybe. Note also hat
the format is very terse and the words ADD, COPY and IGNORE are
conceptual only.

Adjacent columns that have similar instructions can be represented by a
single instruction, i.e. 4 integer columns that will be COPYd to the new
tuple can be represented as COPY 16. As a result of that, I'm proposing
that the header has sufficient instruction bits for at most 32
instructions. That helps keep the format simple, but it also allows us
to introduce a further heuristic: skip calculating complex diffs because
they probably aren't going to save much space anyhow.

Very long columns may need multiple instructions to represent them, but
that wastes very little space and should not present a problem.

Comments, with a view to immediate implementation?

  Simon Riggs             

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
       choose an index scan if your joining column's datatypes do not

Reply via email to