This proposal introduces the idea of "lossy index tuples" (LITE),
which provides two important optimizations for btrees, and potentially
other index types, prompted by recent blog complaints.
The two optimizations are Clustered Indexes (mostly for INSERTs) and
Update Duplicate Removal (for non-HOT UPDATEs), discussed more later.
LITE can be used for either or both of those optimizations
independently, even though they are implemented in fairly similar
ways. So if one of those gets shot down, the other is still
potentially viable.
If this has wings, I'll start tracking this as a wiki page. If it
doesn't have wings, at least we tried to respond with technical
solutions to external input.
=Lossy Index Tuple Enhancement=
The LITE concept is that if the 13th index tuple bit is set the index
tuple represents a collection of tuples on one heap page rather than a
pointer to a single root heap tuple. The 13th bit is currently
unused, so this optimization is fully backwards compatible with all
existing indexes; we just start using them more efficiently over time.
Importantly, this means that the user doesn't need to do anything at
all to begin benefiting from LITE. It also means that much of the code
required is non-invasive to the index code. These benefits are much
better than inventing a new index type and leaving the user to figure
out how and when to use it.
We call the 13th bit the "lossy bit", borrowing the terminology from
bitmap index access: it indicates that we no longer hold exact details
of the heap tuples the pointer refers to, nor do we dereference it as
a single tuple pointer.
When the lossy bit is set we no longer interpret the TID as
(blocknumber, linepointer); instead we interpret the data as
(blocknumber, lite_flags). The lite_flags have one bit reserved to
decide whether
the LITE tuple represents multiple duplicates or whether it represents
a range of values.
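The reinterpretation described above can be sketched as follows. This
is an illustrative in-memory model only; the names, the struct layout,
and the bit positions are assumptions for the sketch, not PostgreSQL's
actual on-disk index tuple format.

```c
#include <stdint.h>

/* A normal TID is (blocknumber, linepointer); when the lossy bit is
 * set on the index tuple, the 16-bit offset field is reinterpreted
 * as lite_flags. */
typedef struct LiteTid
{
    uint32_t blocknumber;   /* heap block number, unchanged */
    uint16_t lite_flags;    /* replaces the linepointer when lossy */
} LiteTid;

#define LITE_RANGE_BIT   0x8000 /* 1 = range tuple, 0 = duplicate tuple */
#define LITE_RANGE_MASK  0x7FFF /* remaining 15 bits: linepointer ranges */

/* Does this LITE tuple represent a range of values (RLITE)
 * rather than a set of duplicates? */
static int
lite_is_range(const LiteTid *tid)
{
    return (tid->lite_flags & LITE_RANGE_BIT) != 0;
}

/* Extract the 15-bit linepointer-range coverage bitmap. */
static uint16_t
lite_range_bitmap(const LiteTid *tid)
{
    return tid->lite_flags & LITE_RANGE_MASK;
}
```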
The remainder of the lite_flags allow us to store details of which
ranges of linepointers are covered by the LITE tuple, splitting the
block into 15 ranges of linepointers. We just set the bit for the new
linepointer's range when we update the bitmask. (The ranges are
consecutive not striped, to ensure that heavily repetitive updates are
optimized). If anyone is worried about running out of index tuple
flags, then perhaps we can reserve a few of the 16 bits for additional
flags rather than use them all for lp ranges.
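A sketch of the bitmap maintenance just described: mapping a
linepointer to one of the 15 consecutive ranges and recording
coverage. The constant of ~16 linepointers per range is an assumption
for an 8192-byte block, and all names here are illustrative.

```c
#include <stdint.h>

#define LITE_NRANGES      15   /* 15 consecutive linepointer ranges */
#define LITE_LP_PER_RANGE 16   /* assumed width per range (8192-byte block) */

/* Set the bit covering this linepointer. Linepointer numbers are
 * 1-based; linepointers beyond the last range fall into it. */
static uint16_t
lite_cover_lp(uint16_t bitmap, uint16_t linepointer)
{
    int range = (linepointer - 1) / LITE_LP_PER_RANGE;

    if (range >= LITE_NRANGES)
        range = LITE_NRANGES - 1;
    return bitmap | (uint16_t) (1 << range);
}

/* Is this linepointer already covered by the bitmap? */
static int
lite_covers_lp(uint16_t bitmap, uint16_t linepointer)
{
    int range = (linepointer - 1) / LITE_LP_PER_RANGE;

    if (range >= LITE_NRANGES)
        range = LITE_NRANGES - 1;
    return (bitmap & (1 << range)) != 0;
}
```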
By splitting the heap block into 15 ranges we will avoid the need
for scanning the whole block in most cases, so the performance
degradation from single pointer to complete-block pointer should be
small.
This optimization would still allow UNIQUE indexes. In fact this
optimization would almost never overlap with a unique index anyway,
since we seldom update the PK of a table. Lossy index tuples would not
be able to skip accessing the heap during index-only scans, but the
presence of covered index columns would increase the uniqueness and
very likely prevent the index tuple from being marked lossy.
The remainder of the description is split between the two use
cases/optimizations, which are similar but would have different code
paths.
== Update Duplicate Removal ==
We want an optimization that reduces the effects of multiple UPDATEs
on the same block that insert duplicate values because some other
indexed column has been updated, generating a non-HOT index insert.
This optimizes the case where someone has 12 indexes yet updates a
column in just one of them, so we optimize away the 11 unnecessary
index updates.
As UPDATEs occur we request inserts into the index. If a lossy index
pointer already exists that covers the new linepointer then we skip
the index insert, optimizing writes to the index. If a lossy index
pointer already exists for that value but doesn't yet cover our
linepointer then we update the bitmask. If a non-lossy index pointer
already exists we set the lossy bit and update the bitmask to reflect
which linepointers the index entry covers. If no index pointer exists
we insert one. The first update on a row will not be optimized, but
the 2nd - 16th could be. These actions optimize away most of the
writes and WAL activity to the index data block since only 1 in every
16 updates would cause changes to the block (actually about 1 in 10 on
average). Most importantly, updates will cause far fewer index block
splits, and the need for reindexing will be reduced.
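The insert-time decisions above can be sketched as a small state
machine. This is a toy in-memory model of the proposal, not real
PostgreSQL code; IndexEntry, the helper names, and the 16-lp ranges
are all assumptions for illustration.

```c
#include <stdint.h>

#define LITE_NRANGES      15
#define LITE_LP_PER_RANGE 16   /* assumed width per range */

typedef enum { NO_ENTRY, PLAIN_ENTRY, LOSSY_ENTRY } EntryKind;

typedef struct
{
    EntryKind kind;
    uint16_t  lp;       /* linepointer, valid for PLAIN_ENTRY */
    uint16_t  bitmap;   /* range bitmap, valid for LOSSY_ENTRY */
} IndexEntry;

static int
range_of_lp(uint16_t lp)
{
    int range = (lp - 1) / LITE_LP_PER_RANGE;

    return range < LITE_NRANGES ? range : LITE_NRANGES - 1;
}

/* Handle an index insert for a value on the same heap block.
 * Returns 1 when the physical index write can be skipped entirely. */
static int
lite_index_insert(IndexEntry *e, uint16_t new_lp)
{
    switch (e->kind)
    {
        case NO_ENTRY:      /* first UPDATE: ordinary index insert */
            e->kind = PLAIN_ENTRY;
            e->lp = new_lp;
            return 0;
        case PLAIN_ENTRY:   /* second UPDATE: set lossy bit, build bitmap */
            e->kind = LOSSY_ENTRY;
            e->bitmap = (uint16_t) ((1 << range_of_lp(e->lp)) |
                                    (1 << range_of_lp(new_lp)));
            return 0;
        case LOSSY_ENTRY:
            if (e->bitmap & (1 << range_of_lp(new_lp)))
                return 1;   /* already covered: skip write and WAL */
            e->bitmap |= (uint16_t) (1 << range_of_lp(new_lp));
            return 0;
    }
    return 0;
}
```

Note how only the first write to each of the 15 ranges touches the
index block; every later update landing in a covered range is a no-op.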
The same situation can occur if we have many INSERTs with the same
values on the same block. This could lead to a reduction in the size
of the btree for indexes with many duplicate values.
== Clustered Index ==
We want an optimization that allows us to index a consecutive range of
values on one block. Currently if we have values 1-100 on one heap
block, we would have 100 index pointers all pointing to the one heap
block. LITE would allow that to be indexed as a single index tuple, in
the common case.
As an INSERT starts a new block we begin with a base value of N.
Frequently we have the case where the next value in the table is N+1
and is inserted into the same block (or could be arranged to do so).
After, say, 100 INSERTs we now have values 1-100 on the heap block.
As the first INSERT occurs we insert a normal index tuple pointing to
it. As we INSERT a new value we can mark the index tuple as being an
RLITE tuple (a range LITE tuple). Further inserts of a higher value
on this data block will be optimized away, as long as there is no
later index tuple. For example, with the index entries
val1 -> block0
val100 -> block1
inserting the value 85 would go to block0; inserting the value 105
would go to block1 without creating an additional index tuple.
Inserting the value -5 would create a new index tuple.
Index entries that split block ranges would be more complex and
costly, though they are only seen infrequently in real world apps.
e.g. we might have values 1, 90, 91 on block0 and values 100, 105 on
block1. If we insert the value 25 into block2 we would then have an
index that looks like this
val1 -> block0
val25 -> block2
val90 -> block0
val100 -> block1
First, we would clearly benefit here if we route the heap insert for
value 25 to occur on block0. That routing may not be acceptable, or
the block itself may be full, so we must handle the block split. If a
value is inserted into a range, then we must re-read the heap block to
see how to handle it.
i.e. the index tuple val1 represents values from val1 to val100 and
points to block0. If we insert val25 then we must re-read block0 and
edit the index entries around it by creating two new index tuples val1
(representing 1-24) and val90 (representing 90-99).
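The re-read step above amounts to finding where the covered range must
be split. A minimal sketch of that calculation, using the example's
numbers; the function name and the flat array standing in for the heap
block's values are assumptions for illustration.

```c
#include <stddef.h>

/* Given the values actually present on the old heap block (sorted
 * ascending) and a new value routed to some other block, return the
 * low key of the second half of the split: the smallest value on the
 * old block greater than newval. Returns -1 when no split is needed
 * because newval sorts above everything on the old block. */
static int
lite_split_value(const int *blockvals, size_t n, int newval)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (blockvals[i] > newval)
            return blockvals[i];
    return -1;
}
```

With values {1, 90, 91} on block0 and a new value 25 on block2, this
yields 90, matching the example: the old entry splits into val1
(representing 1-24) and val90 (representing 90-99).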
This looks doable, but more complex, so I'm posting this now to allow
for discussion and for people to spot the holes.
== IndexScan ==
Note that the executor code for IndexScan appears identical between
the two optimizations. The difference between duplicate and range LITE
tuples matters only at INSERT time (or when an UPDATE sets an indexed
column to a new value).
When we do an IndexScan if we see a LITE tuple we do a scan of the
linepointer ranges covered by this index tuple (which might eventually
go to a full block scan). For example, with 1 bit set we would scan
16 linepointers (on an 8192 byte block), and with 2 bits set we would
scan 32 linepointers, etc.; the ranges scanned are not necessarily
consecutive. So the
IndexScan can return potentially many heap rows per index entry, which
need to be re-checked and may also need to be sorted prior to
returning them. If no rows are returned we can kill the index pointer,
just as we do now for btrees, so they will be removed eventually from
the index without the need for VACUUM. A BitmapIndexScan that sees a
lossy pointer can also use the lossy TID concept, so we can have
partially lossy bitmaps.
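The expansion step of such a scan can be sketched as follows: turn a
LITE tuple's bitmap into the list of heap linepointers to visit. The
16-lps-per-range constant is an assumption for an 8192-byte block, and
the names are illustrative; every returned tuple must still be
rechecked against the scan key.

```c
#include <stdint.h>

#define LITE_NRANGES      15
#define LITE_LP_PER_RANGE 16   /* assumed width per range */

/* Expand a coverage bitmap into candidate linepointer numbers
 * (1-based), writing at most maxout of them; returns the count. */
static int
lite_scan_lps(uint16_t bitmap, uint16_t *out, int maxout)
{
    int n = 0;
    int range;

    for (range = 0; range < LITE_NRANGES; range++)
    {
        int i;

        if (!(bitmap & (1 << range)))
            continue;           /* range not covered: skip its lps */
        for (i = 0; i < LITE_LP_PER_RANGE && n < maxout; i++)
            out[n++] = (uint16_t) (range * LITE_LP_PER_RANGE + i + 1);
    }
    return n;
}
```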
== VACUUM ==
VACUUM would not cause a problem since line pointer removal for
non-lossy index tuples would occur normally. Lossy index tuples would
be ignored by index vacuum, but they would not prevent removal of line
pointers from the heap.
== Code Footprint ==
All of this code is concentrated in the Scan nodes, and in btinsert().
There are no changes to WAL records or replay. The details are all in
the way we insert and interpret index tuples.
== Other designs ==
This is related in a few ways to Grouped Item Tuple indexes, a patch
Heikki worked on in 2007, but is not that close.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Sent via pgsql-hackers mailing list (firstname.lastname@example.org)