On Mon, 2007-02-12 at 09:24 +0530, Pavan Deolasee wrote:
> On 2/12/07, Heikki Linnakangas <[EMAIL PROTECTED]> wrote:
> Hannu Krosing wrote:
> > Ühel kenal päeval, P, 2007-02-11 kell 12:35, kirjutas Tom
> >> Hannu Krosing <[EMAIL PROTECTED]> writes:
> >>> What if we would just reuse the root tuple directly
> instead of turning
> >>> it into a stub ?
> >>> This would create a cycle of ctid pointers, which changes
> the lookup
> >>> process from 'follow ctid chaint until the end' to 'follow
> the tid chain
> >>> until you reach the start'.
> >> How do you know which one is newest?
> > By xmin,cmin of course .
> >> What happens when you have to put a newer version off-page
> for lack of space?
> > Then this scheme won't work.
> Couldn't the ctid of the latest tuple point to the off-page
> tuple as usual?
> It could. But then you may lose reference for older version(s). We can
> the whole page scans to locate older versions, but that might prove
> It might be acceptable, if it was only stored on those tuples
> that are
> (HOT) updated. But it's not clear to me what you're proposing
> to do with
> the field, anyway, Which tuples would have it, and what would
> it point to?
> My guess what Hannu is suggesting is to have a circular chain of
> tuples in a page. The index can point to any tuple in the chain. When
> update goes off-page or is a COLD update, t_ctid points to the newer
> as usual. So a tuple which is COLD updated would need two pointers,
> which points to the oldest version in the circular on-page chain and
> other which
> points to the new version.
I very much like Hannu's idea, but it does present some issues.
My feeling is we need to get some clarity on whether to go this route,
or not, because it takes us away from the pointer-swinging idea. To
maximise our chances of a something good for 8.3 we really need to pick
just one idea now. Maybe we've already done that, by default.
I've tried to analyse all aspects of the discussion to see if we can get
something out of this:
Hannu has described the possibility of having the index point to a tuple
version which isn't the earliest one on the block. That raises the issue
of how will we locate earlier tuple versions when the current Snapshot
cannot see the root tuple?
First off, we only need to locate the earlier tuple when the root was
re-placed by a HOT update - we can tell that, so we're good so far.
Second, we would like to be able to validate the xmax == xmin rule. If
the original root tuple was dead, so will the prior versions that are
off-block, so nobody will ever follow the chain to the root tuple again
as part of EvalPlanQual. So we need to check newroot.xmax ==
Third we need to locate the appropriate tuple version.
We have a number of approaches:
a) Block Seq Scan: using a scan to get there is possible, but not good.
The best mechanism is to scan the block backwards (i.e. highest item to
lowest item) picking up the tuple which points to the root, then
scanning backwards picking up the parts of the chain as they are seen
until we get to a visible tuple that we can validate as part of the
chain. If we scanned forwards we'd need to remember the whole chain as
we went, which would be less useful. *But* this doesn't allow us to
validate the xmax == xmin rule, so that kills this sub-option, IMHO.
b) Additional tuple fields
We could have additional fields on the root tuple to help us locate the
"true root" or oldest version of this tuple on the block. Those
additional header fields are only required on the root tuple - no other
tuple needs this info. So we need not complain about space
savings/losses - the comparison is between extra tuple headers and a
TupleStub, which is a full header width (24 bytes). This would need to
include a copy of the xmin of the original root tuple, to allow us to
validate the xmax == xmin rule (4 bytes). It would also need to include
an in-page pointer to the true root (2 bytes). So additional tuple
fields of 6 bytes would be added to the root tuple, probably maxaligning
to 8 more bytes than normal - a saving of 16 bytes in comparison with
the TupleStub approach. Since there is no separate TupleStub, we can
replace the root when required, not just at full table VACUUM time. The
presence, or not, of those fields can be marked using various info bits.
The NULL bitmap is effectively an optional header field, so the idea
seems workable - or at least as workable as pointer swinging.
The in-page pointer is only ever followed as a result of an index scan.
In all those cases we don't record the xmin, so we don't check it - we
only need to store the xmin of the tuple pointed to by the in-page
pointer. There may be some off-block tuple versions that point at the
new root, but those links will never be followed because they are
by-definition dead rows.
I didn't start this analysis with any preconceived outcome, but ISTM now
that Hannu's idea can be made to work robustly, whilst avoiding the need
for pointer swinging, which thus allows retail VACUUMs, *and* saving
space on-block overall. What it requires is two optional fields on any
root tuple that has been replaced following a HOT update. If that is
acceptable, then we have a better design as a result of Hannu's
---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?