On Mon, Jul 17, 2017 at 1:24 PM, Alexander Korotkov
<a.korot...@postgrespro.ru> wrote:
> It probably depends on the particular storage (once we have pluggable
> storages). Some storages would have an additional level of indirection
> while others wouldn't.
Agreed. Like kill_prior_tuple, it's an optional capability, and where
implemented it's implemented in a fairly consistent way.

> But even if a unique index contains no true duplicates, it's still
> possible that a true delete happens. Then we still have to delete the
> tuple even from a unique index.

I think I agree. I've been looking over the ARIES paper [1] again today.
They say this:

"For index updates, in the interest of increasing concurrency, we do not
want to prevent the space released by one transaction from being consumed
by another before the commit of the first transaction."

You can literally reclaim space from an index tuple deletion *immediately*
with their design, which matters because you want to reclaim space as
early as possible, before a page split is needed. Obviously they
understood how important this is.

This might not work so well in an MVCC system, where there are no 2PL
predicate locks. You need to keep a "ghost record", even for non-unique
indexes, and the deletion can only happen when the xact commits. Remember,
the block number cannot be used to see whether there were changes against
the page, unlike in the heap, because you have to worry about page splits
and page merges/deletion. UNDO is entirely logical for indexes for this
reason. (This is why UNDO does not actually undo page splits, relation
extension, etc. Only REDO/WAL always works at the level of individual
pages in all cases. UNDO for MVCC is not as different from our design as I
once thought.)

The reason I want to at least start with unique indexes is that you need a
TID to give non-unique/secondary indexes unique keys (unique keys are
always needed if retail index tuple insertion is always supported). For
unique indexes, you really can do an update in the index (see my design
below for one example of how that can work), but I think you need
something more like a deletion followed by an insertion for non-unique
indexes, because there the physical/heap TID changed, and that's part of
the key, and the new entry might belong on a different page. You therefore
haven't really fixed the problem with secondary indexes sometimes needing
new index tuples even though user-visible attributes weren't updated.

You haven't fixed the secondary index problem unless, of course, all
secondary indexes have logical pointers to begin with, such as the PK
value. Then you only need to "insert and delete, not update" when the PK
value is updated, or when a secondary index needs a new index tuple with
user-visible attribute values distinct from the previous version's. And,
while your "version chain overflow indirection" structure is basically
something that lives outside the heap, it is still only needed for one
index, not all of them.

This new indirection structure is a really nice target for pruning,
because you can prune physical TIDs that no possible snapshot could use,
unlike in the heap, where EvalPlanQual() could make any heap tuple visible
to snapshots at or after the minimal snapshot horizon implied by
RecentGlobalXmin. And index scans on any index can prune for everyone. (A
sketch of what such pruning might look like follows below.)

You could also do "true index deletes", as you suggest, but you'd need
ghost records there too, and you'd need an asynchronous cleanup process to
do the cleanup once the deleting xact committed. I'm not sure that it's
worth doing that eagerly. It may or may not be better to hope for
kill_prior_tuple to do the job for us.
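To make that pruning idea concrete, here's a minimal sketch. Everything
named here (VersionChainSlot, VersionChain, version_chain_prune(), the
newest-first layout, and the simplifying assumption that recorded xmins
are known committed) is invented for illustration; only ItemPointerData,
TransactionIdPrecedes(), and RecentGlobalXmin are real:

#include "postgres.h"
#include "access/transam.h"		/* TransactionIdPrecedes() */
#include "storage/itemptr.h"	/* ItemPointerData */
#include "utils/snapmgr.h"		/* RecentGlobalXmin */

/* One version of a logical row (hypothetical structure) */
typedef struct VersionChainSlot
{
	TransactionId	vcs_xmin;	/* creating xact; assume known committed */
	ItemPointerData	vcs_htid;	/* physical heap TID of this version */
} VersionChainSlot;

/* Newest-first chain of versions for one logical row */
typedef struct VersionChain
{
	int				vc_nversions;
	VersionChainSlot vc_versions[FLEXIBLE_ARRAY_MEMBER];
} VersionChain;

/*
 * Truncate away versions that no possible snapshot can still see.  If a
 * version's successor was created by an xact that precedes
 * RecentGlobalXmin, every snapshot must see that successor (or something
 * newer), so the version and everything older are dead to everyone --
 * there is no EvalPlanQual()-style caveat here, and any index scan could
 * do this on behalf of all indexes at once.
 */
static void
version_chain_prune(VersionChain *chain)
{
	int			i;

	for (i = 1; i < chain->vc_nversions; i++)
	{
		if (TransactionIdPrecedes(chain->vc_versions[i - 1].vcs_xmin,
								  RecentGlobalXmin))
		{
			chain->vc_nversions = i;	/* drop slots i..end */
			break;
		}
	}
}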
Not sure where this leaves index-only scans on secondary indexes... "true
index deletes" might be justified by making index-only scans work more
often in general, especially for secondary indexes with logical pointers.

I'm starting to think that you were right all along about indirect indexes
needing to store PK values. Perhaps we should just bite the bullet... it's
not like places like the bufpage.c index routines actually know or care
about whether or not the index tuple has a TID, what a TID is, etc. They
care about stuff like the header values of index tuples, and the page
ItemId array, but TID is, as you put it, merely payload.

> It's possible to add the indirection layer "on demand". Thus, initially
> index tuples point directly to the heap tuple. If a tuple gets updated
> and doesn't fit on the page anymore, then it's moved to another place
> with a redirect in the old place. I think that if carefully designed,
> it's possible to guarantee there is at most one redirect.

This is actually what I was thinking. Here is a sketch:

When you start out, index tuples in nbtree are the same as today -- one
physical pointer (TID). But, on the first update to a PK index, they grow
a new pointer, which is not a physical/heap TID. It's a pointer to some
kind of indirection structure that manages version chains. You end up with
an index with almost exactly the same external interface as today, with
one difference: you tell nbtree whether something is an insert or an
update, at least for unique indexes. Of course, you need to have something
to update in the index if it's an update, and nbtree needs to be informed
what that is.

My first guess is that we should limit the number of TIDs to two in all
cases, and start with only one physical TID, because:

* The first TID can always be the latest version, which in practice is all
most snapshots care about.

* We want to sharply limit the worst-case page bloat, because otherwise
you have the same basic problem. Some queries might be a bit slower, but
it's worth it to be confident that bloat can only get so bad.

* Simpler "1/3 of a page" enforcement. We simply add
"sizeof(ItemPointerData)" to the calculation.

* Gray says that split points are sometimes better if they're the average
of the min and max keys on the page, rather than the point at which each
half gets the most even share of space. Big index tuples are basically bad
for this.

> But I still think that evading arbitrary payload for indexes is delaying
> the inevitable, if only we want pluggable storages and want them to
> reuse existing index AMs. So, for example, arbitrary payload together
> with the ability to update this payload allows us to make indexes
> separately versioned (have a separate garbage collection process more or
> less unrelated to the heap). Despite the overhead caused by MVCC
> attributes, I think such indexes could give significant advantages in
> various workloads.

Yeah. Technically you could have some indirection to keep under 6 bytes
when that isn't assured by the PK index tuple width, but it probably
wouldn't be worth it. TID is almost like just another attribute. The more
I look, the less I think that TID is some thing that a bunch of code makes
many assumptions about which we will never find all of. *Plenty* of TIDs
today do not point to the heap at all. For example, internal pages in
nbtree use TIDs that point to the level below.

You would break some code within indextuple.c, but that doesn't seem so
bad. IndexInfoFindDataOffset() already has to deal with variable-width
NULL bitmaps. Why not a variable-length pointer, too?
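For illustration, a minimal sketch of that, written as a function rather
than the existing macro. INDEX_INDIRECT_MASK and index_data_offset() are
invented names (0x2000 happens to be the t_info bit reserved for
index-AM-specific use, but nothing uses it this way today); the
INDEX_NULL_MASK handling mirrors what IndexInfoFindDataOffset() in
access/itup.h already does:

#include "postgres.h"
#include "access/itup.h"		/* IndexTupleData, INDEX_NULL_MASK */
#include "storage/itemptr.h"	/* ItemPointerData */

/* Hypothetical: AM-reserved t_info bit marks tuples that carry a second,
 * indirection pointer after the header */
#define INDEX_INDIRECT_MASK 0x2000

static Size
index_data_offset(unsigned short t_info)
{
	Size		off = sizeof(IndexTupleData);

	/* variable-width NULL bitmap, exactly as today */
	if (t_info & INDEX_NULL_MASK)
		off += sizeof(IndexAttributeBitMapData);

	/* the optional second pointer: the same trick one more time */
	if (t_info & INDEX_INDIRECT_MASK)
		off += sizeof(ItemPointerData);

	return MAXALIGN(off);
}

The "1/3 of a page" bullet above falls out the same way: the limit
calculation just always budgets for the extra sizeof(ItemPointerData).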
[1] https://pdfs.semanticscholar.org/39e3/d058a5987cb643e000bce555676d71be1c80.pdf

--
Peter Geoghegan