On Mon, Jul 17, 2017 at 1:24 PM, Alexander Korotkov
<a.korot...@postgrespro.ru> wrote:
> It probably depends on the particular storage (once we have pluggable
> storages). Some storages would have an additional level of indirection
> while others wouldn't.
Agreed. Like kill_prior_tuple, it's an optional capability, and where
implemented it's implemented in a fairly consistent way.

> But even if a unique index contains no true duplicates, it's still
> possible that a true delete happens. Then we still have to delete the
> tuple even from a unique index.

I think I agree. I've been looking over the ARIES paper [1] again today.
They say this:

"For index updates, in the interest of increasing concurrency, we do not
want to prevent the space released by one transaction from being consumed
by another before the commit of the first transaction."

You can literally reclaim space from an index tuple deletion *immediately*
with their design, which matters because you want to reclaim space as
early as possible, before a page split is needed. Obviously they
understood how important this is.

This might not work so well in an MVCC system, where there are no 2PL
predicate locks. You need to keep a "ghost record", even for non-unique
indexes, and the deletion can only happen when the xact commits. Remember,
the block number cannot be used to see whether there were changes against
the page, unlike in the heap, because you have to worry about page splits
and page merges/deletion. UNDO is entirely logical for indexes for this
reason. (This is why UNDO does not actually undo page splits, relation
extension, etc. Only REDO/WAL always works at the level of individual
pages in all cases. UNDO for MVCC is not as different from our design as I
once thought.)

The reason I want to at least start with unique indexes is that you need a
TID to give non-unique/secondary indexes unique keys (unique keys are
always needed if retail index tuple insertion is always supported). For
unique indexes, you really can do an update in the index (see my design
below for one example of how that can work), but I think you need
something more like a deletion followed by an insertion for non-unique
indexes, because there the physical/heap TID changed, and that's part of
the key, and the new entry might belong on a different page. You therefore
haven't really fixed the problem with secondary indexes sometimes needing
new index tuples even though user-visible attributes weren't updated.

You haven't fixed the secondary index problem unless, of course, all
secondary indexes have logical pointers to begin with, such as the PK
value. Then you only need to "insert and delete, not update" when the PK
value is updated, or when a secondary index needs a new index tuple with
user-visible attribute values distinct from the previous version's. And,
while your "version chain overflow indirection" structure is basically
something that lives outside the heap, it is still only needed for one
index, not all of them.

This new indirection structure is a really nice target for pruning,
because you can prune physical TIDs that no possible snapshot could use,
unlike in the heap, where EvalPlanQual() could make any heap tuple visible
to snapshots at or after the minimal snapshot horizon implied by
RecentGlobalXmin. And index scans on any index can prune for everyone. (A
sketch of what such pruning might look like follows below.)

You could also do "true index deletes", as you suggest, but you'd need
ghost records there too, and you'd need an asynchronous cleanup process to
do the cleanup once the deleting xact committed. I'm not sure that it's
worth doing that eagerly. It may or may not be better to hope for
kill_prior_tuple to do the job for us.
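To make that pruning idea concrete, here's a minimal sketch. Everything
named here (VersionChainSlot, VersionChain, version_chain_prune(), the
newest-first layout, and the simplifying assumption that recorded xmins
are known committed) is invented for illustration; only ItemPointerData,
TransactionIdPrecedes(), and RecentGlobalXmin are real:

#include "postgres.h"
#include "access/transam.h"		/* TransactionIdPrecedes() */
#include "storage/itemptr.h"	/* ItemPointerData */
#include "utils/snapmgr.h"		/* RecentGlobalXmin */

/* One version of a logical row (hypothetical structure) */
typedef struct VersionChainSlot
{
	TransactionId	vcs_xmin;	/* creating xact; assume known committed */
	ItemPointerData	vcs_htid;	/* physical heap TID of this version */
} VersionChainSlot;

/* Newest-first chain of versions for one logical row */
typedef struct VersionChain
{
	int				vc_nversions;
	VersionChainSlot vc_versions[FLEXIBLE_ARRAY_MEMBER];
} VersionChain;

/*
 * Truncate away versions that no possible snapshot can still see.  If a
 * version's successor was created by an xact that precedes
 * RecentGlobalXmin, every snapshot must see that successor (or something
 * newer), so the version and everything older are dead to everyone --
 * there is no EvalPlanQual()-style caveat here, and any index scan could
 * do this on behalf of all indexes at once.
 */
static void
version_chain_prune(VersionChain *chain)
{
	int			i;

	for (i = 1; i < chain->vc_nversions; i++)
	{
		if (TransactionIdPrecedes(chain->vc_versions[i - 1].vcs_xmin,
								  RecentGlobalXmin))
		{
			chain->vc_nversions = i;	/* drop slots i..end */
			break;
		}
	}
}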
Not sure where this leaves index-only scans on secondary indexes... "true
index deletes" might be justified by making index-only scans work more
often in general, especially for secondary indexes with logical pointers.

I'm starting to think that you were right all along about indirect indexes
needing to store PK values. Perhaps we should just bite the bullet... it's
not like places like the bufpage.c index routines actually know or care
about whether or not the index tuple has a TID, what a TID is, etc. They
care about stuff like the header values of index tuples, and the page
ItemId array, but TID is, as you put it, merely payload.

> It's possible to add the indirection layer "on demand". Thus, initially
> index tuples point directly to the heap tuple. If a tuple gets updated
> and doesn't fit on the page anymore, then it's moved to another place
> with a redirect in the old place. I think that if carefully designed,
> it's possible to guarantee there is at most one redirect.

This is actually what I was thinking. Here is a sketch:

When you start out, index tuples in nbtree are the same as today -- one
physical pointer (TID). But, on the first update to a PK index, they grow
a new pointer, which is not a physical/heap TID. It's a pointer to some
kind of indirection structure that manages version chains. You end up with
an index with almost exactly the same external interface as today, with
one difference: you tell nbtree whether something is an insert or an
update, at least for unique indexes. Of course, you need to have something
to update in the index if it's an update, and nbtree needs to be informed
what that is.

My first guess is that we should limit the number of TIDs to two in all
cases, and start with only one physical TID, because:

* The first TID can always be the latest version, which in practice is all
most snapshots care about.

* We want to sharply limit the worst-case page bloat, because otherwise
you have the same basic problem. Some queries might be a bit slower, but
it's worth it to be confident that bloat can only get so bad.

* Simpler "1/3 of a page" enforcement. We simply add
"sizeof(ItemPointerData)" to the calculation.

* Gray says that split points are sometimes better if they're the average
of the min and max keys on the page, rather than the point at which each
half gets the most even share of space. Big index tuples are basically bad
for this.

> But I still think that evading arbitrary payload for indexes is delaying
> the inevitable, if only we want pluggable storages and want them to
> reuse existing index AMs. So, for example, arbitrary payload together
> with the ability to update this payload allows us to make indexes
> separately versioned (have a separate garbage collection process more or
> less unrelated to the heap). Despite the overhead caused by MVCC
> attributes, I think such indexes could give significant advantages in
> various workloads.

Yeah. Technically you could have some indirection to keep under 6 bytes
when that isn't assured by the PK index tuple width, but it probably
wouldn't be worth it. TID is almost like just another attribute. The more
I look, the less I think that TID is some thing that a bunch of code makes
many assumptions about which we will never find all of. *Plenty* of TIDs
today do not point to the heap at all. For example, internal pages in
nbtree use TIDs that point to the level below.

You would break some code within indextuple.c, but that doesn't seem so
bad. IndexInfoFindDataOffset() already has to deal with variable-width
NULL bitmaps. Why not a variable-length pointer, too?
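For illustration, a minimal sketch of that, written as a function rather
than the existing macro. INDEX_INDIRECT_MASK and index_data_offset() are
invented names (0x2000 happens to be the t_info bit reserved for
index-AM-specific use, but nothing uses it this way today); the
INDEX_NULL_MASK handling mirrors what IndexInfoFindDataOffset() in
access/itup.h already does:

#include "postgres.h"
#include "access/itup.h"		/* IndexTupleData, INDEX_NULL_MASK */
#include "storage/itemptr.h"	/* ItemPointerData */

/* Hypothetical: AM-reserved t_info bit marks tuples that carry a second,
 * indirection pointer after the header */
#define INDEX_INDIRECT_MASK 0x2000

static Size
index_data_offset(unsigned short t_info)
{
	Size		off = sizeof(IndexTupleData);

	/* variable-width NULL bitmap, exactly as today */
	if (t_info & INDEX_NULL_MASK)
		off += sizeof(IndexAttributeBitMapData);

	/* the optional second pointer: the same trick one more time */
	if (t_info & INDEX_INDIRECT_MASK)
		off += sizeof(ItemPointerData);

	return MAXALIGN(off);
}

The "1/3 of a page" bullet above falls out the same way: the limit
calculation just always budgets for the extra sizeof(ItemPointerData).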
[1] https://pdfs.semanticscholar.org/39e3/d058a5987cb643e000bce555676d71be1c80.pdf

--
Peter Geoghegan