On Tue, Nov 29, 2016 at 12:21 AM, Amit Kapila <amit.kapil...@gmail.com> wrote: >> I think we need to avoid putting the visibility information in the >> index because that will make the index much bigger. > > I agree that there will be an increase in index size, but it shouldn't > be much if we have transaction information (and that too only active > transactions) at page level instead of at tuple level.
It depends on the workload. If you have something where the number of transactions per page is small, like a bulk load, that scheme may work out great. But if you have something like a pgbench workload with an extra index on the abalance column, you could have a different transaction for almost every tuple. Of course the question of how many of those will be recent is a good one; if it's not too many, maybe it's not that bad. > I think the > number of concurrent writes on the index will be lesser as compared to > the heap. There are a couple of benefits of having visibility > information in the index. > > a. Heap and index could be independently cleaned up and in most cases > by foreground transactions only. The work for vacuum will be quite > less as compared to now. I think this will help us in avoiding the > bloat to a great degree. > > b. Improved index-only scans, because now we don't need visibility map > of the heap to check tuple visibility. > > c. Some speed improvements for index scans can also be expected > because with this we don't need to perform heap fetch for invisible > index tuples. I see what you're going for, but I'm not sure it's worth it. I mean, say you just have one bit per index tuple. If it's set, the heap tuple is definitely all-visible; if it's clear, then the tuple may or may not be visible to your snapshot. You insert tuples with the bit clear, and clear the bit again when the tuple is deleted or the indexed column is updated. You set the bit when you follow the index pointer and discover that the tuple is all-visible, so that next time you don't need to follow it again. Of the advantages you mention above, this still allows (a), but less efficiently. It allows (b). It does not allow (c). So, it's not as good. But, it saves you the overhead of storing xmin and/or xmax for each tuple in the page. Even if you have an optimization to list the XIDs once per page and then refer to them from the tuples, you are going to need a byte in each tuple for the xmin-index and a byte for the xmax-index, or something like that, plus the space for the XID list itself. That's not a ton of space, maybe, but it's definitely more. It's also a significantly bigger change to the tuple format, I think. What I'm proposing could be done by repurposing the LP_* bits available in the line pointer. So I think the questions are (1) will the extra space consumption hurt us more than the benefit we will get? and (2) is it really worth the work to change the page format to that degree? As you can probably tell, I'm leaning toward "no". I think a bit per-tuple -- whether we call it delete-mark or definitely-all-visible -- should be enough to get most of the benefit that is available here for substantially less work and no extra space. > I think with proposed design there is a good chance that for many of > the index scans, there is a need for the extra hop to decide > visibility of index tuple. I think as of today, during index scan if > the corresponding heap page is not-all-visible, then we check the > corresponding heap tuple to decide about the visibility of index > tuple, whereas with proposed design, I think it first needs to check > heap page and then TPD, if there is more than one transaction > operating on page. There's a couple of possible designs here, but there is the possibility for extra hops in some cases. But there are things we can do to mitigate that. 1. If each tuple has a definitely-all-visible bit, you can check that first; if it's set, the tuple is definitely all-visible you don't need to do anything further. 2. If we still maintain a visibility map, you can check that next. If the bit is set for the target page, then the tuple is definitely all-visible and you don't need to do anything further. 3. Otherwise, you need to look at the heap page. But if the heap page's UNDO pointer is invalid, then the whole page is all-visible, so you're done. (You can also set the visibility map bit and/or the index tuple's definitely-all-visible bit to avoid having to recheck the heap page in the future.) 4. Each tuple will have a bit indicating whether it's been recently modified; that is, whether the transaction whose UNDO log is referenced by the page's UNDO pointer did anything to that tuple; or if the page points to a TPD entry, whether any of the transactions in the TPD modified that tuple. If that bit is not set for that particular tuple, it's all-visible and you're done. 5. Otherwise you have to look at the TPD entry (if any) and UNDO logs. If we put in the visibility information in the index as you are proposing, then we can always resolve the visibility of the index tuple at step 1; we just test xmin and xmax against the snapshot. For index-only scans, that's great, because we never need to consult the heap at all. For regular index scans, it's not nearly as great, because we're still going to need to consult the TPD and UNDO logs to determine which version of the tuple is visible to that snapshot. You can avoid looking at the heap in the case where no version of the tuple is visible to the scan, but that's probably mostly going to happen only in cases where the transaction is still in flight so in most cases the heap will be in shared_buffers anyway and you won't save very much. > Such a design will work, but I think this is more work to mark the > tuple as Dead as compared to current system. Anything that allows update-in-place requires touching the index entries when they are invalidated, right? That's the main cost. >>>> That's OK, but we're in real trouble if >>>> step-3 inserted an additional index tuple pointing to that chain >>>> rather than simply noting that one already exists. If it adds an >>>> additional one, then we'll visit two index tuples for that TID. Each >>>> time, the visibility information in UNDO will cause us to return the >>>> correct tuple, but we've erred by returning it twice (once per index >>>> entry) rather than just once. >>> >>> Why will scan traverse the UNDO chain if it starts after step-3 commit? >> >> Why wouldn't it? I think that if an index scan hits a page with an >> UNDO chain, it always need to traverse the whole thing. > > If you notice the scan has started after all the writes have > committed, so what is the guarantee that there will be a UNDO chain? > Yes, it could be there if there is some live transaction which started > before step, otherwise, I am not sure if we can guarantee that UNDO > chain will be present. The UNDO chain has to remain until all transactions that modified the page are all-visible, not just until the transactions are committed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (firstname.lastname@example.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers