On Mon, Jan 9, 2017 at 7:50 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> Okay, I see the point.  I think here the UNDO pointer can be marked
> invalid either during page-pruning or vacuum.
I explicitly want to avoid that, because it means re-dirtying the page.
The UNDO pointer becomes invalid by removing the UNDO log to which it
points; the page itself does not need to be dirtied for the UNDO
pointer to become invalid.

> Another point which I think is not discussed till now is how the
> locking will work.  As of now, we first lock the tuple and then the
> xid on which we have to wait.  In the new design, we don't know the
> xid which has updated the tuple, and I am not sure there is any easy
> way to find that information.

Yes, there is: you can get it from the UNDO pointer.  Each UNDO pointer
is of the form <epoch, XID, offset>.  If the page has a regular UNDO
pointer, then you can get the XID for any modified tuple right out of
the UNDO pointer itself.  If the page has a TPD pointer, the TPD
contains a series of UNDO pointers, and you can get all of the XIDs
that have touched the page by iterating through those UNDO pointers.
There is some work to do to make it cheap to find which XID pertains to
which updated tuple, but I think that problem can be solved by choosing
carefully what to store in the TPD.  In fact, I think that for
performance it's absolutely essential to solve that problem well.

> One idea could be that we have some fixed number of slots (I think we
> can make it variable as well, but for simplicity, let's consider it
> as fixed) in the page header which will store the offset to the
> transaction id inside a TPD entry of the page.  Consider a TPD entry
> of a page that contains four transactions: we will just store enough
> information in the heap page header to reach the transaction id for
> each of these four transactions.  I think each such page header slot
> could be three or four bits long, depending upon how many concurrent
> transactions we want to support on a page before a new transaction
> has to wait (I think in most workloads supporting eight simultaneous
> transactions on a page should be sufficient).
> Then we can have an additional byte (or less than a byte) in the
> tuple header to store lock info, which is nothing but an offset to
> the slot in the page header.  We might find some other locking
> technique as well, but I think keeping it the same as current has
> benefit.

Yes, something like this can be done.  You don't really need any new
page-level header data, because you can get the XIDs from the TPD entry
(or from the page itself if there's only one).  But you could expand
the single "is-modified" bit that I've proposed adding to each tuple to
multiple bits: 0 means not recently modified, 1 means modified by the
first or only transaction that has recently modified the page, 2 means
modified by the second transaction that has recently modified the page,
and so on.

What I was thinking about doing instead is storing an array in the TPD
containing the same information.  There would be one byte, or half a
byte, or whatever per TID, and it would contain the index of the XID in
the TPD that had most recently modified or locked that TID.  Your
solution might be better, though, at least for cases where the number
of transactions that have modified the page is small.

However, I'm not totally sure.  I think it's important to keep the
tuple headers VERY small: like 3 bytes, or 2 bytes, or maybe even
variable size but only 1 byte in common cases.  So I expect bit space
in those places to be fairly scarce and precious.  Now that might be
the wrong idea; maybe it's worth expanding that header in order to
speed things up.  On the other hand, having to read more database
pages in order to process the same amount of data is *really*
expensive, especially when you start talking about data sets that
probably don't fit in memory, like a 10 TB or 100 TB table.  If you've
got 100 tuples per page (~81 bytes per tuple), increasing the size of
each tuple by 1 byte causes that 100 TB table to grow by about 1.2 TB
(modulo padding effects): 100 extra bytes per 8 kB page is a bit over
1.2% of the table's size.
An extra byte of space (or even an extra ten bytes) doesn't matter much
for a table with a million tuples in it, because the whole table is
going to fit in memory either way; but when you have 10+ billion
tuples, those extra bytes start to matter a LOT.  And the problem is
not only, or even primarily, the space consumption: the issue is that
all of your queries run that much slower because of the extra I/O.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers