On Wed, Jan 11, 2017 at 5:18 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> Moving further, I have thought about consistent reads and different formats for storing undo, with the pros and cons of each format.
>
> Consistent Reads
> ----------------------------
> If the page is modified by any transaction that is not visible to the read transaction's snapshot, we have to reconstruct a copy of the page. Now, there could be multiple ways to achieve it:
>
> 1. Apply the undo for the transactions that are not visible to the read transaction and keep the copy of the page.
> 2. Apply the undo for all the transactions in the TPD entry of a page.
>
> I think we can't do the second, because it can create rows which are not consistent, due to applying the undo log for transactions which are visible to the read transaction. The first approach will work, but it might turn out to be inefficient if not implemented carefully, as for each read transaction we need to create separate copies of the page when a different set of transactions is not visible to different read transactions. Even for read transactions to which the same set of transactions is not visible, we need some way to recognize the version of the page that can be used, probably by trying all the versions of the page and seeing if any of the existing versions can serve the purpose. I think we also need some way to link the different versions of the page in shared buffers, so that before trying to create a new version of the page we can look at the linked versions.
>
> Yet another consideration in this area could be to perform consistent reads differently for index scans. For an index scan, we will receive a particular TID to read, so, in this case, we can try to reconstruct the old image of that particular row. Now, whereas this sounds efficient, we might need to reconstruct the old image of rows multiple times for different read transactions.
> The effect will be magnified when we have to perform this for multiple rows of a page. I think it might be better to always construct a complete copy of the page if one or more transactions that have changed the page are not visible to the read snapshot.
I was thinking about this whole area somewhat differently. We don't really need a complete image of the page. We just need to find the correct version of each tuple that is visible. So I would suggest that we consider avoiding page reconstruction altogether. A sequential scan iterates through each TID on the page and asks "is this tuple visible?". Right now the answer is either "yes" or "no", but maybe the answer could be "yes", "no", or "look at this other version in the UNDO segment instead". The advantage of this approach is that we don't end up copying tuples into a new "reconstructed" page. That copying will take CPU time, and the resulting page will consume cache space.

Now, the problem with this is that we want to avoid walking through the whole UNDO chain for the page multiple times. But maybe there are ways to avoid that -- for starters, if the recently-modified bit on the page isn't set, we don't need to walk through the UNDO chain at all. For a sequential scan, we could maintain a backend-local cache of information that gets updated as we walk the chain. Perhaps there's no way to make these ideas work and we're going to end up doing some form of page reconstruction, but I think it's worth some thought anyway.

> Undo Page Format
> -------------------------------
> I think we can organize undo in multiple ways:
>
> (a) One undo file per transaction - We can name each undo file as epoch+xid or epoch+xid.undo. These files can be kept in a pg_undo directory under PGDATA. In this approach, we have to store a latest_undo offset in the file header to perform the rollback operation. Also, this offset can be used to allocate the location for a new undo record. Apart from that, undo entries can be organized one after another.
> Each undo entry will have two back pointers: one to reach the previous undo location, which will be used for rollback, and a second one to maintain the prior undo chain for changes to a particular page, which will be used to reconstruct the copy of the page.

Sounds good so far.

> During recovery, we need to read each undo file and check the transaction status in 'clog'; if it is not committed, then apply all the undo starting from the latest_undo position.

I mostly agree, but let's make this a little more precise. During normal running, when a transaction aborts, we will perform all of the undo actions associated with that transaction, and we will generate WAL for those changes. For example, if an insert adds a tuple to a page, and the containing transaction aborts, then there will be an undo entry to remove the tuple. But removing that tuple has to be WAL-logged, just as adding it was. When all actions for the transaction have been undone, we will remove the undo log for that transaction, and the removal of the undo log will also be a WAL-logged event. Therefore, *during* recovery, we don't need to perform any undo actions, because by replaying the existing WAL, we are also replaying any undo actions which were done before the crash (or on the master). However, at the end of recovery, we might have some leftover undo logs. Those represent transactions that were in progress at the moment of the crash (or at the moment the standby was promoted), or perhaps transactions which had already been aborted but for which undo was not yet complete (if undo had been complete, the log itself would have been removed). The UNDO for those transactions must still be performed. So the protocol is: first redo, then undo. Jim Gray discusses this in his classic book: https://www.amazon.com/dp/1558601902

> I think we should organize undo's in the form of pages with a size similar to heap pages.
> This will be helpful both for writing WAL and doing checkpoints for undo, where we expect pages. The access to UNDO can be via shared buffers.

Sounds good.

> I think we might want to keep the access to shared buffers containing undo separate from heap/index pages so that they can't be evicted based on access of heap pages.

I don't see a reason to do that. I think one of the advantages of storing the UNDO pages in the same pool as shared_buffers is that the amount of UNDO that we store in there can float up and down according to demand. If the heap pages are being accessed more frequently than the UNDO pages, then let the UNDO pages get evicted. If the reverse is the case, that's OK too.

> Different TPD entries containing the same XID will point to different locations in undo. Undo location will be just a page offset for a particular page entry.

I think an undo location should be a byte offset from the beginning of UNDO for that transaction. Which page that's on can be figured out by dividing by BLCKSZ.

> Vacuum can delete or make the undo file reusable when the corresponding transaction precedes RecentGlobalXmin.

If the associated transaction commits, yes. If it aborts, then we can recycle it once undo is complete. But I think this should be the job of a dedicated process which also executes undo actions; I don't think it should be associated with vacuum.

> To decide which files can be removed/reused, Vacuum either needs to traverse the TPD entries or needs to find them by traversing the files in the pg_undo directory. Here, along with the removal of the undo file, we also need some way to indicate that the TPD entry/entries for that undo are invalid. A simple way could be that if the corresponding undo file doesn't exist, then we can consider it invalid. However, that might not work if we choose to store undo of multiple transactions in one undo file.

I think we want something close to what you describe as "a simple way".
For a TPD entry that references an undo log, the pointer is invalid if that undo log no longer exists. But let's not confuse the files that are used to store the undo logs with the logs themselves. Logically, an undo log comes into existence the first time we perform an operation that generates undo, and ceases to exist when we explicitly destroy it (either after replaying all the entries in the case of an abort, or after it becomes all-visible in the case of a commit). So we should be able to have data in shared memory at all times (including during recovery) that says which undo logs exist as of that moment in time -- and, if appropriate, which files on disk contain data from those undo logs. In a lot of cases, the UNDO data will hopefully be stored only in shared_buffers and not in ANY disk file at all, because a short transaction should be able to create an UNDO log, write some data to it, commit, become all-visible, and have the UNDO log disappear before a checkpoint happens. In such a case, we never need to talk to the file system at all.

> The main advantage of this approach is that it will allow removing the *not* required undo space efficiently. The disadvantage is that creating different files for each transaction can be costly, and it can lead to a lot of undo files when there are some long-running transactions. Of course, we can optimize by allowing removal of the files in such cases and giving the user the error "snapshot_too_old", but I think it can still be a matter of concern. To alleviate this concern, we can try to keep multiple transactions' undo data in one file.

Right. I think the trade-off here changes as the amount of undo for a single transaction increases. If a single transaction generates 50GB of undo, we don't want that undo to be mixed with undo from other transactions, because otherwise we might end up with a form of bloat in our undo space. That transaction, because it's big, should get its own file (or series of files).
But if you have a lot of transactions which are each generating only a few kB of undo, it's likely better to keep all of that undo in a single file. Hopefully most of the time we will never write that undo to the filesystem at all, but at checkpoint time we will need to do so. If we're running a pgbench workload with 200 concurrent transactions and a checkpoint hits, we do not want to have to create, write, and fsync several hundred files. We want to write all of that undo to a single file or a smaller number of larger files.

> (b) Multiple transactions in one undo file - Here we have to choose some arbitrary names for undo files, let's say undo_1, undo_2, etc., for the sake of discussion. These files can be kept in a pg_undo directory under PGDATA. In this approach, we can choose to select a set of four or eight 8KB pages (pages of the same size as heap pages) per transaction; we can call each such set of pages one undo segment. If one segment is not sufficient to store the undo for one transaction, then we can allocate the next segment. Each segment can have a segment header in its first page, which will contain the previous segment's address. For the very first segment, the previous segment address will be Nil; the second segment will contain the address of the first segment, the third will contain the address of the second, and so on. The previous segment address is needed to replay undo for rollback. I have explicitly avoided storing the next segment address, as we don't need it.

You need some efficient way of looking up an undo pointer and finding the corresponding record. For example, suppose you have an undo pointer that says <epoch 0, XID 123456, byte-offset 78901>. (You might get such a pointer from a heap page, or from a TPD entry, or from a back-pointer in some other UNDO entry.) You now need to find page 9 of the UNDO log for that transaction (because 78901/8192 = 9).
If that's still in shared_buffers, great; but if it's been dumped out to the filesystem, you need a very quick way of finding it and getting it back into shared_buffers.

> We also need some form of freespace map to keep track of undo segments that are freed. Once vacuum has decided to free the undo of a particular transaction, it can add all the undo segments for that particular transaction to the freespace map. For allocating a new segment for an existing transaction or a new transaction, we need to first search the freespace map for an available segment; if none is available, then allocate a new segment. In the file, the first segment will contain slots to hold the latest_undo location and XID (probably epoch + XID) for each of the transactions which have undo in the file. We need the latest_undo location to roll back the transaction, and the XID is required to roll back the transaction at the time of recovery. Let's call the first segment the HEAD segment. We can extend this HEAD segment if there is more space in the file, but for simplicity and the sake of discussion, let's say one undo file can hold the transactions that can get a slot in the HEAD segment. Each slot can be 6 or 8 bytes long. Considering each segment of size 32KB, we can hold the undo of ~4096 transactions in one file. We can keep some book-keeping information to examine whether there is an empty slot in the first segment of the file. So whenever a new transaction has to allocate undo, it checks the HEAD segment of undo_1; if there is space to allocate a new slot, it allocates it, else it tries the next file, undo_2. We do need to consider what to do when the undo file chosen to hold a transaction's undo is full and the transaction still has to write more data.
> There are a couple of options: we can first try to free the space of any existing transaction which is already committed, and if there is no more freespace available, we can either choose to wait or allocate an additional XID to write UNDO, as proposed in the initial e-mail in this thread. Like the previous approach, undo can be accessed via shared buffers. Similarly, each undo entry will have two back pointers, as described in the previous approach. There are some loose points, like how the freespace map will be organized, which need more thought, but I think the core of the idea can be made to work.

Yeah, I think the core of the idea is OK, but (1) I still think vacuum is irrelevant; it is the undo processes that matter here, and (2) I still think that you're assuming too much connection between the logical concept of undo and where that undo is physically stored in files. I think the logical undo pointer should just be <epoch, XID, byte-offset>, and where that gets stored is a detail. So there should be no need to allocate a new XID when the file fills up; that's just a bookkeeping detail of how byte-offsets get mapped to physical file positions.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (firstname.lastname@example.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers