On Thu, Jan 12, 2017 at 8:56 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Wed, Jan 11, 2017 at 5:18 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>> Moving further, I have thought about consistent reads and different
>> formats for storing undo with pros and cons of each format.
>> Consistent Reads
>> If the page is modified by any transaction that is not visible to read
>> transaction's snapshot, we have to reconstruct a copy of the page.
>> Now there could be multiple ways to achieve it:
>> 1. Apply the undo for the transactions that are not visible to read
>> transaction and keep the copy of the page.
>> 2. Apply the undo for all the transactions in TPD entry of a page.
>> I think we can't do second because this can create rows which are not
>> consistent due to apply of undo log for transactions which are visible
>> to read transaction. Now, where the first approach will work, but it
>> might turn out to be inefficient if not implemented carefully, as for
>> each read transaction we need to create separate copies of the page
>> where a different set of transactions are not visible to different
>> read transactions. Even for read transactions to which the same set of
>> transactions are not visible, we need some way to recognize the
>> version of the page that can be used, probably try with all the
>> version of pages and see if any of the existing versions can serve the
>> purpose. I think we also need some way to link the different versions
>> of the page in shared buffers so that before trying to create a new
>> version of the page we can look at linked versions.
>> Yet another consideration in this area could be to perform consistent
>> reads differently for index scans. For index scan, we will receive a
>> particular TID to read, so, in this case, we can try to reconstruct
>> the old image of that particular row. Now, whereas this sounds
>> efficient, but we might need to reconstruct the old image of rows
>> multiple times for different read transactions. The effect will be
>> magnificent when we have to perform this for multiple rows of a page.
>> I think it might be better to always construct a complete copy of page
>> if one or more transactions that have changed the page are not visible
>> to read snapshot.
> I was thinking about this whole area somewhat differently. We don't
> really need a complete imagine of the page. We just need to find the
> correct version of each tuple that is visible. So I would suggest
> that we consider avoiding page reconstruction altogether. A
> sequential scan iterates through each TID on the page and says "is
> this tuple visible?". Right now the answer is either "yes" or "no",
> but maybe the answer could be "yes", "no", or "look at this other
> version in the UNDO segment instead". The advantage of this approach
> is that we don't end up copying tuples into a new "reconstructed"
> page. That copying will take CPU time and the resulting page will
> consume cache space.
> Now, the problem with this is that we want to avoid walking through
> the whole UNDO chain for the page multiple times. But maybe there are
> ways to avoid that -- for starters, if the recently-modified bit on
> the page isn't set, we don't need to walk through the UNDO chain at
> all. For a sequential scan, we could maintain a backend-local cache
> of information that gets updated as we walk the chain. Perhaps
> there's no way to make these ideas work and we're going to end up
> doing some form of page reconstruction, but I think it's worth some
> thought anyway.
Sure, we can do that way and I agree that it is worth considering.
Few cases where it can be really costly is when undo chain overflows
to multiple pages and those pages doesn't exist in cache. I think the
reason why it is probable is that undo chain is not maintained for row
update, rather it is maintained for a page level changes for each
transaction. So, what this means is that if we have to traverse the
chain for record X, then I think it is quite possible that during
chain traversal, we will encounter undo of record Y, ofcourse we can
identify and ignore it, but certainly the chain can be much longer.
Another kind of cost could be construction of actual record from undo
record, for example if we consider to undo log only changed parts of
Update as we do for WAL, then we have to incur some record
construction cost as well. I think what can make it really costly is
repeating this work. Even, if we consider the above as a worse case,
I am not sure whether it will be cheap for short transactions like
pgbench as there could be cases where concurrent updates/reads needs
to traverse undo chains.
Having said that, if you don't feel strongly about caching the
information in such a way that it can be reused by different sessions,
then we can initially implement the system as outlined by you, then we
can extend it based on the performance characteristics.
> I mostly agree, but let's make this a little more precise. During
> normal running, when a transaction aborts, we will perform all of the
> undo actions associated with that transaction, and we will generate
> WAL for those changes. For example, if an insert adds a tuple to a
> page, and the containing transaction aborts, then there will be an
> undo entry to remove the tuple. But removing that tuple has to be
> WAL-logged, just as adding it was. When all actions for the
> transaction have been undone, we will remove the undo log for that
> transaction and the removal of the undo log will also be a WAL-logged
> event. Therefore, *during* recovery, we don't need to perform any
> undo actions, because by replaying the existing WAL, we are also
> replaying any undo actions which were done before the crash (or on the
> master). However, at the end of recovery, we might have some left
> over undo logs. Those represent transactions that were in progress at
> the moment of the crash (or at the moment the standby was promoted) or
> perhaps transactions which had already been aborted but for which undo
> was not yet complete (if undo had been complete, the log itself would
> have been removed). And the UNDO for those transactions must still be
> performed. So the protocol is: first redo, then undo. Jim Gray
> discusses this in his classic book:
>> I think we should
>> organize undo's in form of pages with a size similar to heap pages.
>> This will be helpful both for writing WAL and doing checkpoints for
>> undo where we expect pages. The access to UNDO can be via shared
> Sounds good.
>> I think we might want to keep the access to shared buffers
>> containing undo separate from heap/index pages so that they can't be
>> evicted based on access of heap pages.
> I don't see a reason to do that. I think one of the advantages of
> storing the UNDO pages in the same pool as shared_buffers is that the
> amount of UNDO that we store in there can float up and down according
> to demand. If the heap pages are being accessed more frequently than
> the UNDO pages then let the UNDO pages get evicted. If the reverse is
> the case, that's OK too.
Okay, not a problem. I was thinking to priortise UNDO buffers, so that
they can only be evicted during checkpoint. However, I think there
might not be much value in doing so.
>> Vacuum can delete or make the undo file reusable when the
>> corresponding transaction precedes RecentGlobalXmin.
> If the associated transaction commits, yes. If it aborts, then we can
> recycle it once undo is complete. But I think this should be the job
> of a dedicated process which also executes undo actions; I don't think
> it should be associated with vacuum.
Agreed that it will be good to keep a separate undo process. However,
I am not exactly clear about your point of executing undo actions in
background. Do you mean to say that the main backend will mark
transaction as aborted in clog, but undo actions will be done in
backend or you have something else in mind?
>> To decide which files can be removed/reused, Vacuum either needs to
>> traverse the TPD entries or need to find by traversing the files in
>> pg_undo directory. Here, along with the removal of undo file, we also
>> need some way to indicate that TPD entry/entries for that undo are
>> invalid. A simple way could be that if the corresponding undo file
>> doesn't exist, then we can consider it as invalid. However, that
>> might not work if we choose to store undo of multiple transactions in
>> one undo file.
> I think we want something close to what you describe as "a simple
> way". For a TPD entry that references an undo log, then the pointer
> is invalid if that undo log no longer exists. But let's not confuse
> the files that are used to store the undo logs with the logs
> themselves. Logically, an undo log comes into existence the first
> time we perform an operation that generates undo, and ceases to exist
> when we explicitly destroy it (either after replaying all the entries
> in the case of an abort, or after it becomes all-visible in the case
> of a commit). So we should be able to have data in shared memory at
> all times (including during recovery) that says which undo logs exist
> as of that moment in time -- and if appropriate, which files on disk
> contain data from those undo logs. In a lot of cases, the UNDO data
> will hopefully be stored only in shared_buffers and not in ANY disk
> file at all, because a short transaction should be able to create an
> UNDO log and write some data to it and commit and become all-visible
> and have the UNDO log disappear before a checkpoint happens. In such
> a case, we never need to talk to the file system at all.
Okay, it seems we can deduce it from trasnction status. If the
transaction is aborted, then we know undo log is invalid. If it is
in-progress, then there will be a valid undo log. If it is committed,
and all-visble (precedes RecentGlobalXmin), then the undo log will be
>> The main advantage of this approach is that this will allow removing
>> the *not* required undo space efficiently. The disadvantage is that
>> creating different files for each transaction can be costly and it can
>> lead to lot of undo files when there are some long running
>> transactions, of course, we can optimize by allowing to remove the
>> files in such cases and give user error "snapshot_too_old", but I
>> think still it can be a matter of concern. To alleviate this concern,
>> we can try to keep multiple transactions undo data in one file.
> Right. I think the trade-off here changes as the amount of undo for a
> single transaction increases. If a single transaction generates 50GB
> of undo, we don't want that undo to be mixed with undo from other
> transactions, because otherwise we might end up with a form of bloat
> in our undo space. That transaction, because it's big, should get its
> own file (or series of files). But if you have a lot of transactions
> which are each generating only a few kB of undo, it's likely better to
> keep all of that undo in a single file. Hopefully most of the time we
> will never write that undo to the filesystem at all, but at checkpoint
> time we will need to do so. If we're running a pgbench workload with
> 200 concurrent transactions and a checkpoint hits, we do not want to
> have to create, write, and fsync several hundred files.
By create, it seems you are thinking about having undo files as some
in-memory thing (space reserved in memory) and if required, then only
create the file? If so, then the point we have discussed above for
giving equal priority to undo pages in shared_buffers as heap pages is
relevant, because backend evicting undo page could be costly as
compare to heap page.
>> (b) Multiple transactions in one undo file - Here we have to choose
>> some arbitrary name for undo files, let's say undo_1, undo_2, etc. for
>> the sake of discussion. These files can be kept in pg_undo directory
>> under PGDATA. In this approach, we can choose to select set of four
>> or eight 8KB pages (page of the same size as heap page) per
>> transaction, we can call each such set of pages as one undo segment.
>> If one segment is not sufficient to store undo for one transaction,
>> then we can allocate next segment. Each segment can have segment
>> header in the first page which will contain previous segment address.
>> For the very first segment, previous segment address will be Nil, the
>> second segment will contain the address of the first segment, third,
>> will contain the address of second and so on. Previous segment
>> address is needed to replay undo for rollback. I have explicitly
>> avoided storing the next segment address as we don't need it.
> You need some efficient way of looking up an undo pointer and finding
> the corresponding record. For example, suppose you have an undo
> pointer that says <epoch 0, XID 123456, byte-offset 78901>. (You
> might get such a pointer from a heap page, or from a TPD entry, or
> from a back-pointer in some other UNDO entry.) You now need to find
> the page 9 of the UNDO log for that transaction (because 78901/8192 =
> 9). If that's still in shared_buffers, great; but if it's been dumped
> to the filesystem, you need a very quick way of finding it and getting
> it back into shared_buffers.
Won't it be efficient if we can do the same mapping between undo
byte-offset and undo file as we do for WAL ((WALWrite offset to file).
We can keep file size bigger than what we have for each WAL Segment?
Do we need something more?
Do we need epoch,xid in back-pointer for undo or only byte-offset will do?
Sent via pgsql-hackers mailing list (email@example.com)
To make changes to your subscription: