Hi,

On 2019-08-20 17:11:38 +0530, Dilip Kumar wrote:
> On Wed, Aug 14, 2019 at 10:35 PM Andres Freund <and...@anarazel.de> wrote:
> > On 2019-08-14 14:48:07 +0530, Dilip Kumar wrote:
> > > On Wed, Aug 14, 2019 at 12:27 PM Andres Freund <and...@anarazel.de> wrote:
> >
> > I don't think we can normally pin the undo buffers properly at that
> > stage.  Without knowing the correct contents of the table page - which
> > we can't know without holding some form of lock preventing
> > modifications - we can't know how big our undo records are going to
> > be.  And we can't just have buffers that don't exist on disk in shared
> > memory, and we don't want to allocate undo that we then don't need.
> > So I think what we'd have to do at that stage is to "pre-allocate"
> > buffers for the maximum amount of UNDO needed, but mark the associated
> > bufferdesc as not yet valid.  These buffers would have a pincount > 0,
> > but BM_TAG_VALID would not be set.
> >
> > So at the start of a function that will need to insert undo we'd need
> > to pre-reserve the maximum number of buffers we could potentially
> > need.  That reservation stage would
> >
> > a) pin the page with the current end of the undo
> > b) if needed, pin the page of older undo that we need to update (e.g.
> >    to update the next pointer)
> > c) perform a clock sweep etc. to acquire (find or create) enough clean
> >    buffers to hold the maximum amount of undo needed.  These buffers
> >    would be marked as !BM_TAG_VALID | BUF_REFCOUNT_ONE.
> >
> > I assume that we'd make a) cheap by keeping the page pinned for undo
> > logs that a backend is actively attached to.  b) should only be needed
> > once in a transaction, so it's not too bad.  For c) we'd probably need
> > to amortize across multiple undo insertions, by keeping the unused
> > buffers pinned until the end of the transaction.
> >
> > I assume that having the infrastructure for c) might also make some
> > code already in postgres easier.  There are obviously some issues
> > around guaranteeing that the maximum number of such buffers isn't
> > high.
>
> I have analyzed this further, and I think there is a problem if the
> record(s) will not fit into the current undo log and we have to switch
> logs.  Before knowing the actual record length we are not sure whether
> the undo log will switch or not, nor which undo log we will get.  And
> without knowing the logno (rnode), how are we going to pin the buffers?
> Am I missing something?
That's precisely why I was suggesting (at the start of the quoted block
above) to not associate the buffers with pages at that point.  Instead,
just have clean, pinned, *unassociated* buffers, which can then be
re-associated without any IO.

> Apart from this, while analyzing the other code I have noticed that in
> the current PG code there are a few places where we try to read a
> buffer while already holding a buffer lock.
>
> 1. In gistplacetopage:
>
>     {
>         ...
>         for (; ptr; ptr = ptr->next)
>         {
>             /* Allocate new page */
>             ptr->buffer = gistNewBuffer(rel);
>             GISTInitBuffer(ptr->buffer, (is_leaf) ? F_LEAF : 0);
>             ptr->page = BufferGetPage(ptr->buffer);
>             ptr->block.blkno = BufferGetBlockNumber(ptr->buffer);
>         }
>
> 2. During a page split we find a new buffer while holding the lock on
>    the current buffer.
>
> That doesn't mean that we can't do better, but I am just referring to
> existing code where we already have such issues.

Those are pretty clearly edge cases, whereas the undo case at hand is a
very common path.  Note again that heapam.c goes to considerable trouble
to never do this for common cases.

Greetings,

Andres Freund
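
To make the buffer-reservation idea discussed above a bit more concrete,
here is a small standalone C sketch.  It is not PostgreSQL code and does
not use the real buffer manager APIs; all names (ToyBufferDesc,
reserve_buffers, assign_tag, TAG_VALID, ...) are invented for
illustration.  It only models the scheme: reserve clean, pinned buffers
that carry no valid tag (so they imply no on-disk block and need no I/O),
then attach them to a concrete undo page later, once the record size and
therefore the undo log / block numbers are known.

    /*
     * Toy sketch only: models reserving clean, pinned, unassociated
     * buffers up front and re-associating them with a page later
     * without any I/O.  Not PostgreSQL code.
     */
    #include <stdint.h>
    #include <stdio.h>

    #define POOL_SIZE 8
    #define TAG_VALID 0x01      /* stand-in for BM_TAG_VALID */

    typedef struct ToyBufferDesc
    {
        uint32_t    flags;      /* TAG_VALID once a page is attached */
        uint32_t    refcount;   /* pin count */
        int         logno;      /* undo log number, -1 while unassociated */
        int         blkno;      /* block number, -1 while unassociated */
    } ToyBufferDesc;

    static ToyBufferDesc pool[POOL_SIZE];

    /*
     * Reserve up to 'n' clean buffers for the worst-case amount of undo a
     * later insertion might need.  They end up pinned (refcount 1) but
     * carry no valid tag, so no disk block is implied and no I/O happens
     * here.  Returns how many buffers could be reserved and stores their
     * indexes in 'reserved'.
     */
    static int
    reserve_buffers(int n, int *reserved)
    {
        int         found = 0;

        for (int i = 0; i < POOL_SIZE && found < n; i++)
        {
            if (pool[i].refcount == 0 && (pool[i].flags & TAG_VALID) == 0)
            {
                pool[i].refcount = 1;
                pool[i].logno = -1;
                pool[i].blkno = -1;
                reserved[found++] = i;
            }
        }
        return found;
    }

    /*
     * Attach a previously reserved buffer to a concrete undo page.  Pure
     * bookkeeping: the page is about to be written for the first time,
     * so nothing has to be read from disk.
     */
    static void
    assign_tag(int buf, int logno, int blkno)
    {
        pool[buf].logno = logno;
        pool[buf].blkno = blkno;
        pool[buf].flags |= TAG_VALID;
    }

    int
    main(void)
    {
        int         reserved[4];
        int         n = reserve_buffers(4, reserved);

        printf("reserved %d clean, pinned, unassociated buffers\n", n);

        /* later, once record sizes are known under the content lock: */
        assign_tag(reserved[0], 3 /* logno */ , 17 /* blkno */ );
        printf("buffer %d now tagged as log %d block %d\n",
               reserved[0], pool[reserved[0]].logno,
               pool[reserved[0]].blkno);
        return 0;
    }

In the real system the reservation step would presumably go through the
buffer manager's clock sweep and the re-association would set the buffer
tag under the usual locking rules; the sketch only shows why no read is
needed at the moment the tag is finally assigned.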