On 2019-08-14 14:48:07 +0530, Dilip Kumar wrote:
> On Wed, Aug 14, 2019 at 12:27 PM Andres Freund <and...@anarazel.de> wrote:
> > - I think there's two fairly fundamental, and related, problems with
> >   the sequence outlined above:
> >
> >   - We can't search for target buffers to store undo data, while holding
> >     the "table" page content locked.
> >
> >     The easy way to solve this would be to require sites performing UNDO
> >     logging to acquire victim pages before even acquiring other content
> >     locks. Perhaps the better approach could be for the undo layer to
> >     hold onto a number of clean buffers, and to keep the last page of an
> >     already-written-to undo log pinned.
> >
> >   - We can't search for victim buffers for further undo data while
> >     already holding other undo pages content locked. Doing so means that
> >     while we're e.g. doing IO to clean out the new page, old undo data
> >     on the previous page can't be read.
> >
> >     This seems easier to fix. Instead of PrepareUndoInsert() acquiring,
> >     pinning and locking buffers, it'd need to be split into two
> >     operations. One that acquires buffers and pins them, and one that
> >     locks them.  I think it's quite possible that the locking operation
> >     could just be delayed until InsertPreparedUndo().  But if we solve
> >     the above problem, most of this might already be solved.
> Basically, that means
> - the caller should call PrepareUndoInsert before acquiring the table
> page content lock, right?  Because PrepareUndoInsert just computes
> the size, allocates the space and pins+locks the buffers, and for pinning
> the buffers we must compute the size and allocate the space using the undo
> storage layer.

I don't think we can normally pin the undo buffers properly at that
stage. Without knowing the correct contents of the table page - which we
can't know without holding some form of lock preventing modifications -
we can't know how big our undo records are going to be. And we can't
just have buffers in shared memory for blocks that don't exist on disk, and
we don't want to allocate undo space that we then don't need. So I think what
we'd have to do at that stage, is to "pre-allocate" buffers for the
maximum amount of UNDO needed, but mark the associated bufferdesc as not
yet valid. These buffers would have a pincount > 0, but BM_TAG_VALID
would not be set.
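
For illustration, a minimal sketch (not existing code; the function name is
made up) of what such a reserved buffer would look like in terms of the
existing state bits from src/include/storage/buf_internals.h:

#include "storage/buf_internals.h"

static bool
UndoBufferIsReserved(BufferDesc *bufHdr)
{
	uint32		state = pg_atomic_read_u32(&bufHdr->state);

	/*
	 * Pinned (refcount > 0), but no valid tag yet: no disk block is
	 * associated with the buffer, and nobody else can look it up via the
	 * buffer mapping table.
	 */
	return BUF_STATE_GET_REFCOUNT(state) > 0 &&
		(state & BM_TAG_VALID) == 0;
}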

So at the start of a function that will need to insert undo we'd need to
pre-reserve the maximum number of buffers we could potentially
need. That reservation stage would

a) pin the page with the current end of the undo
b) if needed pin the page of older undo that we need to update (e.g. to
   update the next pointer)
c) perform clock sweep etc to acquire (find or create) enough clean buffers
   to hold the maximum amount of undo needed. These buffers would be marked
   as not yet valid, as described above (pincount > 0, but BM_TAG_VALID not
   set).

I assume that we'd make a) cheap by keeping that end-of-log page pinned
for undo logs that a backend is actively attached to. b) should only be
needed once in a
transaction, so it's not too bad. c) we'd probably need to amortize
across multiple undo insertions, by keeping the unused buffers pinned
until the end of the transaction.
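
To make the shape of that concrete, here's a very rough sketch of what such
a reservation API could look like (all of these names are made up for
illustration, none of this exists):

typedef struct UndoBufferReservation
{
	Buffer		end_of_log_buf;	/* a) page at the current end of the undo log */
	Buffer		prev_undo_buf;	/* b) older undo page whose next pointer needs updating */
	Buffer		spare_bufs[8];	/* c) clean, not-yet-valid buffers; 8 is just a placeholder for the maximum */
	int			nspare;
} UndoBufferReservation;

/*
 * Called before acquiring the table page's content lock.  Pins everything,
 * locks nothing.  The spare buffers stay pinned until end of transaction, so
 * the clock sweep cost is amortized across multiple undo insertions.
 */
extern void UndoReserveBuffers(UndoBufferReservation *res, Size max_undo_size);

/*
 * Called once the actual undo record size is known, i.e. with the table
 * page's content lock held; locks and initializes only the buffers actually
 * needed.
 */
extern void UndoInsertReserved(UndoBufferReservation *res,
							   UnpackedUndoRecord *rec);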

I assume that having the infrastructure for c) might also make some code
already in postgres easier. There are obviously some issues around
guaranteeing that the maximum number of such buffers isn't too high.

> - So basically, if we delay the lock till InsertPreparedUndo and call
> PrepareUndoInsert before acquiring the table page content lock, this
> problem is solved?
> Although I haven't yet analyzed the AM-specific part, i.e. whether it's
> always possible to call PrepareUndoInsert (basically getting the whole
> undo record ready) before taking the page content lock.  But I am sure
> that won't be a particularly difficult part.

I think that is somewhere between not possible, and so expensive in a
lot of cases that we'd not want to do it anyway. You'd at least have to
first acquire a content lock on the page, mark the target tuple as
locked, then unlock the page, reserve undo, lock the table page again,
and actually update it.
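
In pseudo-code, with hypothetical helper names, that sequence would be
something like:

	LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);	/* content lock on the table page */
	MarkTupleLocked(buf, offnum);		/* hypothetical: mark the target tuple locked */
	LockBuffer(buf, BUFFER_LOCK_UNLOCK);

	PrepareUndoInsert(...);			/* reserve/pin undo buffers, no content locks held */

	LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);	/* re-acquire the content lock */
	/* perform the actual page modification and InsertPreparedUndo() */
	LockBuffer(buf, BUFFER_LOCK_UNLOCK);

i.e. an extra lock/unlock cycle and an extra tuple state change for every
modification.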

> >   - When reading an undo record, the whole stage of UnpackUndoData()
> >     reading data into the UndoPackContext is omitted, reading directly
> >     into the UnpackedUndoRecord. That removes one further copy of the
> >     record format.
> So we will read member by member into the UnpackedUndoRecord?  Because in
> the context we have at least a few headers packed, and we can memcpy one
> header at a time, like UndoRecordHeader, UndoRecordBlock.

Well, right now you then copy them again later, so not much is gained by
that (although that later copy can happen without the content lock
held). As I think I suggested before, I suspect that the best way would
be to just memcpy() the data from the page(s) into an appropriately
sized buffer with the content lock held, and then perform unpacking
directly into UnpackedUndoRecord. Especially with the bulk API that will
avoid having to do much work with locks held, and reduce memory usage by
only unpacking the record(s) in a batch that are currently being looked at.
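
Roughly like the following, with hypothetical helper names, just to
illustrate the pattern:

	char	   *raw = palloc(rec_size);

	LockBuffer(undo_buf, BUFFER_LOCK_SHARE);
	memcpy(raw, (char *) BufferGetPage(undo_buf) + page_offset, bytes_on_page);
	LockBuffer(undo_buf, BUFFER_LOCK_UNLOCK);
	/* repeat for continuation pages, if any */

	/* no content locks held from here on */
	UnpackUndoRecord(raw, &unpacked);	/* hypothetical: fills an UnpackedUndoRecord */
	pfree(raw);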

> But those are just a few of them, so if we copy field by field into the
> UnpackedUndoRecord then we can get rid of copying into the context and then
> copying it back to the UnpackedUndoRecord.  Is this what you have in mind,
> or do you want to store these structures (UndoRecordHeader, UndoRecordBlock)
> directly into UnpackedUndoRecord?

I at the moment see no reason not to?
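
I.e. something along these lines (illustration only; the real layout in the
patch set may well differ):

typedef struct UnpackedUndoRecord
{
	UndoRecordHeader uur_hdr;	/* fixed-size header, memcpy'd straight off the page */
	UndoRecordBlock uur_blk;	/* block information, likewise */
	/* ... variable-length payload pointers, etc. ... */
} UnpackedUndoRecord;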



Andres Freund
