On Wed, Aug 14, 2019 at 12:27 PM Andres Freund <and...@anarazel.de> wrote:
> Hi,

> - I think there's two fairly fundamental, and related, problems with
>   the sequence outlined above:
>   - We can't search for target buffers to store undo data, while holding
>     the "table" page content locked.
>     The easy way to solve would be to require sites performing UNDO
>     logging to acquire victim pages before even acquiring other content
>     locks. Perhaps the better approach could be for the undo layer to
>     hold onto a number of clean buffers, and to keep the last page in an
>     already written to undo log pinned.
>   - We can't search for victim buffers for further undo data while
>     already holding other undo pages content locked. Doing so means that
>     while we're e.g. doing IO to clean out the new page, old undo data
>     on the previous page can't be read.
>     This seems easier to fix. Instead of PrepareUndoInsert() acquiring,
>     pinning and locking buffers, it'd need to be split into two
>     operations. One that acquires buffers and pins them, and one that
>     locks them.  I think it's quite possible that the locking operation
>     could just be delayed until InsertPreparedUndo().  But if we solve
>     the above problem, most of this might already be solved.

Basically, that means
- the caller should call PreparedUndoInsert before acquiring table
page content lock right? because the PreparedUndoInsert just compute
the size, allocate the space and pin+lock the buffers and for pinning
the buffers we must compute the size and allocate the space using undo
storage layer.
- So basically, if we delay the lock till InsertPreparedUndo and call
PrepareUndoInsert before acquiring table page content lock this
problem is solved?

Although I haven't yet analyzed the AM specific part that whether it's
always possible to call the PrepareUndoInsert(basically getting all
the undo record ready) before the page content lock.  But, I am sure
that won't be much difficult part.

> - To me the current split between the packed and unpacked UNDO record
>   formats makes very little sense, the reasoning behind having them is
>   poorly if at all documented, results in extremely verbose code, and
>   isn't extensible.
>   When preparing to insert an undo record the in-buffer size is computed
>   with UndoRecordHeaderSize() (needs to know about all optional data)
>   from within PrepareUndoInsert() (which has a bunch a bunch of
>   additional knowledge about the record format). Then during insertion
>   InsertPreparedUndo(), first copies the UnpackedUndoRecord into an
>   UndoPackContext (again needing ...), and then, via InsertUndoData(),
>   copies that in small increments into the respective buffers (again
>   needing knowledge about the complete record format, two copies
>   even). Beside the code duplication, that also means the memory copies
>   are very inefficient, because they're all done in tiny increments,
>   multiple times.
>   When reading undo it's smilar: UnpackUndoData(), again in small
>   chunks, reads the buffer data into an UndoPackContext (another full
>   copy of the unpacked record format). But then FinishUnpackUndo()
>   *again* copies all that data, into an actual UnpackedUndoRecord
>   (again, with a copy of the record format, albeit slightly different
>   looking).
>   I'm not convinced by Heikki's argument that we shouldn't have
>   structure within undo records. In my opinion that is a significant
>   weakness of how WAL was initially designed, and even after Heikki's
>   work, still is a problem.  But this isn't the right design either.
>   Even if were to stay with the current fixed record format, I think
>   the current code needs a substantial redesign:
>   - I think 'packing' during insertion needs to serialize into a char*
>     allocation during PrepareUndoInsert

 computing the size in parallel
>     (or perhaps in InsertPreparedUndo, but probably not).  The size of
>     the record should never be split across record boundaries
>     (i.e. we'll leave that space unused if we otherwise would need to
>     split the size).
I think before UndoRecordAllocate we need to detect this part that
whether the size of the record will start from the last byte of the
page and if so then allocate one extra byte for the undo record.  Or
always allocate one extra byte for the undo record for handling this
case.  And, in FinalizeUndoAdvance only pass the size how much we have
actually consumed.

  The actual insertion would be a maximally sized
>     memcpy() (so we'd as many memcpys as the buffer fits in, rather than
>     one for each sub-type of a record).
>     That allows to remove most of the duplicated knowledge of the record
>     format, and makes insertions faster (by doing only large memcpys
>     while holding exclusive content locks).
>   - When reading an undo record, the whole stage of UnpackUndoData()
>     reading data into a the UndoPackContext is omitted, reading directly
>     into the UnpackedUndoRecord. That removes one further copy of the
>     record format.
So we will read member by member to UnpackedUndoRecord?  because in
context we have at least a few headers packed and we can memcpy one
header at a time like UndoRecordHeader, UndoRecordBlock. But that just
a few of them so if we copy field by field in the UnpackedUndoRecord
then we can get rid of copying in context then copy it back to the
UnpackedUndoRecord.  Is this is what in your mind or you want to store
these structures (UndoRecordHeader, UndoRecordBlock) directly into

>   - To avoid having separate copies of the record format logic, I'd
>     probably encode it into *one* array of metadata. If we had
>     {offsetoff(UnpackedUndoRecord, member),
>      membersize(UnpackedUndoRecord, member),
>      flag}
>     we could fairly trivially remove most knowledge from the places
>     currently knowing about the record format.
Seems interesting.  I will work on this.

>   I have some vague ideas for how to specify the format in a way that is
>   more extensible, but with more structure than just a blob of data. But
>   I don't think they're there yet.
> - The interface to read undo also doesn't seem right to me. For one
>   there's various different ways to read records, with associated code
>   duplication (batch, batch in "one page" mode - but that's being
>   removed now I think, single record mode).
>   I think the batch mode is too restrictive. We might not need this
>   during the first merged version, but I think before long we're going
>   to want to be able to efficiently traverse all the undo chains we need
>   to determine the visibility of all tuples on a page. Otherwise we'll
>   cause a lot of additional synchronous read IO, and will repeatedly
>   re-fetch information, especially during sequential scans for an older
>   snapshot.  I think I briefly outlined this in an earlier email - my
>   current though is that the batch interface (which the non-batch
>   interface should just be a tiny helper around), should basically be a
>   queue of "to-be-fetched" undo records. When batching reading an entire
>   transaction, all blocks get put onto that queue. When traversing
>   multiple chains, the chains are processed in a breadth-first fashion
>   (by just looking at the queue, and pushing additional work to the
>   end). That allows to efficiently issue prefetch requests for blocks to
>   be read in the near future.

I need to analyze this part.

>   I think that batch reading should just copy the underlying data into a
>   char* buffer. Only the records that currently are being used by
>   higher layers should get exploded into an unpacked record. That will
>   reduce memory usage quite noticably (and I suspect it also drastically
>   reduce the overhead due to a large context with a lot of small
>   allocations that then get individually freed).

Ok, I got your idea.  I will analyze it further and work on this if
there is no problem.

  That will make the
>   sorting of undo a bit more CPU inefficient, because individual records
>   will need to be partially unpacked for comparison, but I think that's
>   going to be a far smaller loss than the win.
> - My reading of the current xact.c integration is that it's not workable
>   as is. Undo is executed outside of a valid transaction state,
>   exceptions aren't properly undone, logic would need to be duplicated
>   to a significant degree, new kind of critical section.

Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Reply via email to