On Wed, Aug 21, 2019 at 4:44 AM Andres Freund <and...@anarazel.de> wrote:
> On 2019-08-20 21:02:18 +1200, Thomas Munro wrote:
> > 1.  Anyone is allowed to try to read or write data at any UndoRecPtr
> > that has been allocated, through the buffer pool (though you'd usually
> > want to check it with UndoRecPtrIsDiscarded() first, and only rely on
> > the system I'm describing to deal with races).
> >
> > 2.  ReadBuffer() might return InvalidBuffer.  This can happen for a
> > cache miss, if the smgrread implementation wants to indicate that the
> > buffer has been discarded/truncated and that is expected (md.c won't
> > ever do that, but undofile.c can).
>
> Hm. This gives me a bit of a stomach ache. It somehow feels like a weird
> form of signalling.  Can't quite put my finger on why it makes me feel
> queasy.

Well, if we're going to allow concurrent access and discarding, there
has to be some part of the system where you can discover that the
thing you wanted is gone.  What's wrong with discovering it here?

Stepping back a bit... why do we have a new concept here?  The reason
we don't have anything like this for relations currently is that we
don't have live references to blocks that are expected to be
concurrently truncated, due to heavyweight locking.  But the whole
purpose of the undo system is to allow cheap truncation/discarding of
data that you *do* have live references to, and furthermore the
discarding is expected to be frequent.  The present experiment is
about trying to do so without throwing our hands up and using a
pessimistic lock.

At one point Robert and I discussed some kind of scheme where you'd
register your interest in a range of the log before you begin reading
(some kind of range locks or hazard pointers), so that you would block
discarding in that range, but the system would still allow someone to
read in the middle of the log while the discard worker concurrently
discards non-overlapping data at the end.  But I kept returning to the
idea that the buffer pool already has block-level range locking of
various kinds.  You can register your interest in a range by pinning
the buffers.  That's when you'll find out if the range is already
gone.  We could add an extra layer of range locking around that, but
it wouldn't be any better, it'd just thrash your bus a bit more, and
require more complexity in the discard worker (it has to defer
discarding a bit, and either block or go away and come back later).
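
To make that concrete, here's roughly the access pattern I expect
readers to follow, in sketch form.  UndoRecPtrIsDiscarded(),
InvalidBuffer and the general shape come from the scheme described
above, but ReadUndoBuffer() and UndoRecPtrGetPageOffset() are just
stand-in names for this example, not the real interface:

static bool
try_read_undo_bytes(UndoRecPtr ptr, char *dest, Size nbytes)
{
    Buffer      buf;

    /* Cheap pre-check; this only filters the common case, races remain. */
    if (UndoRecPtrIsDiscarded(ptr))
        return false;

    /* InvalidBuffer here means the block was discarded while we raced. */
    buf = ReadUndoBuffer(ptr);
    if (!BufferIsValid(buf))
        return false;

    LockBuffer(buf, BUFFER_LOCK_SHARE);
    memcpy(dest, (char *) BufferGetPage(buf) + UndoRecPtrGetPageOffset(ptr),
           nbytes);
    LockBuffer(buf, BUFFER_LOCK_UNLOCK);
    ReleaseBuffer(buf);

    return true;
}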

> > 3.  UndoLogDiscard() uses DiscardBuffer() to invalidate any currently
> > unpinned buffers, and marks as BM_DISCARDED any that happen to be
> > pinned right now, so they can't be immediately invalidated.  Such
> > buffers are never written back and are eligible for reuse on the next
> > clock sweep, even if they're written into by a backend that managed to
> > do that when we were trying to discard.
>
> Hm. When is it legitimate for a backend to write into such a buffer? I
> guess that's about updating the previous transaction's next pointer? Or
> progress info?

Yes, previous transaction header's next pointer, and progress counter
during rollback.  We're mostly interested in the next pointer here,
because the progress counter would normally not be updated at a
time when the page might be concurrently discarded.  The exception to
that is a superuser running CALL pg_force_discard_undo() (a
data-eating operation designed to escape a system that can't
successfully roll back and gets stuck, blowing away
not-yet-rolled-back undo records).

Here are some other ideas about how to avoid conflicts between
discarding and transaction header update:

1.  Lossy self-update-only: You could say that transactions are only
allowed to write to their own transaction header, and then have them
periodically update their own length in their own transaction header,
and then teach the discard worker that the length information is only
a starting point for a linear search for the next transaction based on
page header information.  That removes complexity for writers, but
adds complexity and IO and CPU to the discard worker.  Bleugh.

2.  Strict self-update-only:  We could update it as part of
transaction cleanup.  That is, when you commit or abort, probably at
some time when your transaction is still advertised as running, you go
and update your own transaction header with the size.  If you
never reach that stage, I think you can fix it during crash recovery,
during the startup scan that feeds the rollback request queues.  That
is, if you encounter a transaction header with length == 0, it must be
the final one and its length is therefore known and can be updated,
before you allow new transactions to begin.  There are some
complications like backends that exit without crashing, which I
haven't thought about.  As Amit just pointed out to me, that means
that the update is not folded into the same buffer access as the next
transaction, but perhaps you can mitigate that by not updating your
header if the next header will be on the same page -- the next
transaction can do it safely then (this page with the insert pointer
on it can't be discarded).  As Dilip just pointed out to me, it means
that you always do an update that you might never need to do if
the transaction is discarded, to which I have no answer.  Bleugh.

3.  Perhaps there is a useful middle ground between ideas 1 and 2: if
it's 0, the discard worker will perform a scan of page headers to
compute the correct value, but if it's non-zero it will consider it to
be correct and trust that value.  The extra work would only happen
after crash recovery or things like elog(FATAL, ...).
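
In other words, something along these lines in the discard worker,
where the helper and field names are made up for illustration and
don't reflect the actual on-disk layout:

static UndoRecPtr
next_xact_start(UndoRecPtr xact_start, UndoTransactionHeader *hdr)
{
    /* Non-zero lengths are trusted; this is the common, cheap case. */
    if (hdr->length != 0)
        return advance_undo_ptr(xact_start, hdr->length);

    /*
     * length == 0 should only be seen after a crash or FATAL exit, so
     * paying for a linear scan of page headers here is acceptable.
     */
    return scan_page_headers_for_next_xact(xact_start);
}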

4.  You could keep one extra transaction around all the time.  That
is, because we know we only ever want to stamp the transaction header
of the previous transaction, don't let a transaction that hasn't been
stamped with a length yet be discarded.  But now we waste more space!

5.  You could figure out a scheme to advertise the block number of the
start of the previous transaction.  You could have an LWLock that you
have to take to stamp the transaction header of the previous
transaction, and UndoLogDiscard() only has to take the lock if it
wants to discard a range that overlaps with that block.  This avoids
contention for some workloads, but not others, so it seems like a half
measure, and again you still have to deal with InvalidBuffer when
reading.  It's basically range locking; the buffer pool is already a
kind of range locking scheme!
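
To illustrate idea 5 (names invented for the example, not taken from
the patch):

/* Writer: stamp the previous transaction header under the lock. */
LWLockAcquire(&log->stamp_lock, LW_EXCLUSIVE);
stamp_transaction_header(log->prev_xact_ptr, size);
LWLockRelease(&log->stamp_lock);

/* UndoLogDiscard(): only serialize if the discard range overlaps. */
if (discard_ptr > log->prev_xact_ptr)
{
    LWLockAcquire(&log->stamp_lock, LW_EXCLUSIVE);
    /* ... discard the overlapping part ... */
    LWLockRelease(&log->stamp_lock);
}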

These schemes are all about avoiding conflicts between discarding and
writing, but you'd still have to tolerate InvalidBuffer for reads (ie
reading zheap records) anyway, so I suppose you might as well just
treat updates the same way and not worry about any of the above.

> > 5.  Separating begin from discard allows the WAL logging for
> > UndoLogDiscard() to do filesystem actions before logging, and other
> > effects after logging, which have several nice properties if you work
> > through the various crash scenarios.
>
> Hm. ISTM we always need to log before doing some filesystem operation
> (see also my recent complaint Robert and I are discussing at the bottom
> of [1]). It's just that we can have a separate stage afterwards?
>
> [1] 
> https://www.postgresql.org/message-id/CA%2BTgmoZc5JVYORsGYs8YnkSxUC%3DcLQF1Z%2BfcpH2TTKvqkS7MFg%40mail.gmail.com

I talked about this a bit with Robert and he pointed out that it's
probably not actually necessary to WAL-log these operations at all,
now that 'begin' and 'end' (= physical storage range) have been
separated from 'discard' and 'insert' (active undo data range).
Instead you could do it like this:

1.  Maintain begin and end pointers in shared memory only, no WAL, no
checkpoint.
2.  Compute their initial values by scanning the filesystem at startup time.
3.  Treat the (logical) discard and insert pointers as today: WAL-log
before updating shared memory, and include them in checkpoints.
4.  begin must be <= discard, and end must be >= insert, or else PANIC.

I'm looking into that.
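
For point 4, the check itself would be trivial; roughly the
following, where the struct and field names are placeholders rather
than the real definitions:

static void
check_undo_log_pointers(UndoLogControl *log)
{
    /* begin/end are recovered from the filesystem, discard/insert from WAL. */
    if (log->begin > log->discard || log->insert > log->end)
        elog(PANIC,
             "undo log %u inconsistent: begin " UINT64_FORMAT ", discard "
             UINT64_FORMAT ", insert " UINT64_FORMAT ", end " UINT64_FORMAT,
             log->logno, log->begin, log->discard, log->insert, log->end);
}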

> > So now I'd like to get feedback on the sanity of this scheme.  I'm not
> > saying it doesn't have bugs right now -- I've been trying to figure
> > out good ways to test it and I'm not quite there yet -- but the
> > concept.  One observation I have is that there were already code paths
> > in undoaccess.c that can tolerate InvalidBuffer in recovery, due to
> > the potentially different discard timing for DO vs REDO.  I think
> > that's a point in favour of this scheme, but I can see that it's
> > inconvenient to have to deal with InvalidBuffer whenever you read.
>
> FWIW, I'm far from convinced that those are currently quite right. See
> discussion pointed to above.

Yeah.  It seems highly desirable to make it so that all decisions
about whether a write to an undo block is required or should be
skipped are made on the primary, so that WAL replay just does what it's
told.  I am working on that.

-- 
Thomas Munro
https://enterprisedb.com

