Hi,

On 2023-01-26 15:36:52 -0800, Peter Geoghegan wrote:
> On Thu, Jan 26, 2023 at 12:45 PM Andres Freund <and...@anarazel.de> wrote:
> > > Most of the overhead of FREEZE WAL records (with freeze plan
> > > deduplication and page-level freezing in) is generic WAL record header
> > > overhead. Your recent adversarial test case is going to choke on that,
> > > too. At least if you set checkpoint_timeout to 1 minute again.
> >
> > I don't quite follow. What do you mean with "record header overhead"?
> > Unless that includes FPIs, I don't think that's that commonly true?
>
> Even if there are no directly observable FPIs, there is still extra
> WAL, which can cause FPIs indirectly, just by making checkpoints more
> frequent. I feel ridiculous even having to explain this to you.
What does that have to do with "generic WAL record overhead"? I also don't
really see how that is responsive to anything else in my email.

That's just as true for the current gating condition (the issuance of an FPI
during heap_page_prune() / HTSV()).

What I was wondering about is whether we should replace the

  fpi_before != pgWalUsage.wal_fpi

with

  records_before != pgWalUsage.wal_records && !WouldIssueFpi(page)


> > The problematic case I am talking about is when we do *not* emit a WAL
> > record during pruning (because there's nothing to prune), but want to
> > freeze the table. If you don't log an FPI, the remaining big overhead is
> > that increasing the LSN on the page will often cause an XLogFlush() when
> > writing out the buffer.
> >
> > I don't see what your reference to checkpoint timeout is about here?
> >
> > Also, as I mentioned before, the problem isn't specific to
> > checkpoint_timeout = 1min. It just makes it cheaper to reproduce.
>
> That's flagrantly intellectually dishonest. Sure, it made it easier to
> reproduce. But that's not all it did!
>
> You had *lots* of specific numbers and technical details in your first
> email, such as "Time for vacuuming goes up to ~5x. WAL volume to
> ~9x.". But you did not feel that it was worth bothering with details
> like having set checkpoint_timeout to 1 minute, which is a setting
> that nobody uses, and obviously had a multiplicative effect. That
> detail was unimportant. I had to drag it out of you!

The multiples were for checkpoint_timeout=5min, with '250s' instead of '120s'
in

  WHERE ts < now() - '120s'

I started out with checkpoint_timeout=1min, as I wanted to quickly test my
theory. Then I increased checkpoint_timeout back to 5min and adjusted the
query to some randomly guessed value. Happened to get nearly the same
results. I then experimented more with '1min', because it's less annoying to
have to wait 120s until deletions start than to wait 250s.
Because it's quicker to run, I thought I'd share the less resource-intensive
version. A mistake, as I now realize.

This wasn't intended as a carefully designed benchmark, or anything. It was a
quick proof of a problem that I found obvious. And it's not something worth
testing carefully - e.g. the constants in the test are actually quite
hardware specific, because the inserts/second rate is very machine specific,
and it's completely unnecessarily hardware intensive due to the use of
single-row inserts instead of batched operations. It's just a POC.

> You basically found a way to add WAL overhead to a system/workload
> that is already in a write amplification vicious cycle, with latent
> tipping point type behavior.
>
> There is a practical point here, that is equally obvious, and yet
> somehow still needs to be said: benchmarks like that one are basically
> completely free of useful information. If we can't agree on how to
> assess such things in general, then what can we agree on when it comes
> to what should be done about it, what trade-off to make, when it comes
> to any similar question?

It's not at all free of useful information. It reproduces a problem I
predicted repeatedly, that others in the discussion also wondered about, and
that you refused to acknowledge or address.

It's not a good benchmark - I completely agree with that much. It was not
designed to carefully benchmark different settings or such. It was designed
to show a problem. And it does that.

> > You're right, it makes sense to consider whether we'll emit a
> > XLOG_HEAP2_VISIBLE anyway.
>
> As written the page-level freezing FPI mechanism probably doesn't
> really stand to benefit much from doing that. Either checksums are
> disabled and it's just a hint, or they're enabled and there is a very
> high chance that we'll get an FPI inside lazy_scan_prune rather than
> right after it is called, when PD_ALL_VISIBLE is set.
I think it might be useful with logged hint bits. Consider cases where all
the tuples on the page were already fully hinted - that's not uncommon, I
think?

> > > > A less aggressive version would be to check if any WAL records were
> > > > emitted during heap_page_prune() (instead of FPIs) and whether we'd
> > > > emit an FPI if we modified the page again. Similar to what we do now,
> > > > except not requiring an FPI to have been emitted.
> > >
> > > Also way more aggressive. Not nearly enough on its own.
> >
> > In which cases will it be problematically more aggressive?
> >
> > If we emitted a WAL record during pruning we've already set the LSN of the
> > page to a very recent LSN. We know the page is dirty. Thus we'll already
> > trigger an XLogFlush() during ringbuffer replacement. We won't emit an FPI.
>
> You seem to be talking about this as if the only thing that could
> matter is the immediate FPI -- the first order effects -- and not any
> second order effects.

	/*
	 * Freeze the page when heap_prepare_freeze_tuple indicates that at least
	 * one XID/MXID from before FreezeLimit/MultiXactCutoff is present.  Also
	 * freeze when pruning generated an FPI, if doing so means that we set the
	 * page all-frozen afterwards (might not happen until final heap pass).
	 */
	if (pagefrz.freeze_required || tuples_frozen == 0 ||
		(prunestate->all_visible && prunestate->all_frozen &&
		 fpi_before != pgWalUsage.wal_fpi))

That's just as true for this. What I'd like to know is why the second order
effects of the above are lesser than for

	if (pagefrz.freeze_required || tuples_frozen == 0 ||
		(prunestate->all_visible && prunestate->all_frozen &&
		 records_before != pgWalUsage.wal_records && !WouldIssueFpi(page)))

> You certainly didn't get to 9x extra WAL
> overhead by controlling for that before. Should I take it that you've
> decided to assess these things more sensibly now? Out of curiosity:
> why the change of heart?

Dude.
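To make the difference between the two gating conditions above concrete, here
is a minimal standalone sketch (not PostgreSQL code: the PageState struct,
the counter snapshots, and would_issue_fpi are stand-ins for pgWalUsage, the
prune state, and the hypothetical WouldIssueFpi()). It models the one case
the proposals diverge on - pruning emitted a WAL record but no FPI:

	#include <assert.h>
	#include <stdbool.h>
	#include <stdio.h>

	/* Stand-ins for the state lazy_scan_prune consults; all hypothetical. */
	typedef struct PageState
	{
		bool freeze_required;             /* pagefrz.freeze_required */
		int  tuples_frozen;
		bool all_visible;
		bool all_frozen;
		long fpi_before, fpi_now;         /* pgWalUsage.wal_fpi snapshots */
		long records_before, records_now; /* pgWalUsage.wal_records snapshots */
		bool would_issue_fpi;             /* stand-in for WouldIssueFpi(page) */
	} PageState;

	/* Current gating: opportunistic freeze only if pruning emitted an FPI. */
	static bool
	freeze_current(const PageState *p)
	{
		return p->freeze_required || p->tuples_frozen == 0 ||
			(p->all_visible && p->all_frozen && p->fpi_before != p->fpi_now);
	}

	/* Proposed gating: any WAL record during pruning, but no new FPI expected. */
	static bool
	freeze_proposed(const PageState *p)
	{
		return p->freeze_required || p->tuples_frozen == 0 ||
			(p->all_visible && p->all_frozen &&
			 p->records_before != p->records_now && !p->would_issue_fpi);
	}

	int
	main(void)
	{
		/* Pruning wrote a record but no FPI: only the proposed gating fires. */
		PageState pruned = {false, 5, true, true, 10, 10, 100, 101, false};
		assert(!freeze_current(&pruned));
		assert(freeze_proposed(&pruned));

		/* Freezing would force a fresh FPI: the proposed gating backs off. */
		pruned.would_issue_fpi = true;
		assert(!freeze_proposed(&pruned));

		printf("ok\n");
		return 0;
	}

The point of the sketch is just that the proposed condition widens the
trigger from "an FPI already happened" to "any WAL was written", while adding
a back-off so it never *causes* an FPI.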
What would the point have been to invest a lot of time in a repro for a
predicted problem? It's a problem repro, not a carefully designed benchmark.

> > > In any case this seems like an odd thing for you to say, having
> > > eviscerated a patch that really just made the same behavior trigger
> > > independently of FPIs in some tables, controlled via a GUC.
> >
> > jdksjfkjdlkajsd;lfkjasd;lkfj;alskdfj
> >
> > That behaviour I criticized was causing a torrent of FPIs and additional
> > dirtying of pages. My proposed replacement for the current FPI check
> > doesn't, because a) it only triggers when we wrote a WAL record b) it
> > doesn't trigger if we would write an FPI.
>
> It increases the WAL written in many important cases that
> vacuum_freeze_strategy_threshold avoided. Sure, it did have some
> problems, but the general idea of adding some high level
> context/strategies seems essential to me.

I was discussing changing the conditions for the "opportunistic pruning"
logic, not a replacement for the eager freezing strategy.

Greetings,

Andres Freund