Hi,

On 2023-01-26 15:36:52 -0800, Peter Geoghegan wrote:
> On Thu, Jan 26, 2023 at 12:45 PM Andres Freund <and...@anarazel.de> wrote:
> > > Most of the overhead of FREEZE WAL records (with freeze plan
> > > deduplication and page-level freezing in) is generic WAL record header
> > > overhead. Your recent adversarial test case is going to choke on that,
> > > too. At least if you set checkpoint_timeout to 1 minute again.
> >
> > I don't quite follow. What do you mean with "record header overhead"?
> > Unless that includes FPIs, I don't think that's that commonly true?
>
> Even if there are no directly observable FPIs, there is still extra
> WAL, which can cause FPIs indirectly, just by making checkpoints more
> frequent. I feel ridiculous even having to explain this to you.
What does that have to do with "generic WAL record overhead"? I also don't
really see how that is responsive to anything else in my email.

That's just as true for the current gating condition (the issuance of an FPI
during heap_page_prune() / HTSV()).

What I was wondering about is whether we should replace the

  fpi_before != pgWalUsage.wal_fpi

with

  records_before != pgWalUsage.wal_records && !WouldIssueFpi(page)


> > The problematic case I am talking about is when we do *not* emit a WAL
> > record during pruning (because there's nothing to prune), but want to
> > freeze the table. If you don't log an FPI, the remaining big overhead is
> > that increasing the LSN on the page will often cause an XLogFlush() when
> > writing out the buffer.
> >
> > I don't see what your reference to checkpoint timeout is about here?
> >
> > Also, as I mentioned before, the problem isn't specific to
> > checkpoint_timeout = 1min. It just makes it cheaper to reproduce.
>
> That's flagrantly intellectually dishonest. Sure, it made it easier to
> reproduce. But that's not all it did!
>
> You had *lots* of specific numbers and technical details in your first
> email, such as "Time for vacuuming goes up to ~5x. WAL volume to
> ~9x.". But you did not feel that it was worth bothering with details
> like having set checkpoint_timeout to 1 minute, which is a setting
> that nobody uses, and obviously had a multiplicative effect. That
> detail was unimportant. I had to drag it out of you!

The multiples were for checkpoint_timeout=5min, with '250s' instead of '120s'
in

  WHERE ts < now() - '120s'

I started out with checkpoint_timeout=1min, as I wanted to quickly test my
theory. Then I increased checkpoint_timeout back to 5min and adjusted the
query to some randomly guessed value. Happened to get nearly the same
results. I then experimented more with '1min', because it's less annoying to
have to wait 120s until deletions start than to wait 250s.
Because it's quicker to run, I thought I'd share the less resource-intensive
version. A mistake, as I now realize.

This wasn't intended as a carefully designed benchmark, or anything. It was a
quick proof of a problem that I found obvious. And it's not something worth
testing carefully - e.g. the constants in the test are actually quite
hardware specific, because the inserts/second rate is very machine specific,
and it's completely unnecessarily hardware intensive due to the use of
single-row inserts instead of batched operations. It's just a POC.

> You basically found a way to add WAL overhead to a system/workload
> that is already in a write amplification vicious cycle, with latent
> tipping point type behavior.
>
> There is a practical point here, that is equally obvious, and yet
> somehow still needs to be said: benchmarks like that one are basically
> completely free of useful information. If we can't agree on how to
> assess such things in general, then what can we agree on when it comes
> to what should be done about it, what trade-off to make, when it comes
> to any similar question?

It's not at all free of useful information. It reproduces a problem I
predicted repeatedly, that others in the discussion also wondered about, and
that you refused to acknowledge or address.

It's not a good benchmark - I completely agree with that much. It was not
designed to carefully benchmark different settings or such. It was designed
to show a problem. And it does that.

> > You're right, it makes sense to consider whether we'll emit a
> > XLOG_HEAP2_VISIBLE anyway.
>
> As written the page-level freezing FPI mechanism probably doesn't
> really stand to benefit much from doing that. Either checksums are
> disabled and it's just a hint, or they're enabled and there is a very
> high chance that we'll get an FPI inside lazy_scan_prune rather than
> right after it is called, when PD_ALL_VISIBLE is set.
I think it might be useful with logged hint bits. Consider cases where all
the tuples on the page were already fully hinted - that's not uncommon, I
think?

> > > > A less aggressive version would be to check if any WAL records were
> > > > emitted during heap_page_prune() (instead of FPIs) and whether we'd
> > > > emit an FPI if we modified the page again. Similar to what we do now,
> > > > except not requiring an FPI to have been emitted.
> > >
> > > Also way more aggressive. Not nearly enough on its own.
> >
> > In which cases will it be problematically more aggressive?
> >
> > If we emitted a WAL record during pruning we've already set the LSN of the
> > page to a very recent LSN. We know the page is dirty. Thus we'll already
> > trigger an XLogFlush() during ringbuffer replacement. We won't emit an FPI.
>
> You seem to be talking about this as if the only thing that could
> matter is the immediate FPI -- the first order effects -- and not any
> second order effects.

	/*
	 * Freeze the page when heap_prepare_freeze_tuple indicates that at least
	 * one XID/MXID from before FreezeLimit/MultiXactCutoff is present.  Also
	 * freeze when pruning generated an FPI, if doing so means that we set the
	 * page all-frozen afterwards (might not happen until final heap pass).
	 */
	if (pagefrz.freeze_required || tuples_frozen == 0 ||
		(prunestate->all_visible && prunestate->all_frozen &&
		 fpi_before != pgWalUsage.wal_fpi))

That's just as true for this. What I'd like to know is why the second order
effects of the above are lesser than for

	if (pagefrz.freeze_required || tuples_frozen == 0 ||
		(prunestate->all_visible && prunestate->all_frozen &&
		 records_before != pgWalUsage.wal_records && !WouldIssueFpi(page)))

> You certainly didn't get to 9x extra WAL
> overhead by controlling for that before. Should I take it that you've
> decided to assess these things more sensibly now? Out of curiosity:
> why the change of heart?

Dude.
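To make the difference between the two gating conditions above concrete, here
is a minimal standalone sketch (not PostgreSQL code: the PageState struct,
the counter snapshots, and would_issue_fpi are stand-ins for pgWalUsage, the
prune state, and the hypothetical WouldIssueFpi()). It models the one case
the proposals diverge on - pruning emitted a WAL record but no FPI:

	#include <assert.h>
	#include <stdbool.h>
	#include <stdio.h>

	/* Stand-ins for the state lazy_scan_prune consults; all hypothetical. */
	typedef struct PageState
	{
		bool freeze_required;             /* pagefrz.freeze_required */
		int  tuples_frozen;
		bool all_visible;
		bool all_frozen;
		long fpi_before, fpi_now;         /* pgWalUsage.wal_fpi snapshots */
		long records_before, records_now; /* pgWalUsage.wal_records snapshots */
		bool would_issue_fpi;             /* stand-in for WouldIssueFpi(page) */
	} PageState;

	/* Current gating: opportunistic freeze only if pruning emitted an FPI. */
	static bool
	freeze_current(const PageState *p)
	{
		return p->freeze_required || p->tuples_frozen == 0 ||
			(p->all_visible && p->all_frozen && p->fpi_before != p->fpi_now);
	}

	/* Proposed gating: any WAL record during pruning, but no new FPI expected. */
	static bool
	freeze_proposed(const PageState *p)
	{
		return p->freeze_required || p->tuples_frozen == 0 ||
			(p->all_visible && p->all_frozen &&
			 p->records_before != p->records_now && !p->would_issue_fpi);
	}

	int
	main(void)
	{
		/* Pruning wrote a record but no FPI: only the proposed gating fires. */
		PageState pruned = {false, 5, true, true, 10, 10, 100, 101, false};
		assert(!freeze_current(&pruned));
		assert(freeze_proposed(&pruned));

		/* Freezing would force a fresh FPI: the proposed gating backs off. */
		pruned.would_issue_fpi = true;
		assert(!freeze_proposed(&pruned));

		printf("ok\n");
		return 0;
	}

The point of the sketch is just that the proposed condition widens the
trigger from "an FPI already happened" to "any WAL was written", while adding
a back-off so it never *causes* an FPI.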
What would the point have been to invest a lot of time in a repro for a
predicted problem? It's a problem repro, not a carefully designed benchmark.

> > > In any case this seems like an odd thing for you to say, having
> > > eviscerated a patch that really just made the same behavior trigger
> > > independently of FPIs in some tables, controlled via a GUC.
> >
> > jdksjfkjdlkajsd;lfkjasd;lkfj;alskdfj
> >
> > That behaviour I criticized was causing a torrent of FPIs and additional
> > dirtying of pages. My proposed replacement for the current FPI check
> > doesn't, because a) it only triggers when we wrote a WAL record b) it
> > doesn't trigger if we would write an FPI.
>
> It increases the WAL written in many important cases that
> vacuum_freeze_strategy_threshold avoided. Sure, it did have some
> problems, but the general idea of adding some high level
> context/strategies seems essential to me.

I was discussing changing the conditions for the "opportunistic pruning"
logic, not a replacement for the eager freezing strategy.

Greetings,

Andres Freund