Hi,

(On phone, so crappy formatting and no source access.)
On February 19, 2022 3:08:41 PM PST, Peter Geoghegan <p...@bowt.ie> wrote:
>On Fri, Feb 18, 2022 at 5:00 PM Peter Geoghegan <p...@bowt.ie> wrote:
>> Another testing strategy occurs to me: we could stress-test the
>> implementation by simulating an environment where the no-cleanup-lock
>> path is hit an unusually large number of times, possibly a fixed
>> percentage of the time (like 1%, 5%), say by making vacuumlazy.c's
>> ConditionalLockBufferForCleanup() call return false randomly. Now that
>> we have lazy_scan_noprune for the no-cleanup-lock path (which is as
>> similar to the regular lazy_scan_prune path as possible), I wouldn't
>> expect this ConditionalLockBufferForCleanup() testing gizmo to be too
>> disruptive.
>
>I tried this out, using the attached patch. It was quite interesting,
>even when run against HEAD. I think that I might have found a bug on
>HEAD, though I'm not really sure.
>
>If you modify the patch to simulate conditions under which
>ConditionalLockBufferForCleanup() fails about 2% of the time, you get
>much better coverage of lazy_scan_noprune/heap_tuple_needs_freeze,
>without it being so aggressive as to make "make check-world" fail --
>which is exactly what I expected. If you are much more aggressive
>about it, and make it 50% instead (which you can get just by using the
>patch as written), then some tests will fail, mostly for reasons that
>aren't surprising or interesting (e.g. plan changes). This is also
>what I'd have guessed would happen.
>
>However, it gets more interesting. One thing that I did not expect to
>happen at all also happened (with the current 50% rate of simulated
>ConditionalLockBufferForCleanup() failure from the patch): if I run
>"make check" from the pg_surgery directory, then the Postgres backend
>gets stuck in an infinite loop inside lazy_scan_prune, which has been
>a symptom of several tricky bugs in the past year (not every time, but
>usually).
>Specifically, the VACUUM statement launched by the SQL command
>"vacuum freeze htab2;" from the file
>contrib/pg_surgery/sql/heap_surgery.sql, at line 54, leads to this
>misbehavior.
>
>This is a temp table, which is a choice made by the tests specifically
>because they need to "use a temp table so that vacuum behavior doesn't
>depend on global xmin". This is a convenient way of avoiding spurious
>regression test failures (e.g. from autoanalyze), and relies on the
>GlobalVisTempRels behavior established by Andres' 2020 bugfix commit
>94bc27b5.

We don't have a blocking path for cleanup locks on temporary buffers,
IIRC (it's normally not reachable). So I wouldn't be surprised if a
failing cleanup lock caused some odd behavior.

>It's quite possible that this is nothing more than a bug in my
>adversarial gizmo patch -- since I don't think that
>ConditionalLockBufferForCleanup() can ever fail with a temp buffer
>(though even that's not completely clear right now). Even if the
>behavior that I saw does not indicate a bug on HEAD, it still seems
>informative. At the very least, it wouldn't hurt to Assert() that the
>target table isn't a temp table inside lazy_scan_noprune, documenting
>our assumptions around temp tables and
>ConditionalLockBufferForCleanup().

Definitely worth looking into more.

This reminds me of a recent thing I noticed in the aio patch: spgist can
end up busy-looping when buffers are locked, instead of blocking. Not
actually related, of course.

Andres

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.