On Thu, Mar 19, 2026 at 1:17 PM Robert Haas <[email protected]> wrote:
> > I would dig into why grison and schnauzer are failing this test,
> > except that I don't agree that we should be running it in the
> > first place.
>
> I'll go have a look.
grison failed like this:

 CREATE INDEX ON vaccluster(wrap_do_analyze(i));
+ERROR:  indexes on virtual generated columns are not supported

This is a surprising diff, because either that command is trying to
create an index on a virtual generated column, or it's not; and
whichever it is should be equally true on every machine in the
buildfarm, whether or not test_plan_advice is running. The problem
might be that the first TupleDescAttr(...) == ATTRIBUTE_GENERATED_VIRTUAL
test in DefineIndex can be reached with attno == 0, which seems like
it will then access memory that is not part of the TupleDesc. If that
memory happens to contain a 'v' in the right spot, this error will be
raised. (Is it not viable to have TupleDescAttr() assert that the
second argument is >= 0?)

skink has a failure that looks like this:

+WARNING:  supplied plan advice was not enforced
+DETAIL:  advice NESTED_LOOP_MEMOIZE(nt) feedback is "matched, failed"

I think this is caused by a minor bug in the pgs_mask infrastructure.
get_memoize_path() exits quickly when outer_path->parent->rows < 2, on
the theory that the resulting path must lose on cost. But that
presumes that we could do a plain nested loop instead, i.e. that
PGS_NESTLOOP_PLAIN is set, and it might not be. Before the pgs_mask
stuff, that case couldn't happen: enable_nestloop=off disabled all
nested loops, and enable_memoize=off disabled only the memoized
version, but there wasn't any way to disable only the non-memoized
version (which, of course, was part of the point of the whole thing).
I think the fix is as attached.

Unfortunately, the other failures look like they are pointing to a
rather more serious problem.
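To make the out-of-range TupleDescAttr() access concrete: a minimal,
self-contained sketch of a range-checked accessor follows. The struct
definitions here are simplified stand-ins invented for illustration
(the real ones live in src/include/access/tupdesc.h, and the real
TupleDescAttr is not this function), but they show how an assertion on
the index would have turned the attno == 0 case (index -1 after the
usual "attno - 1" conversion) into an immediate assertion failure
rather than a stray memory read that sometimes happens to see a 'v'.

```c
#include <assert.h>

/* Hypothetical, simplified stand-ins for PostgreSQL's tuple descriptor
 * types; invented here for illustration only. */
typedef struct FormData_pg_attribute
{
    char        attgenerated;   /* e.g. 'v' for a virtual generated column */
} FormData_pg_attribute;

typedef struct TupleDescData
{
    int         natts;          /* number of attributes */
    FormData_pg_attribute attrs[1];     /* variable-length in reality */
} TupleDescData;

typedef TupleDescData *TupleDesc;

/* Sketch of a range-checked accessor: asserting 0 <= i < natts catches
 * a caller that passes attno - 1 with attno == 0, instead of silently
 * reading memory that is not part of the TupleDesc. */
static inline FormData_pg_attribute *
TupleDescAttr_checked(TupleDesc tupdesc, int i)
{
    assert(i >= 0 && i < tupdesc->natts);
    return &tupdesc->attrs[i];
}
```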
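The shape of the get_memoize_path() fix described above can be
sketched as follows. This is not the attached patch; the mask names
other than PGS_NESTLOOP_PLAIN and the helper function are invented for
illustration. The point is simply that the "fewer than two outer rows"
early exit is only safe when a plain nested loop remains a legal
alternative; if the mask permits only the memoized form, the memoize
path must still be built.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bitmask values in the style of the pgs_mask
 * infrastructure; only PGS_NESTLOOP_PLAIN is named in the discussion
 * above, and the values here are invented. */
#define PGS_NESTLOOP_PLAIN      0x01
#define PGS_NESTLOOP_MEMOIZE    0x02

/* Sketch of the guarded early exit: skip building a memoize path on
 * the "too few outer rows" cost heuristic only when a plain nested
 * loop is still allowed. Otherwise memoize may be the only way to
 * produce the requested join, so we must build the path anyway. */
static bool
can_skip_memoize(double outer_rows, unsigned pgs_mask)
{
    return outer_rows < 2 && (pgs_mask & PGS_NESTLOOP_PLAIN) != 0;
}
```

With both bits set, the old heuristic applies; with only the memoize
bit set, the early exit is suppressed, which is the case the
NESTED_LOOP_MEMOIZE(nt) advice exercised.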
schnauzer's got a failure that looks like this:

+WARNING:  supplied plan advice was not enforced
+DETAIL:  advice SEQ_SCAN(pg_trigger@exists_1) feedback is "matched, failed"
+advice NO_GATHER(pg_trigger@exists_1) feedback is "matched, failed"

And an older run on skink has a failure that looks like this:

+WARNING:  supplied plan advice was not enforced
+DETAIL:  advice SEQ_SCAN(pg_trigger@exists_6) feedback is "matched, failed"
+advice NO_GATHER(pg_trigger@exists_6) feedback is "matched, failed"

What these failures have in common is that both involve selecting from
information_schema.views. What "matched, failed" means is that we saw
the advice target during planning and tried to influence the plan, but
then observed that the final plan doesn't respect the supplied advice.
The query for information_schema.views involves three subplans, so the
fact that exists_6 is mentioned here is a strong hint that the
AlternativeSubPlan machinery is in play. The query is planned once,
and one of the two alternative subplans is chosen, generating advice
for that plan. Upon replanning, the other alternative is chosen, so
the subplan for which we have plan advice doesn't appear in the query
at all, leading to this failure.

Now, you might wonder how that's possible, considering that we're
planning the same query twice in a row with advice that isn't supposed
to change anything. My guess is that it's possible because these
machines are slow and other tests are running concurrently. If those
other sessions execute DDL, they can send sinval messages, which can
cause the second planning cycle to see different statistics than the
first one. That means the plan can change in any way except for what
the advice system already knows how to control, and the choice of
AlternativeSubPlan is not in that set of things.
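To illustrate how a statistics change between the two planning cycles
can flip the choice, here is a toy cost model, invented for this
sketch and not PostgreSQL's actual AlternativeSubPlan resolution code:
one alternative pays a high one-time cost but is cheap per execution,
the other the reverse, so the winner depends on the expected number of
executions, which is exactly what a changed row estimate perturbs.

```c
#include <assert.h>

/* Toy model of two alternative subplans, loosely in the spirit of
 * PostgreSQL's AlternativeSubPlan mechanism; the cost structure and
 * function names here are invented for illustration. */
typedef struct AltSubPlanCosts
{
    double      startup_cost;   /* paid once */
    double      per_exec_cost;  /* paid per expected execution */
} AltSubPlanCosts;

/* Return 0 to pick alternative a, 1 to pick alternative b, comparing
 * total cost at the expected execution count. If replanning sees a
 * different num_execs (e.g. because sinval-driven statistics updates
 * changed a row estimate), the winner can flip, and advice recorded
 * against the losing subplan then never matches the new plan. */
static int
choose_alternative(AltSubPlanCosts a, AltSubPlanCosts b, double num_execs)
{
    double      cost_a = a.startup_cost + a.per_exec_cost * num_execs;
    double      cost_b = b.startup_cost + b.per_exec_cost * num_execs;

    return (cost_b < cost_a) ? 1 : 0;
}
```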
I think I actually saw a failure similar to this once or twice locally
during development, but that was back when the code had a lot of bugs,
and I assumed that the failure was caused by some transient bug in
whatever changes I was hacking on at the time, or by some other bug
that I fixed later, rather than being a real issue. I think the reason
it doesn't happen very often is that the statistics have to change
enough, at just the right moment, and even on slower buildfarm
machines, most of the time, they don't.

It's not really clear to me exactly where to place the blame for this
category of failure. One view is that the tests are being run in
parallel and I didn't think about that, and therefore this is a defect
in the test methodology that needs to be rectified somehow (hopefully
not by throwing the test out). We might be able to fix that by running
the test suite serially rather than in parallel, although since you
(Tom) are already unhappy with the time this takes, I expect that will
go over like a lead balloon. Another angle is to blame the decision to
assign different plan names to the different subroots that we use for
the alternative subplans. If we used exists_1, exists_2, and exists_3
twice each instead of exists_1..exists_6, this wouldn't happen, though
that idea seems questionable on theoretical grounds. A third possible
take is that not including the choice of alternative subplan in the
initial scope was an error.

As of this moment, I'm not really sure which way to jump; I need to
think further about what to do about this one. We can continue the
discussion about reducing the cost at the same time. Again, I am
definitely not saying that it isn't legitimate to be concerned about
the CPU cycles expended running these tests, but those CPU cycles have
found three separate problems in two days, which is not nothing.
Separately, I am now 100% convinced that I need to go revise the
pg_collect_advice patch, because that adds yet another run of the core
regression tests, but with much less possibility of any real gain.
I'll go get rid of that and figure out what, if anything, to replace
it with.

--
Robert Haas
EDB: http://www.enterprisedb.com
v1-0001-get_memoize_path-Don-t-exit-quickly-when-PGS_NEST.nocfbot
Description: Binary data
