Hi,

On 2022-12-08 16:36:07 -0800, Andres Freund wrote:
> On 2022-12-08 16:15:11 -0800, Andres Freund wrote:
> > Unfortunately cfbot shows that that doesn't work entirely reliably.
> >
> > The most frequent case is postgres_fdw, which somewhat regularly fails with
> > a regression.diff like this:
> >
> > diff -U3 /tmp/cirrus-ci-build/contrib/postgres_fdw/expected/postgres_fdw.out /tmp/cirrus-ci-build/build/testrun/postgres_fdw-running/regress/results/postgres_fdw.out
> > --- /tmp/cirrus-ci-build/contrib/postgres_fdw/expected/postgres_fdw.out	2022-12-08 20:35:24.772888000 +0000
> > +++ /tmp/cirrus-ci-build/build/testrun/postgres_fdw-running/regress/results/postgres_fdw.out	2022-12-08 20:43:38.199450000 +0000
> > @@ -9911,8 +9911,7 @@
> >   WHERE application_name = 'fdw_retry_check';
> >   pg_terminate_backend
> >  ----------------------
> > - t
> > -(1 row)
> > +(0 rows)
> >
> >  -- This query should detect the broken connection when starting new remote
> >  -- transaction, reestablish new connection, and then succeed.
> >
> > I guess that a cache reset message arrives and leads to the connection
> > being terminated. Unfortunately that's hard to see right now, as the
> > relevant log messages are output with DEBUG3 - it's quite verbose, so
> > enabling it for all tests will be painful.
> >
> > I *think* I have seen this failure locally at least once, when running the
> > test normally.
> >
> >
> > I'll try to reproduce this locally for a bit. If I can't, the only other
> > idea I have for debugging this is to change log_min_messages in that
> > section of the postgres_fdw test to DEBUG3.
>
> Oh. I tried to reproduce the issue, without success so far, but eventually my
> test loop got stuck in something I reported previously and forgot about since:
> https://www.postgresql.org/message-id/20220925232237.p6uskba2dw6fnwj2%40awork3.anarazel.de
>
> I wonder if the timing on the freebsd CI task works out to hitting a "smaller
> version" of the problem that eventually resolves itself, which then leads to a
> sinval reset getting sent, causing the observable problem.

The issue referenced above is now fixed, and I haven't seen instances of it
since then. I also just now fixed the issue that often led to failing to
upload logs.
Unfortunately the fdw_retry_check issue from above has re-occurred since then:
https://cirrus-ci.com/task/4901157940756480
https://api.cirrus-ci.com/v1/artifact/task/4901157940756480/testrun/build/testrun/postgres_fdw-running/regress/regression.diffs

Another run hit an issue we've been fighting repeatedly on the buildfarm / CI:
https://cirrus-ci.com/task/5527490404286464
https://api.cirrus-ci.com/v1/artifact/task/5527490404286464/testrun/build/testrun/regress-running/regress/regression.diffs

diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/create_index.out /tmp/cirrus-ci-build/build/testrun/regress-running/regress/results/create_index.out
--- /tmp/cirrus-ci-build/src/test/regress/expected/create_index.out	2023-02-06 23:52:43.627604000 +0000
+++ /tmp/cirrus-ci-build/build/testrun/regress-running/regress/results/create_index.out	2023-02-07 00:03:04.535232000 +0000
@@ -1930,12 +1930,13 @@
 SELECT thousand, tenthous FROM tenk1
 WHERE thousand < 2 AND tenthous IN (1001,3000)
 ORDER BY thousand;
-                      QUERY PLAN
--------------------------------------------------------
- Index Only Scan using tenk1_thous_tenthous on tenk1
-   Index Cond: (thousand < 2)
-   Filter: (tenthous = ANY ('{1001,3000}'::integer[]))
-(3 rows)
+                                      QUERY PLAN
+--------------------------------------------------------------------------------------
+ Sort
+   Sort Key: thousand
+   ->  Index Only Scan using tenk1_thous_tenthous on tenk1
+         Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
+(4 rows)
 
 SELECT thousand, tenthous FROM tenk1
 WHERE thousand < 2 AND tenthous IN (1001,3000)

I'd be tempted to disable the test, but it also found genuine issues in a
bunch of CF entries, and all these test instabilities seem like ones we'd
also see, with a lower occurrence, on the buildfarm.

WRT the fdw_retry_check: I wonder if we should increase the log level of
a) pgfdw_inval_callback deciding to disconnect
b) ReceiveSharedInvalidMessages() deciding to reset
to DEBUG1, at least temporarily?
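If we went that route, it should just be a matter of bumping the elog() level
at those two call sites. A rough sketch, from memory - the exact messages and
current levels in contrib/postgres_fdw/connection.c and
src/backend/storage/ipc/sinval.c may differ:

--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ pgfdw_inval_callback
-				elog(DEBUG3, "discarding connection %p", entry->conn);
+				elog(DEBUG1, "discarding connection %p", entry->conn);

--- a/src/backend/storage/ipc/sinval.c
+++ b/src/backend/storage/ipc/sinval.c
@@ ReceiveSharedInvalidMessages
-			elog(DEBUG4, "cache state reset");
+			elog(DEBUG1, "cache state reset");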
Alternatively we could add a
  SET log_min_messages=debug4;
  ...
  RESET log_min_messages;
around the postgres_fdw connection re-establishment test?

One thing nudging me towards the more global approach is that I have the
vague feeling that there's a wider issue around hitting more sinval resets
than we should - and right now it would be very hard to know about that.

Greetings,

Andres Freund