Hi,

On 2022-12-08 16:36:07 -0800, Andres Freund wrote:
> On 2022-12-08 16:15:11 -0800, Andres Freund wrote:
> > Unfortunately cfbot shows that that doesn't work entirely reliably.
> >
> > The most frequent case is postgres_fdw, which somewhat regularly fails with
> > a regression.diff like this:
> >
> > diff -U3 /tmp/cirrus-ci-build/contrib/postgres_fdw/expected/postgres_fdw.out /tmp/cirrus-ci-build/build/testrun/postgres_fdw-running/regress/results/postgres_fdw.out
> > --- /tmp/cirrus-ci-build/contrib/postgres_fdw/expected/postgres_fdw.out	2022-12-08 20:35:24.772888000 +0000
> > +++ /tmp/cirrus-ci-build/build/testrun/postgres_fdw-running/regress/results/postgres_fdw.out	2022-12-08 20:43:38.199450000 +0000
> > @@ -9911,8 +9911,7 @@
> >   WHERE application_name = 'fdw_retry_check';
> >   pg_terminate_backend
> >  ----------------------
> > - t
> > -(1 row)
> > +(0 rows)
> >
> >  -- This query should detect the broken connection when starting new remote
> >  -- transaction, reestablish new connection, and then succeed.
> >
> > I guess that a cache reset message arrives and leads to the connection
> > being terminated. Unfortunately that's hard to see right now, as the
> > relevant log messages are output with DEBUG3 - it's quite verbose, so
> > enabling it for all tests will be painful.
> >
> > I *think* I have seen this failure locally at least once, when running the
> > test normally.
> >
> >
> > I'll try to reproduce this locally for a bit. If I can't, the only other
> > idea I have for debugging this is to change log_min_messages in that
> > section of the postgres_fdw test to DEBUG3.
>
> Oh. I tried to reproduce the issue, without success so far, but eventually my
> test loop got stuck in something I reported previously and forgot about since:
> https://www.postgresql.org/message-id/20220925232237.p6uskba2dw6fnwj2%40awork3.anarazel.de
>
> I wonder if the timing on the freebsd CI task works out to hitting a "smaller
> version" of the problem that eventually resolves itself, which then leads to a
> sinval reset getting sent, causing the observable problem.

The issue referenced above is now fixed, and I haven't seen instances of it
since then. I also just now fixed the issue that often led to failing to
upload logs.
Unfortunately the fdw_retry_check issue from above has re-occurred since then:
https://cirrus-ci.com/task/4901157940756480
https://api.cirrus-ci.com/v1/artifact/task/4901157940756480/testrun/build/testrun/postgres_fdw-running/regress/regression.diffs

Another run hit an issue we've been fighting repeatedly on the buildfarm / CI:
https://cirrus-ci.com/task/5527490404286464
https://api.cirrus-ci.com/v1/artifact/task/5527490404286464/testrun/build/testrun/regress-running/regress/regression.diffs

diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/create_index.out /tmp/cirrus-ci-build/build/testrun/regress-running/regress/results/create_index.out
--- /tmp/cirrus-ci-build/src/test/regress/expected/create_index.out	2023-02-06 23:52:43.627604000 +0000
+++ /tmp/cirrus-ci-build/build/testrun/regress-running/regress/results/create_index.out	2023-02-07 00:03:04.535232000 +0000
@@ -1930,12 +1930,13 @@
 SELECT thousand, tenthous FROM tenk1
 WHERE thousand < 2 AND tenthous IN (1001,3000)
 ORDER BY thousand;
-                      QUERY PLAN
--------------------------------------------------------
- Index Only Scan using tenk1_thous_tenthous on tenk1
-   Index Cond: (thousand < 2)
-   Filter: (tenthous = ANY ('{1001,3000}'::integer[]))
-(3 rows)
+                                      QUERY PLAN
+--------------------------------------------------------------------------------------
+ Sort
+   Sort Key: thousand
+   ->  Index Only Scan using tenk1_thous_tenthous on tenk1
+         Index Cond: ((thousand < 2) AND (tenthous = ANY ('{1001,3000}'::integer[])))
+(4 rows)
 
 SELECT thousand, tenthous FROM tenk1
 WHERE thousand < 2 AND tenthous IN (1001,3000)

I'd be tempted to disable the test, but it also found genuine issues in a
bunch of CF entries, and all these test instabilities seem like ones we'd
also see, with a lower occurrence, on the buildfarm.

WRT the fdw_retry_check: I wonder if we should increase the log level of
a) pgfdw_inval_callback deciding to disconnect
b) ReceiveSharedInvalidMessages() deciding to reset
to DEBUG1, at least temporarily?
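If we went that route, it should just be a matter of bumping the elog() level
at those two call sites. A rough sketch, from memory - the exact messages and
current levels in contrib/postgres_fdw/connection.c and
src/backend/storage/ipc/sinval.c may differ:

--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ pgfdw_inval_callback
-				elog(DEBUG3, "discarding connection %p", entry->conn);
+				elog(DEBUG1, "discarding connection %p", entry->conn);

--- a/src/backend/storage/ipc/sinval.c
+++ b/src/backend/storage/ipc/sinval.c
@@ ReceiveSharedInvalidMessages
-			elog(DEBUG4, "cache state reset");
+			elog(DEBUG1, "cache state reset");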
Alternatively we could add a
  SET log_min_messages=debug4;
  ...
  RESET log_min_messages;
around the postgres_fdw connection re-establishment test?

One thing nudging me towards the more global approach is that I have the
vague feeling that there's a wider issue around hitting more sinval resets
than we should - and right now it would be very hard to know about that.

Greetings,

Andres Freund