On 1/27/25 07:47, Peter Eisentraut wrote:
> On 24.01.25 03:55, Tom Lane wrote:
>> I've now run an exhaustive search through the last three months of
>> buildfarm runs, and found just one additional instance of the same
>> failure. The three matches are
>>
>>
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2025-01-22%2005%3A49%3A08
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=indri&dt=2025-01-22%2001%3A29%3A35
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2025-01-22%2001%3A17%3A14
>>
>> Since those are all post-1772d554b0, it's difficult to avoid the
>> conclusion that that either introduced the error or allowed a
>> pre-existing problem to become visible.
>
> I found a few more in the bowels of various Cirrus CI jobs:
>
> https://cirrus-ci.com/task/5125479033733120
> -> https://api.cirrus-ci.com/v1/artifact/task/5125479033733120/testrun/build/testrun/regress/regress/regression.diffs
>
> https://cirrus-ci.com/task/4562529080311808
> -> https://api.cirrus-ci.com/v1/artifact/task/4562529080311808/log/src/test/recovery/tmp_check/regression.diffs
>
> https://cirrus-ci.com/task/5985049025183744
> -> https://api.cirrus-ci.com/v1/artifact/task/5985049025183744/log/src/bin/pg_upgrade/tmp_check/regression.diffs
>
> https://cirrus-ci.com/task/4702566639992832
> -> https://api.cirrus-ci.com/v1/artifact/task/4702566639992832/log/src/test/recovery/tmp_check/regression.diffs
Thanks to you both for finding some more examples! This answers one question I've been wondering
about: Why do we get this failure for range types . . .:
```
diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/without_overlaps.out /tmp/cirrus-ci-build/src/test/recovery/tmp_check/results/without_overlaps.out
--- /tmp/cirrus-ci-build/src/test/regress/expected/without_overlaps.out	2025-01-25 03:14:11.906404866 +0000
+++ /tmp/cirrus-ci-build/src/test/recovery/tmp_check/results/without_overlaps.out	2025-01-25 03:21:08.218167913 +0000
@@ -1792,8 +1792,6 @@
   SET valid_at = CASE WHEN lower(valid_at) = '2018-01-01' THEN daterange('2018-01-01', '2018-01-05')
                       WHEN lower(valid_at) = '2018-02-01' THEN daterange('2018-01-05', '2018-03-01') END
   WHERE id = '[6,7)';
-ERROR:  update or delete on table "temporal_rng" violates RESTRICT setting of foreign key constraint "temporal_fk_rng2rng_fk" on table "temporal_fk_rng2rng"
-DETAIL:  Key (id, valid_at)=([6,7), [2018-01-01,2018-02-01)) is referenced from table "temporal_fk_rng2rng".
 -- a PK update that fails because both are referenced (even before commit):
 BEGIN;
 ALTER TABLE temporal_fk_rng2rng
```
. . . but never a failure later in the file for the same scenario with multiranges? But Peter's
links [1] now include an example of that too:
```
diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/without_overlaps.out /tmp/cirrus-ci-build/src/bin/pg_upgrade/tmp_check/results/without_overlaps.out
--- /tmp/cirrus-ci-build/src/test/regress/expected/without_overlaps.out	2025-01-25 04:31:18.353128126 +0000
+++ /tmp/cirrus-ci-build/src/bin/pg_upgrade/tmp_check/results/without_overlaps.out	2025-01-25 04:35:22.020363327 +0000
@@ -2311,8 +2311,6 @@
   SET valid_at = CASE WHEN lower(valid_at) = '2018-01-01' THEN datemultirange(daterange('2018-01-01', '2018-01-05'))
                       WHEN lower(valid_at) = '2018-02-01' THEN datemultirange(daterange('2018-01-05', '2018-03-01')) END
   WHERE id = '[6,7)';
-ERROR:  update or delete on table "temporal_mltrng" violates RESTRICT setting of foreign key constraint "temporal_fk_mltrng2mltrng_fk" on table "temporal_fk_mltrng2mltrng"
-DETAIL:  Key (id, valid_at)=([6,7), {[2018-01-01,2018-02-01)}) is referenced from table "temporal_fk_mltrng2mltrng".
 -- a PK update that fails because both are referenced (even before commit):
 BEGIN;
 ALTER TABLE temporal_fk_mltrng2mltrng
```
So that is a relief. Still, it's interesting that the ratio is 6:1.
I've attached a patch that causes both failures to appear every time (v48.0). It shows that if the RESTRICT constraint accidentally loaded the cached query plan from the most recently cached NO ACTION constraint (which we test just before testing RESTRICT), we would get exactly these failures. So some kind of OID conflict could cause that.
On 1/27/25 07:56, Peter Eisentraut wrote:
> I think this call in ri_restrict()
>
> ri_BuildQueryKey(&qkey, riinfo, RI_PLAN_RESTRICT);
>
> needs to use a different third argument for NO ACTION vs. RESTRICT, since we are now sometimes using different queries for them.
>
> However, the RI_QueryKey also uses the constraint OID as part of the hash key, so even this mistake would not trigger any bad effect unless we also have OID collisions?
That is my take too. I haven't worked out how an OID collision could happen, though. Since we run the tests in parallel, could the other tests generate enough OIDs to wrap the counter around? Surely not. And I confirmed that dynahash does a memcmp on the whole 64 bits of key->constr_id + key->constr_queryno, so landing in the same hash bucket shouldn't be a problem (though I haven't tested that, e.g. by using a debugger to manipulate the hash result). So I don't have a plausible explanation here.
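For reference, the cache key is just those two fields. Here is a paraphrase of the struct in ri_triggers.c (approximate, with my own comments; the Oid/int32 typedefs are spelled out only to keep the sketch self-contained):
```
/* Stand-in typedefs mirroring postgres_ext.h / c.h, just for this sketch: */
typedef unsigned int Oid;
typedef int int32;

/*
 * Approximate paraphrase of RI_QueryKey in ri_triggers.c.  The prepared-plan
 * hash table is keyed on this whole 64-bit struct, and dynahash's default
 * match function memcmps all of it, so two constraints can only share an
 * entry if both fields are equal.
 */
typedef struct RI_QueryKey
{
	Oid			constr_id;		/* OID of the pg_constraint entry */
	int32		constr_queryno;	/* query type, one of the RI_PLAN_* numbers */
} RI_QueryKey;
```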
I thought about introducing a new RI_PLAN_NO_ACTION constant back when I wrote the patch. It shouldn't be needed, but it would be reassuring to include it, especially since the generated SQL now differs between NO ACTION and RESTRICT. (On the other hand, the generated SQL also depends on the PK/FK attributes and their comparison operators, and we don't include *those* in the cache key.)
Is it possible to commit an RI_PLAN_NO_ACTION addition and see if that makes the buildfarm failures go away? Here is a proposed patch for that (v48.1). I would understand if this is too questionable a practice, but it would be nice to get enough test exposure to see whether it makes a difference. Since I still haven't reproduced this locally (despite running continuously for almost a week), it's not an experiment I can do myself. If it *does* make the failures go away, then that suggests there is still some latent problem somewhere.
I took a look through the dynahash code as well as GetNewOidWithIndex/GetNewObjectId. Neither of these really seems like a likely place to find a bug, considering how mature and heavily used they are. The hash table isn't even shared between backends. The way we keep a nextOid in shared memory and keep incrementing it until we find a gap is maybe interesting. Since we drop & create a constraint right before the failing test, I guess it's possible to cycle around and get the same OID as the dropped constraint. I don't really buy it though. We would have to give the NO ACTION constraint a very low OID (so there aren't lower-numbered gaps produced by the same test file), drop it, somehow consume 2^32 OIDs (in the other tests from the parallel group), and then land back on the low-numbered open OID. Also, in the failure I checked I don't see any log messages about "new OID has been assigned in relation ... after ... retries". I guess you could hit the right OID without that, but it seems hard.
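To put a rough number on "consume 2^32 OIDs", here is a tiny stand-alone model. It is emphatically not the real GetNewObjectId() in varsup.c; the only PostgreSQL detail it borrows is that after wraparound the counter restarts at FirstNormalObjectId (16384), and the dropped constraint's OID of 40000 is made up:
```
#include <stdint.h>
#include <stdio.h>

/* Hypothetical model of the 32-bit OID counter, NOT the real varsup.c code. */
#define FIRST_NORMAL_OBJECT_ID 16384u

int
main(void)
{
	uint32_t	dropped_oid = 40000;	/* made-up OID of the dropped constraint */

	/* OIDs left before the counter wraps around 2^32 ... */
	uint64_t	to_wrap = (uint64_t) UINT32_MAX - dropped_oid + 1;
	/* ... plus the climb from FirstNormalObjectId back up to the dropped OID */
	uint64_t	after_wrap = dropped_oid - FIRST_NORMAL_OBJECT_ID;

	printf("OIDs to consume before reuse: %llu\n",
		   (unsigned long long) (to_wrap + after_wrap));	/* ~4.29 billion */
	return 0;
}
```
So the rest of the parallel group would have to burn roughly four billion OIDs between dropping the constraint and the failing test, which is why I don't really buy this explanation.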
That's it so far. Adding v48.1 would at least give us some more evidence about where to look for
problems. In the meantime I'll keep searching for a way to reproduce it!
[1] https://cirrus-ci.com/task/5985049025183744 ->
https://api.cirrus-ci.com/v1/artifact/task/5985049025183744/log/src/bin/pg_upgrade/tmp_check/regression.diffs
Yours,
--
Paul ~{:-)
p...@illuminatedcomputing.com
From b8c3bfb8617066c924f7b74f7cc5e6ad08cd4261 Mon Sep 17 00:00:00 2001
From: "Paul A. Jungwirth" <p...@illuminatedcomputing.com>
Date: Tue, 28 Jan 2025 20:11:31 -0800
Subject: [PATCH v0] Trigger the RESTRICT build farm failures consistently
---
src/backend/utils/adt/ri_triggers.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/src/backend/utils/adt/ri_triggers.c b/src/backend/utils/adt/ri_triggers.c
index 3d9985b17c2..8ebdafc9913 100644
--- a/src/backend/utils/adt/ri_triggers.c
+++ b/src/backend/utils/adt/ri_triggers.c
@@ -701,6 +701,8 @@ RI_FKey_restrict_upd(PG_FUNCTION_ARGS)
return ri_restrict((TriggerData *) fcinfo->context, false);
}
+static Oid lastCachedConstraint = InvalidOid;
+
/*
* ri_restrict -
*
@@ -753,6 +755,8 @@ ri_restrict(TriggerData *trigdata, bool is_no_action)
* query for delete and update cases)
*/
ri_BuildQueryKey(&qkey, riinfo, RI_PLAN_RESTRICT);
+ if (!is_no_action)
+ qkey.constr_id = lastCachedConstraint;
if ((qplan = ri_FetchPreparedPlan(&qkey)) == NULL)
{
@@ -884,6 +888,7 @@ ri_restrict(TriggerData *trigdata, bool is_no_action)
/* Prepare and save the plan */
qplan = ri_PlanCheck(querybuf.data, riinfo->nkeys, queryoids,
&qkey, fk_rel, pk_rel);
+ lastCachedConstraint = riinfo->constraint_id;
}
/*
--
2.42.0
From cdaa86c96ad0ede47c6ea01d6a961090741d7fbf Mon Sep 17 00:00:00 2001
From: "Paul A. Jungwirth" <p...@illuminatedcomputing.com>
Date: Tue, 28 Jan 2025 20:23:22 -0800
Subject: [PATCH v48.1] Cache NO ACTION foreign keys separately from RESTRICT
foreign keys
Now that we generate different SQL for temporal NO ACTION vs RESTRICT
foreign keys, we should cache their query plans with different keys.
Since the key also includes the constraint OID, this shouldn't be
necessary, but we have been seeing buildfarm failures that suggest we
might sometimes be using a cached NO ACTION plan to implement a
RESTRICT constraint.
---
src/backend/utils/adt/ri_triggers.c | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)
diff --git a/src/backend/utils/adt/ri_triggers.c b/src/backend/utils/adt/ri_triggers.c
index 3d9985b17c2..8473448849c 100644
--- a/src/backend/utils/adt/ri_triggers.c
+++ b/src/backend/utils/adt/ri_triggers.c
@@ -74,12 +74,13 @@
/* these queries are executed against the FK (referencing) table: */
#define RI_PLAN_CASCADE_ONDELETE 3
#define RI_PLAN_CASCADE_ONUPDATE 4
+#define RI_PLAN_NO_ACTION 5
/* For RESTRICT, the same plan can be used for both ON DELETE and ON UPDATE triggers. */
-#define RI_PLAN_RESTRICT 5
-#define RI_PLAN_SETNULL_ONDELETE 6
-#define RI_PLAN_SETNULL_ONUPDATE 7
-#define RI_PLAN_SETDEFAULT_ONDELETE 8
-#define RI_PLAN_SETDEFAULT_ONUPDATE 9
+#define RI_PLAN_RESTRICT 6
+#define RI_PLAN_SETNULL_ONDELETE 7
+#define RI_PLAN_SETNULL_ONUPDATE 8
+#define RI_PLAN_SETDEFAULT_ONDELETE 9
+#define RI_PLAN_SETDEFAULT_ONUPDATE 10
#define MAX_QUOTED_NAME_LEN (NAMEDATALEN*2+3)
#define MAX_QUOTED_REL_NAME_LEN (MAX_QUOTED_NAME_LEN*2)
@@ -752,7 +753,7 @@ ri_restrict(TriggerData *trigdata, bool is_no_action)
* Fetch or prepare a saved plan for the restrict lookup (it's the same
* query for delete and update cases)
*/
- ri_BuildQueryKey(&qkey, riinfo, RI_PLAN_RESTRICT);
+ ri_BuildQueryKey(&qkey, riinfo, is_no_action ? RI_PLAN_NO_ACTION : RI_PLAN_RESTRICT);
if ((qplan = ri_FetchPreparedPlan(&qkey)) == NULL)
{
--
2.42.0