On 1/27/25 07:47, Peter Eisentraut wrote:
> On 24.01.25 03:55, Tom Lane wrote:
>> I've now run an exhaustive search through the last three months of
>> buildfarm runs, and found just one additional instance of the same
>> failure. The three matches are
>>
>>
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2025-01-22%2005%3A49%3A08
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=indri&dt=2025-01-22%2001%3A29%3A35
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2025-01-22%2001%3A17%3A14
>>
>> Since those are all post-1772d554b0, it's difficult to avoid the
>> conclusion that that either introduced the error or allowed a
>> pre-existing problem to become visible.
>
> I found a few more in the bowels of various Cirrus CI jobs:
>
> https://cirrus-ci.com/task/5125479033733120
> -> https://api.cirrus-ci.com/v1/artifact/task/5125479033733120/testrun/build/testrun/regress/regress/regression.diffs
>
> https://cirrus-ci.com/task/4562529080311808
> -> https://api.cirrus-ci.com/v1/artifact/task/4562529080311808/log/src/test/recovery/tmp_check/regression.diffs
>
> https://cirrus-ci.com/task/5985049025183744
> -> https://api.cirrus-ci.com/v1/artifact/task/5985049025183744/log/src/bin/pg_upgrade/tmp_check/regression.diffs
>
> https://cirrus-ci.com/task/4702566639992832
> -> https://api.cirrus-ci.com/v1/artifact/task/4702566639992832/log/src/test/recovery/tmp_check/regression.diffs
Thanks to you both for finding some more examples! This answers one question I've been wondering
about: Why do we get this failure for range types . . .:
```
diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/without_overlaps.out /tmp/cirrus-ci-build/src/test/recovery/tmp_check/results/without_overlaps.out
--- /tmp/cirrus-ci-build/src/test/regress/expected/without_overlaps.out	2025-01-25 03:14:11.906404866 +0000
+++ /tmp/cirrus-ci-build/src/test/recovery/tmp_check/results/without_overlaps.out	2025-01-25 03:21:08.218167913 +0000
@@ -1792,8 +1792,6 @@
   SET valid_at = CASE WHEN lower(valid_at) = '2018-01-01' THEN daterange('2018-01-01', '2018-01-05')
                       WHEN lower(valid_at) = '2018-02-01' THEN daterange('2018-01-05', '2018-03-01') END
   WHERE id = '[6,7)';
-ERROR:  update or delete on table "temporal_rng" violates RESTRICT setting of foreign key constraint "temporal_fk_rng2rng_fk" on table "temporal_fk_rng2rng"
-DETAIL:  Key (id, valid_at)=([6,7), [2018-01-01,2018-02-01)) is referenced from table "temporal_fk_rng2rng".
 -- a PK update that fails because both are referenced (even before commit):
 BEGIN;
 ALTER TABLE temporal_fk_rng2rng
```
. . . but never a failure later in the file for the same scenario with multiranges? But Peter's
links [1] now include an example of that too:
```
diff -U3 /tmp/cirrus-ci-build/src/test/regress/expected/without_overlaps.out /tmp/cirrus-ci-build/src/bin/pg_upgrade/tmp_check/results/without_overlaps.out
--- /tmp/cirrus-ci-build/src/test/regress/expected/without_overlaps.out	2025-01-25 04:31:18.353128126 +0000
+++ /tmp/cirrus-ci-build/src/bin/pg_upgrade/tmp_check/results/without_overlaps.out	2025-01-25 04:35:22.020363327 +0000
@@ -2311,8 +2311,6 @@
   SET valid_at = CASE WHEN lower(valid_at) = '2018-01-01' THEN datemultirange(daterange('2018-01-01', '2018-01-05'))
                       WHEN lower(valid_at) = '2018-02-01' THEN datemultirange(daterange('2018-01-05', '2018-03-01')) END
   WHERE id = '[6,7)';
-ERROR:  update or delete on table "temporal_mltrng" violates RESTRICT setting of foreign key constraint "temporal_fk_mltrng2mltrng_fk" on table "temporal_fk_mltrng2mltrng"
-DETAIL:  Key (id, valid_at)=([6,7), {[2018-01-01,2018-02-01)}) is referenced from table "temporal_fk_mltrng2mltrng".
 -- a PK update that fails because both are referenced (even before commit):
 BEGIN;
 ALTER TABLE temporal_fk_mltrng2mltrng
```
So that is a relief. Still, it's interesting that the ratio is 6:1.
I've attached a patch that causes both failures to appear every time (v48.0). It shows that if the RESTRICT constraint accidentally loaded the cached query plan from the most recently cached NO ACTION constraint (which we test just before testing RESTRICT), we would get exactly these failures. So some kind of OID conflict could cause that.
On 1/27/25 07:56, Peter Eisentraut wrote:
> I think this call in ri_restrict()
>
> ri_BuildQueryKey(&qkey, riinfo, RI_PLAN_RESTRICT);
>
> needs to use a different third argument for NO ACTION vs. RESTRICT, since we are now sometimes using different queries for them.
>
> However, the RI_QueryKey also uses the constraint OID as part of the hash key, so even this mistake would not trigger any bad effect unless we also have OID collisions?
That is my take too. I haven't worked out how an OID collision could happen, though. Since we run the tests in parallel, could the other tests generate enough OIDs to wrap the counter around? Surely not. And I confirmed that dynahash does a memcmp on the whole 64 bits of key->constr_id + key->constr_queryno, so landing in the same hash bucket shouldn't be a problem (though I haven't tested that, e.g. by using a debugger to manipulate the hash result). So I don't have a plausible explanation here.
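For reference, the cache key is just those two fields. Here is a paraphrase of the struct in ri_triggers.c (approximate, with my own comments; the Oid/int32 typedefs are spelled out only to keep the sketch self-contained):
```
/* Stand-in typedefs mirroring postgres_ext.h / c.h, just for this sketch: */
typedef unsigned int Oid;
typedef int int32;

/*
 * Approximate paraphrase of RI_QueryKey in ri_triggers.c.  The prepared-plan
 * hash table is keyed on this whole 64-bit struct, and dynahash's default
 * match function memcmps all of it, so two constraints can only share an
 * entry if both fields are equal.
 */
typedef struct RI_QueryKey
{
	Oid			constr_id;		/* OID of the pg_constraint entry */
	int32		constr_queryno;	/* query type, one of the RI_PLAN_* numbers */
} RI_QueryKey;
```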
I thought about introducing a new RI_PLAN_NO_ACTION constant back when I wrote the patch. It shouldn't be needed, but it would be reassuring to include it, especially since the generated SQL now differs between NO ACTION and RESTRICT. (On the other hand, the generated SQL also depends on the PK/FK attributes and their comparison operators, and we don't include *those* in the cache key.)
Is it possible to commit an RI_PLAN_NO_ACTION addition and see if that makes the buildfarm failures go away? Here is a proposed patch for that (v48.1). I would understand if this is too questionable a practice, but it would be nice to get enough test exposure to see whether it makes a difference. Since I still haven't reproduced this locally (despite running continuously for almost a week), it's not an experiment I can do myself. If it *does* make the failures go away, then that suggests there is still some latent problem somewhere.
I took a look through the dynahash code as well as GetNewOidWithIndex/GetNewObjectId. Neither of these really seems like a likely place to find a bug, considering how mature and heavily used they are. The hash table isn't even shared between backends. The way we keep a nextOid in shared memory and keep incrementing it until we find a gap is maybe interesting. Since we drop & create a constraint right before the failing test, I guess it's possible to cycle around and get the same OID as the dropped constraint. I don't really buy it though. We would have to give the NO ACTION constraint a very low OID (so there aren't lower-numbered gaps produced by the same test file), drop it, somehow consume 2^32 OIDs (in the other tests from the parallel group), and then land back on the low-numbered open OID. Also, in the failure I checked I don't see any log messages about "new OID has been assigned in relation ... after ... retries". I guess you could hit the right OID without that, but it seems hard.
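To put a rough number on "consume 2^32 OIDs", here is a tiny stand-alone model. It is emphatically not the real GetNewObjectId() in varsup.c; the only PostgreSQL detail it borrows is that after wraparound the counter restarts at FirstNormalObjectId (16384), and the dropped constraint's OID of 40000 is made up:
```
#include <stdint.h>
#include <stdio.h>

/* Hypothetical model of the 32-bit OID counter, NOT the real varsup.c code. */
#define FIRST_NORMAL_OBJECT_ID 16384u

int
main(void)
{
	uint32_t	dropped_oid = 40000;	/* made-up OID of the dropped constraint */

	/* OIDs left before the counter wraps around 2^32 ... */
	uint64_t	to_wrap = (uint64_t) UINT32_MAX - dropped_oid + 1;
	/* ... plus the climb from FirstNormalObjectId back up to the dropped OID */
	uint64_t	after_wrap = dropped_oid - FIRST_NORMAL_OBJECT_ID;

	printf("OIDs to consume before reuse: %llu\n",
		   (unsigned long long) (to_wrap + after_wrap));	/* ~4.29 billion */
	return 0;
}
```
So the rest of the parallel group would have to burn roughly four billion OIDs between dropping the constraint and the failing test, which is why I don't really buy this explanation.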
That's it so far. Adding v48.1 would at least give us some more evidence about where to look for
problems. In the meantime I'll keep searching for a way to reproduce it!
[1] https://cirrus-ci.com/task/5985049025183744 ->
https://api.cirrus-ci.com/v1/artifact/task/5985049025183744/log/src/bin/pg_upgrade/tmp_check/regression.diffs
Yours,
--
Paul ~{:-)
p...@illuminatedcomputing.com
From b8c3bfb8617066c924f7b74f7cc5e6ad08cd4261 Mon Sep 17 00:00:00 2001
From: "Paul A. Jungwirth" <p...@illuminatedcomputing.com>
Date: Tue, 28 Jan 2025 20:11:31 -0800
Subject: [PATCH v0] Trigger the RESTRICT build farm failures consistently
---
src/backend/utils/adt/ri_triggers.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/src/backend/utils/adt/ri_triggers.c b/src/backend/utils/adt/ri_triggers.c
index 3d9985b17c2..8ebdafc9913 100644
--- a/src/backend/utils/adt/ri_triggers.c
+++ b/src/backend/utils/adt/ri_triggers.c
@@ -701,6 +701,8 @@ RI_FKey_restrict_upd(PG_FUNCTION_ARGS)
return ri_restrict((TriggerData *) fcinfo->context, false);
}
+static Oid lastCachedConstraint = InvalidOid;
+
/*
* ri_restrict -
*
@@ -753,6 +755,8 @@ ri_restrict(TriggerData *trigdata, bool is_no_action)
* query for delete and update cases)
*/
ri_BuildQueryKey(&qkey, riinfo, RI_PLAN_RESTRICT);
+ if (!is_no_action)
+ qkey.constr_id = lastCachedConstraint;
if ((qplan = ri_FetchPreparedPlan(&qkey)) == NULL)
{
@@ -884,6 +888,7 @@ ri_restrict(TriggerData *trigdata, bool is_no_action)
/* Prepare and save the plan */
qplan = ri_PlanCheck(querybuf.data, riinfo->nkeys, queryoids,
&qkey, fk_rel, pk_rel);
+ lastCachedConstraint = riinfo->constraint_id;
}
/*
--
2.42.0
From cdaa86c96ad0ede47c6ea01d6a961090741d7fbf Mon Sep 17 00:00:00 2001
From: "Paul A. Jungwirth" <p...@illuminatedcomputing.com>
Date: Tue, 28 Jan 2025 20:23:22 -0800
Subject: [PATCH v48.1] Cache NO ACTION foreign keys separately from RESTRICT
foreign keys
Now that we generate different SQL for temporal NO ACTION vs RESTRICT
foreign keys, we should cache their query plans with different keys.
Since the key also includes the constraint OID, this shouldn't be
necessary, but we have been seeing buildfarm failures that suggest we
might sometimes be using a cached NO ACTION plan to implement a
RESTRICT constraint.
---
src/backend/utils/adt/ri_triggers.c | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)
diff --git a/src/backend/utils/adt/ri_triggers.c b/src/backend/utils/adt/ri_triggers.c
index 3d9985b17c2..8473448849c 100644
--- a/src/backend/utils/adt/ri_triggers.c
+++ b/src/backend/utils/adt/ri_triggers.c
@@ -74,12 +74,13 @@
/* these queries are executed against the FK (referencing) table: */
#define RI_PLAN_CASCADE_ONDELETE 3
#define RI_PLAN_CASCADE_ONUPDATE 4
+#define RI_PLAN_NO_ACTION 5
/* For RESTRICT, the same plan can be used for both ON DELETE and ON UPDATE triggers. */
-#define RI_PLAN_RESTRICT 5
-#define RI_PLAN_SETNULL_ONDELETE 6
-#define RI_PLAN_SETNULL_ONUPDATE 7
-#define RI_PLAN_SETDEFAULT_ONDELETE 8
-#define RI_PLAN_SETDEFAULT_ONUPDATE 9
+#define RI_PLAN_RESTRICT 6
+#define RI_PLAN_SETNULL_ONDELETE 7
+#define RI_PLAN_SETNULL_ONUPDATE 8
+#define RI_PLAN_SETDEFAULT_ONDELETE 9
+#define RI_PLAN_SETDEFAULT_ONUPDATE 10
#define MAX_QUOTED_NAME_LEN (NAMEDATALEN*2+3)
#define MAX_QUOTED_REL_NAME_LEN (MAX_QUOTED_NAME_LEN*2)
@@ -752,7 +753,7 @@ ri_restrict(TriggerData *trigdata, bool is_no_action)
* Fetch or prepare a saved plan for the restrict lookup (it's the same
* query for delete and update cases)
*/
- ri_BuildQueryKey(&qkey, riinfo, RI_PLAN_RESTRICT);
+ ri_BuildQueryKey(&qkey, riinfo, is_no_action ? RI_PLAN_NO_ACTION : RI_PLAN_RESTRICT);
if ((qplan = ri_FetchPreparedPlan(&qkey)) == NULL)
{
--
2.42.0