On Thu, 29 Jan 2026 at 15:09, Adrian Klaver <[email protected]> wrote:
> Question, why is this not run in a single process?

Partly for separation of logic, but honestly mainly for historical
reasons. There's also the need for the first process to be free to get
on with other work while the second process does this job.

Yes, it could be threaded, but this is 35-year-old code, and while we
could rewrite it to use threads, that really shouldn't be required,
should it?

> Provide the code for the procedure(s) that create the table and send the
> ID to the other process.

I can't provide the entire code, because it's not open. It's using
libpq and our own wrapper around PQsendQuery that handles things like
auto-retries, auto-reconnection and grabbing all results to ensure the
connection is kept in a consistent state.

What we do

INSERT INTO sometable (somefield) VALUES ('string') RETURNING seqid

We use that seqid to create a uniquely-named temp table

CREATE UNLOGGED TABLE qtemp%d (scid INTEGER PRIMARY KEY DEFAULT 0 NOT NULL)

We then do (in a loop around a list)

INSERT INTO qtemp%d (scid) values (%d) ON CONFLICT DO NOTHING

with the seqid from the first query and the values from our list

Each of these queries is tested for success by our wrapper. We would
see a failure in the log if any part of this had failed.

None of these queries is inside an explicit transaction.

We then send that seqid value to the second process's socket, and that
process then runs a query that looks something like:

SELECT DISTINCT 0 AS gid, lid FROM galloc
INNER JOIN qtemp%d AS qt USING(sc_id)
WHERE EXISTS (SELECT 1 FROM stg WHERE s_id IN (993,996,994,995,997)
AND stg.party=0 AND stg.scid=qt.scid)
AND NOT EXISTS (SELECT 1 FROM gl WHERE gl.lid=galloc.lid AND gl.g_id=422)

We get the return error:

Primary: relation "qtempXXX" does not exist

(where XXX is the seqid)

The logging is verbose enough that I can see that the seqid used in
the CREATE and INSERT statements matches the value received by the
second process (and that the query it runs uses that same value).

Further, as I said previously, a key detail is that the receiving
process sometimes succeeds on retry: our wrapper retries the same
query string five times on a fatal error, just in case of transient
failures. Sometimes we get five fatal errors in the log, sometimes
only two (after which it then continues successfully).

Were it not for the retries succeeding I could accept that we might
somehow have failed to create the table (even though we are testing
for the success of the CREATE and the INSERTs), but it seems bizarre
to me that this could fail, then fail again, and then suddenly work.

There is no way the code could be dropping the table and then
recreating it after transmitting the seqid: there is no facility in
the code to drop the table. We do have a cron job that drops all
qtemp* tables overnight (at 23:00), but this error is occurring at
times nowhere near that (e.g. yesterday we saw this error at 13:28).

I'm fully prepared to accept that I'm doing something wrong, but I've
tried to reproduce the error using a couple of test PHP processes with
a socket between them and haven't managed to trigger the failure in
several million tries. I'm at a loss as to what to look for, so any
suggestions would be gratefully accepted.

Geoff
