Hi. We're getting unit-test failures that look like data-races,
which somehow are getting very frequent recently, tripping our CI.

Basically we have a `services` table, for service discovery, which
records the backend PID of its main DB Connection.  Such that we can
account for "stale" / "zombie" services, which are still registered
(e.g. crashed), but who's connection is gone, when querying that
`services` table, like so:

```sql
select s.name, s.backend_pid, ...
  from services s
  join pg_stat_activity a
    on a.pid = s.backend_pid
 where a.datname = current_database()
```

The unit-test code is "linear", i.e. single-threaded, with pseudo-code:

1) Open Connection outer
2) Open Connection inner
3) register service using inner (i.e. add row in `services` with
backend PID of inner)
4) Close Connection inner (calls PQfinish)
5) Run query above on outer. Sometimes still sees that "stale" row,
despite closing inner.

There's is no question about the ordering above.
i.e. PQfinish(inner) returned before the query is run.

there's of course the possibility of another (3rd) connection reusing
the same backend PID, but that's remote a chance I believe. And I
don't know of any other reliable ID for a connection, than its backend
PID.

So when and how fast does pg_stat_activity update, in the face of a
PQfinish? What other scheme to detect "table rows" associated to
"stale" connections?

Thanks for any insights, --DD


Reply via email to