On 11/23/19 8:34 AM, Tom Lane wrote:
In connection with a different issue, I wrote:

* The point of calling asyncQueueReadAllNotifications in
Exec_ListenPreCommit is to advance over already-committed queue entries
before we start sending any events to the client.  Without this, a
newly-listening client could be sent some very stale events.  (Note
that 51004c717 changed this from "somewhat stale" to "very stale".

It suddenly strikes me to worry that we have an XID wraparound hazard
for entries in the notify queue.  The odds of seeing a live problem
with that before 51004c717 were pretty minimal, but now that we don't
aggressively advance the queue tail, I think it's a very real risk for
low-notify-traffic installations.  In the worst case, imagine

* Somebody sends one NOTIFY, maybe just as a test.

* Nothing happens for a couple of weeks, during which the XID counter
advances by 2 billion or so.

* Newly-listening sessions will now think that that old event is
"in the future", hence fail to advance over it, resulting in denial
of service for new notify traffic.  There is no recourse short of
a server restart or waiting another couple weeks for wraparound.

Is it worth checking for this condition in autovacuum?  Even for
installations with autovacuum disabled, would the anti-wraparound
vacuums happen frequently enough to also advance the tail if modified
to test for this condition?

I thought about fixing this by tracking the queue's oldest XID in
the shared queue info, and forcing a tail advance when that got
too old --- but if nobody actively uses any listen or notify
features for awhile, no such code is going to execute, so the
above scenario could happen anyway.

The only bulletproof fix I can think of offhand is to widen the
queue entries to 64-bit XIDs, which is a tad annoying from a
space consumption standpoint.  Possibly we could compromise by
storing the high-order bits just once per SLRU page (and then
forcing an advance to a new page when those bits change).

A somewhat less bulletproof fix is to detect far-in-the-future queue
entries and discard them.  That would prevent the freezeup scenario,
but there'd be a residual hazard of transmitting ancient
notifications to clients (if a queue entry survived for 4G
transactions not just 2G).  But maybe that's OK, given the rather
tiny probabilities involved, and the low guarantees around notify
reliability in general.  It'd be a much more back-patchable answer
than a queue format change, too.

There shouldn't be too much reason to back-patch any of this, since
the change in 51004c717 only applies to v13 and onward.  Or do you
see the risk you described as "pretty minimal" as still being large
enough to outweigh the risk of anything we might back-patch?



--
Mark Dilger


Reply via email to