Robert Haas <robertmh...@gmail.com> writes: > On Tue, Oct 25, 2011 at 9:06 PM, Tom Lane <t...@sss.pgh.pa.us> wrote: >> Ordinarily, sending out sinval messages post-commit is okay because we >> don't release locks until after that, and we suppose that our locks >> prevent any other transactions from getting to the point of using >> syscache entries that might have been invalidated by our transaction. >> However, *we have carefully hacked on ANALYZE until it doesn't take any >> locks that would block concurrent queries on the analyzed table.* So >> the normal protection against stale syscache entries simply doesn't >> work for pg_statistic fetches.
> This is very similar to one of the issues that reared its ugly head in > regards to Simon's now-reverted patch to lower DDL locking strength. > You identified some other issues there as well, but *one* of the > issues was that, as in this case, the sinval mechanism fails to > provide the necessary synchronization guarantees unless the lock > required to reread the updated data conflicts with the lock required > to change the data. Right. We may take as little as AccessShareLock on a relation before examining its pg_statistic entries, and ANALYZE isn't taking anything that would block that. > So I'm wondering if we ought to rethink our position that users of the > sinval machinery must provide their own external synchronization > through heavyweight locking, and instead build the synchronization > into the sinval mechanism itself. Yeah, it's starting to feel like we need a basic redesign of sinval ... although I'd not care to back-patch that, so we also need to think of a sane solution for the back branches. > If we sent out sinval messages just before removing ourselves from the > ProcArray, I think that would more-or-less fix this bug (although > maybe I'm missing some reason why it's not practical to send them that > early) except that I don't see any way to handle the sinval-reset > case, which seems to more or less kill this idea in its tracks. The other reason that doesn't work is there's a race condition: someone might load their cache entry immediately after the sinval message went past, but before the updating transaction commits. > But maybe there's some other mechanism whereby we could combine > sending the sinval messages slightly earlier (before > ProcArrayEndTransaction) with blocking anyone who processes those > messages until after the committing backend finishes > ProcArrayEndTransaction. For example, you could add an additional > LWLock, which has to be held in exclusive mode by a committing > transaction that sends any sinval messages. Doesn't sound very scalable :-(. Even given your recent changes to reduce the overhead of checking for sinval messages, I'm not sure that it'd be practical to move the sinval message processing to just-before-we-look-up-a-cache-entry. Right now, we do AcceptInvalidationMessages basically once per table per query (or maybe it's twice or so, but anyway a very small multiplier on that). If we try to do it every time through SearchSysCache, we are probably talking two to three orders of magnitude more checks, which ISTM is certain to push the sinval queue back up to the top of the heap for contention. But in any case, this isn't the core of the problem. The real point here is that we need a guarantee that a syscache entry we're going to use is/was valid as of some suitable time point later than the start of our current transaction. (Once we have taken a snapshot, VACUUM will know that it can't remove any tuples that were deleted after the time of that snapshot; so even for SnapshotNow fetches, it's important to have an MVCC snapshot to protect toast-table dereferences.) Perhaps rather than tying the problem into SearchSysCache, we should attach the overhead to GetTransactionSnapshot, which is called appealingly few times per query. But right offhand it seems like that only protects us against the toast-tuple-deletion problem, not against the more general one of getting a stale view of the status of some relation. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers