Hi,

On 2018-08-29 12:56:07 -0400, Tom Lane wrote:
> I wrote:
> > * We now recursively enter ScanPgRelation, which (again) needs to do a
> > search using pg_class_oid_index, so it (again) opens and locks that.
> > BUT: LockRelationOid sees that *this process already has share lock on
> > pg_class_oid_index*, so it figures it can skip AcceptInvalidationMessages.
>
> BTW, I now have a theory for why we suddenly started seeing this problem
> in mid-June: commits a54e1f158 et al added a ScanPgRelation call where
> there had been none before (in RelationReloadNailed, for non-index rels).
> That didn't create the problem, but it probably increased the odds of
> seeing it happen.

Yea. Doesn't explain why it's only really visible on the BF in
11/master though :/

> Also ... isn't the last "relation->rd_isvalid = true" in
> RelationReloadNailed wrong? If it got cleared during ScanPgRelation,
> I do not think we want to believe that we got an up-to-date row.

I don't really think so - note how a normal relcache inval essentially
does the same. RelationClearRelation() first marks the entry as invalid,
then goes and builds a new entry that's *not* hooked into the hashtable
(and therefore doesn't receive new invals), and then moves the contents
over. That overwrites rd_isvalid to true, as that's guaranteed to be set
by RelationBuildDesc(). During the move no new invalidations are
accepted. So this really is just behaving equivalently.

The harder question is why that's safe. I think I convinced myself that
it is a couple times over the years, but I don't think we've properly
documented it. As the header says:
 * The following code contains many undocumented hacks. Please be
 * careful....

We definitely relied on RelationClearRelation() always returning a valid
record for a while, cf. RelationIdGetRelation()'s rd_isvalid assertion,
and the lack of a loop in that function.

(There's no coffee in this hotel at 4am. Shame.)

Ah, yes. This assumption is currently safe because the locking on
relations being looked up guarantees that there are no critical changes
to relcache entries while the entry is being rebuilt.

I think we'd also run into trouble with clobber cache recursively etc
without it.

Greetings,

Andres Freund
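
PS: For illustration, the mark-invalid / build-unhooked / move-over
sequence described above can be modelled roughly as below. This is a toy
sketch and not the actual relcache.c code: ToyRel, toy_hash,
toy_catalog_version and all toy_* functions are made-up names, and the
real RelationClearRelation() preserves more state during the swap than
this does. It only shows why rd_isvalid ends up true even if an
invalidation arrives while the new entry is being built.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a relcache entry; NOT the real RelationData. */
typedef struct ToyRel
{
	unsigned	oid;
	bool		isvalid;		/* models rd_isvalid */
	int			payload;		/* models the cached catalog contents */
} ToyRel;

#define TOY_NSLOTS 4
static ToyRel *toy_hash[TOY_NSLOTS];	/* models the relcache hashtable */
static int	toy_catalog_version = 1;	/* models the current catalog row */

/* Models inval delivery: only entries hooked into the hashtable see it. */
static void
toy_accept_invals(void)
{
	for (int i = 0; i < TOY_NSLOTS; i++)
		if (toy_hash[i] != NULL)
			toy_hash[i]->isvalid = false;
}

/* Models RelationBuildDesc(): fills a fresh entry and always marks it valid. */
static void
toy_build_desc(ToyRel *fresh, unsigned oid, bool inval_during_build)
{
	fresh->oid = oid;
	fresh->payload = toy_catalog_version;
	if (inval_during_build)
		toy_accept_invals();	/* hits hashed entries, never "fresh" */
	fresh->isvalid = true;
}

/* Models the rebuild path of RelationClearRelation(). */
static void
toy_clear_relation(ToyRel *rel, bool inval_during_build)
{
	ToyRel		fresh;

	rel->isvalid = false;		/* 1: mark the hashed entry invalid */
	toy_build_desc(&fresh, rel->oid, inval_during_build);	/* 2: build unhooked */
	*rel = fresh;				/* 3: move contents over; no invals accepted */
}

/* Models RelationIdGetRelation(): one rebuild attempt, no loop, then assert. */
static ToyRel *
toy_id_get_relation(unsigned oid)
{
	for (int i = 0; i < TOY_NSLOTS; i++)
	{
		ToyRel	   *rel = toy_hash[i];

		if (rel != NULL && rel->oid == oid)
		{
			if (!rel->isvalid)
				toy_clear_relation(rel, false);
			assert(rel->isvalid);
			return rel;
		}
	}
	return NULL;
}
```

In this model, calling toy_clear_relation(rel, true) leaves rel->isvalid
set even though an invalidation was delivered mid-build, which is the
same outcome as the final "rd_isvalid = true" in RelationReloadNailed.
Whether that outcome is safe depends on the locking argument above, which
the toy does not model.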