On Sun, Aug 12, 2018 at 9:05 AM, David Rowley <david.row...@2ndquadrant.com> wrote:
> This is not a fully baked idea, but I'm wondering if a better way to
> do this, instead of having this PartitionIsValid macro to determine if
> the partition should be visible to the current transaction ID, we
> could, when we invalidate a relcache entry, send along the transaction
> ID that it's invalid from. Other backends when they process the
> invalidation message they could wipe out the cache entry only if their
> xid is >= the invalidation's "xmax" value. Otherwise, just tag the
> xmax onto the cache somewhere and always check it before using the
> cache (perhaps make it part of the RelationIsValid macro).
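Just so we're talking about the same thing, I take the code-level shape of that proposal to be roughly the following sketch -- rd_invalid_from_xid and RelationIsVisibleToXid are names I've invented purely for illustration; nothing like them exists today:

/*
 * Hypothetical sketch of the quoted proposal: the relcache entry would
 * carry the XID from which a pending invalidation applies, and would be
 * treated as still usable by transactions whose XID precedes it.
 */
static inline bool
RelationIsVisibleToXid(Relation rel, TransactionId my_xid)
{
    /* No invalidation has been tagged onto this entry; use it as-is. */
    if (!TransactionIdIsValid(rel->rd_invalid_from_xid))
        return true;

    /* Only transactions at or after the tagged XID must rebuild. */
    return TransactionIdPrecedes(my_xid, rel->rd_invalid_from_xid);
}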
Transactions don't necessarily commit in XID order, so this might be an optimization to keep older transactions from having to do unnecessary rebuilds -- which I actually doubt is a major problem, but maybe I'm wrong -- but you can't rely solely on this as a way of deciding which transactions will see the effects of some change. If transactions 100, 101, and 102 begin in that order, and transaction 101 commits, there's no principled justification for 102 seeing its effects but 100 not seeing it.

> This would
> also require that we move away from SnapshotAny type catalogue scans
> in favour of MVCC scans so that backends populating their relcache
> build it based on their current xid.

I think this is a somewhat confused analysis. We don't use SnapshotAny for catalog scans, and we never have. We used to use SnapshotNow, and we now use a current MVCC snapshot. What you're talking about, I think, is possibly using the transaction snapshot rather than a current MVCC snapshot for the catalog scans.

I've thought about similar things, but I think there's a pretty deep can of worms. For instance, if you built a relcache entry using the transaction snapshot, you might end up building a seemingly-valid relcache entry for a relation that has been dropped or rewritten. When you try to access the relation data, you'll be attempting to access a relfilenode that's not there any more. Similarly, if you use an older snapshot to build a partition descriptor, you might think that relation OID 12345 is still a partition of that table when in fact it's been detached - and, maybe, altered in other ways, such as changing column types.

It seems to me that overall you're not really focusing on the right set of issues here. I think the very first thing we need to worry about is how we're going to keep the executor from following a bad pointer and crashing. Any time the partition descriptor changes, the next relcache rebuild is going to replace rd_partdesc and free the old one, but the executor may still have the old pointer cached in a structure or local variable; the next attempt to dereference it will be looking at freed memory, and kaboom. Right now, we prevent this by not allowing the partition descriptor to be modified while there are any queries running against the partition, so while there may be a rebuild, the old pointer will remain valid (cf. keep_partdesc). I think that whatever scheme you/we choose here should be tested with a combination of CLOBBER_CACHE_ALWAYS and multiple concurrent sessions -- one of them doing DDL on the table while the other runs a long query.

Once we keep it from blowing up, the second question is what the semantics are supposed to be. It seems inconceivable to me that the set of partitions that an in-progress query will scan can really be changed on the fly. I think we're going to have to rule that if you add or remove partitions while a query is running, we're going to scan exactly the set we had planned to scan at the beginning of the query; anything else would require on-the-fly plan surgery to a degree that seems unrealistic. That means that when a new partition is attached, already-running queries aren't going to scan it. If they did, we'd have big problems, because the transaction snapshot might see rows in those tables from an earlier time period during which that table wasn't attached. There's no guarantee that the data at that time conformed to the partition constraint, so it would be pretty problematic to let users see it.
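To be concrete about the crash scenario mentioned above, the pattern I'm worried about is roughly the following -- simplified, not actual executor code, and do_scan() is just a placeholder for whatever later dereferences the cached pointer:

static void
scan_all_partitions(Relation rel)
{
    /* Pointer into the relcache entry, fetched once up front. */
    PartitionDesc partdesc = RelationGetPartitionDesc(rel);
    int           i;

    for (i = 0; i < partdesc->nparts; i++)
    {
        /*
         * If a concurrent ATTACH/DETACH causes a relcache rebuild here
         * and the old rd_partdesc gets freed, partdesc now points at
         * freed memory and the next dereference reads garbage -- or
         * crashes outright.
         */
        do_scan(partdesc->oids[i]);
    }
}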
Conversely, when a partition is detached, there may still be scans from existing queries hitting it for a fairly arbitrary length of time afterwards. That may be surprising from a locking point of view or something, but it's correct as far as MVCC goes. Any changes made after the DETACH operation can't be visible to the snapshot being used for the scan.

Now, what we could try to change on the fly is the set of partitions that are used for tuple routing. For instance, suppose we're inserting a long stream of COPY data. At some point, we attach a new partition from another session. If we then encounter a row that doesn't route to any of the partitions that existed at the time the query started, we could - instead of immediately failing - go and reload the set of partitions that are available for tuple routing and see if the new partition which was concurrently added happens to be appropriate to the tuple we've got. If so, we could route the tuple to it. But all of this looks optional. If new partitions aren't available for insert/update tuple routing until the start of the next query, that's not a catastrophe. The reverse direction might be more problematic: if a partition is detached, I'm not sure how sensible it is to keep routing tuples into it. On the flip side, what would break, really?

Given the foregoing, I don't see why you need something like PartitionIsValid() at all, or why you need an algorithm similar to CREATE INDEX CONCURRENTLY. The problem seems mostly different. In the case of CREATE INDEX CONCURRENTLY, the issue is that any new tuples that get inserted while the index creation is in progress need to end up in the index, so you'd better not start building the index on the existing tuples until everybody who might insert new tuples knows about the index. I don't see that we have the same kind of problem in this case. Each partition is basically a separate table with its own set of indexes; as long as queries don't end up with one notion of which tables are relevant and a different notion of which indexes are relevant, we shouldn't end up with any table/index inconsistencies. And it's not clear to me what other problems we actually have here.

To put it another way, if we've got the notion of "isvalid" for a partition, what's the difference between a partition that exists but is not yet valid and one that exists and is valid? I can't think of anything, and I have a feeling that you may therefore be inventing a lot of infrastructure that is not necessary.

I'm inclined to think that we could drop the name CONCURRENTLY from this feature altogether and recast it as work to reduce the lock level associated with partition attach/detach. As long as we have a reasonable definition of what the semantics are for already-running queries, and clear documentation to go with those semantics, that seems fine. If a particular user finds the concurrent behavior too strange, they can always perform the DDL in a transaction that uses LOCK TABLE first, removing the concurrency.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company