On 21 August 2018 at 13:59, Robert Haas <robertmh...@gmail.com> wrote: > On Mon, Aug 20, 2018 at 4:21 PM, Alvaro Herrera > <alvhe...@2ndquadrant.com> wrote: >> I wonder if this all stems from a misunderstanding of what I suggested >> to David offlist. My suggestion was that the catalog scans would >> continue to use the catalog MVCC snapshot, and that the relcache entries >> would contain all the partitions that appear to the catalog; but each >> partition's entry would carry the Xid of the creating transaction in a >> field (say xpart), and that field is compared to the regular transaction >> snapshot: if xpart is visible to the transaction snapshot, then the >> partition is visible, otherwise not. So you never try to access a >> partition that doesn't exist, because those just don't appear at all in >> the relcache entry. But if you have an old transaction running with an >> old snapshot, and the partitioned table just acquired a new partition, >> then whether the partition will be returned as part of the partition >> descriptor or not depends on the visibility of its entry. > > Hmm. One question is where you're going to get the XID of the > creating transaction. If it's taken from the pg_class row or the > pg_inherits row or something of that sort, then you risk getting a > bogus value if something updates that row other than what you expect > -- and the consequences of that are pretty bad here; for this to work > as you intend, you need an exactly-correct value, not newer or older. > An alternative is to add an xid field that stores the value > explicitly, and that might work, but you'll have to arrange for that > value to be frozen at the appropriate time. > > A further problem is that there could be multiple changes in quick > succession. Suppose that a partition is attached, then detached > before the attach operation is all-visible, then reattached, perhaps > with different partition bounds.
I should probably post the WIP I have here. In those, I do have the xmin array in the PartitionDesc. This gets taken from the new pg_partition table, which I don't think suffers from the same issue as taking it from pg_class, since nothing else will update the pg_partition record. However, I don't think the xmin array is going to work if we include it in the PartitionDesc. The problem is, as I discovered from writing the code was that the PartitionDesc must remain exactly the same between planning an execution. If there are any more or any fewer partitions found during execution than what we saw in planning then run-time pruning will access the wrong element in the PartitionPruneInfo array, or perhaps access of the end of the array. It might be possible to work around that by identifying partitions by Oid rather than PartitionDesc array index, but the run-time pruning code is already pretty complex. I think coding it to work when the PartitionDesc does not match between planning and execution is just going to too difficult to get right. Tom is already unhappy with the complexity of ExecFindInitialMatchingSubPlans(). I think the solution will require that the PartitionDesc does not: a) Change between planning and execution. b) Change during a snapshot after the partitioned table has been locked. With b, it sounds like we'll need to take the most recent PartitionDesc even if the transaction is older than the one that did the ATTACH/DETACH operation as if we use an old version then, as Robert mentions, there's nothing to stop another transaction making changes to the table that make it an incompatible partition, e.g DROP COLUMN. This wouldn't be possible if we update the PartitionDesc right after taking the first lock on the partitioned table since any transactions doing DROP COLUMN would be blocked until the other snapshot gets released. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services