On Fri, Nov 9, 2018 at 9:50 AM Robert Haas <robertmh...@gmail.com> wrote:
> I had another idea, too. I think we might be able to reuse the
> technique Noah invented in 4240e429d0c2d889d0cda23c618f94e12c13ade7.
> That is:
>
> - make a note of SharedInvalidMessageCounter before doing any of the
> relevant catalog lookups
> - do them
> - AcceptInvalidationMessages()
> - if SharedInvalidMessageCounter has changed, discard all the data we
> collected and retry from the top
>
> I believe that is sufficient to guarantee that whatever we construct
> will have a consistent view of the catalogs which is the most recent
> available view as of the time we do the work. And with this approach
> I believe we can continue to use syscache lookups to get the data
> rather than having to use actual index scans, which is nice.
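
For concreteness, the retry loop described above boils down to something
like the sketch below. This is only an illustration, not the patch:
collect_partition_data() and get_consistent_partition_data() are made-up
placeholders for whatever syscache lookups we actually need, while
SharedInvalidMessageCounter, AcceptInvalidationMessages(), and
list_free_deep() are existing backend facilities.

#include "postgres.h"

#include "nodes/pg_list.h"
#include "storage/sinval.h"

/* Placeholder for the real catalog lookups; not an actual function. */
static List *collect_partition_data(Oid relid);

static List *
get_consistent_partition_data(Oid relid)
{
    List       *data;

    for (;;)
    {
        uint64      inval_count = SharedInvalidMessageCounter;

        /* Do the lookups via the syscaches. */
        data = collect_partition_data(relid);

        /*
         * Pull in any pending invalidation messages.  If the counter did
         * not move, nothing we just read can have been invalidated in the
         * meantime, so the data reflects a consistent, current view of
         * the catalogs.
         */
        AcceptInvalidationMessages();
        if (inval_count == SharedInvalidMessageCounter)
            break;

        /* Something changed under us: throw it all away and retry. */
        list_free_deep(data);
    }

    return data;
}

The key property is that if the counter doesn't move across
AcceptInvalidationMessages(), nothing we read during the loop iteration
can have been invalidated behind our back.
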
Here are a couple of patches to illustrate this approach to this part of the overall problem. 0001 is, I think, a good cleanup that may as well be applied in isolation; it makes the code in RelationBuildPartitionDesc both cleaner and more efficient. 0002 adjusts things so that - I hope - the partition bounds we get for the individual partitions have to be as of the same point in the commit sequence as the list of children. As I noted before, Alvaro's patch doesn't seem to have tackled this part of the problem.

Another part of the problem is finding a way to make sure that if we execute a query (or plan one), all parts of the executor (or planner) see the same PartitionDesc for every relation. In the case of parallel query, I think it's important to try to get consistency not only within a single backend but also across backends. I'm thinking about perhaps creating an object with a name like PartitionDescDirectory which can optionally attach to dynamic shared memory. It would store an OID -> PartitionDesc mapping in local memory, and optionally, an additional OID -> serialized-PartitionDesc mapping in DSA. When given an OID, it would check the local hash table first, and then if that doesn't find anything, check the shared hash table if there is one. If an entry is found there, deserialize it and add it to the local hash table. (A rough sketch of what that lookup path might look like appears at the end of this message.) We'd then hang such a directory off of the EState for the executor and the PlannerInfo for the planner. As compared with Alvaro's proposal, this approach has the advantage of not treating parallel query as a second-class citizen, and also of keeping partitioning considerations out of the snapshot handling, which as I said before seems to me to be a good idea.

One thing which was vaguely on my mind in earlier emails, but which I think I can now articulate somewhat more clearly, is this: in some cases, a consistent but outdated view of the catalog state is just as bad as an inconsistent view of the catalog state. For example, it's not OK to decide that a tuple can be placed in a certain partition based on an outdated list of relation constraints, including the partitioning constraint - nor is it OK to decide that a partition can be pruned based on an old view of the partitioning constraint. I think this means that whenever we change a partition's partitioning constraint, we MUST take AccessExclusiveLock on the partition. Otherwise, a heap_insert() [or a partition pruning decision] can be in progress on that relation in one backend at the same time that some other backend is changing that partition's constraint, which can't possibly work. The best we can reasonably do is to reduce the locking level on the partitioned table itself.

A corollary is that holding the PartitionDescs that a particular query sees for a particular relation fixed, whether by the method Alvaro proposes or by what I am proposing here or by some other method, is not a panacea. For example, the ExecPartitionCheck call in copy.c sometimes gets skipped on the theory that if tuple routing has sent us to partition X, then the tuple being routed must satisfy the partition constraint for that partition. But that's not true if we set up tuple routing using one view of the catalogs and then things changed afterwards. RelationBuildPartitionDesc doesn't lock the children whose relpartbounds it is fetching (!), so unless we're guaranteed to have already locked those children earlier for some other reason, we could grab the partition bound at this point and then it could change again before we get a lock on them.
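
To make the PartitionDescDirectory lookup path a little more concrete, here
is roughly what I have in mind. Again, just a sketch: all of the
directory-specific names (PartitionDescDirectory, PartitionDescDirectoryEntry,
SerializedPartitionDesc, deserialize_partdesc, and the lookup function itself)
are invented for this example, whereas hash_search(), dshash_find(),
dshash_release_lock(), and RelationGetPartitionDesc() are existing facilities.

#include "postgres.h"

#include "lib/dshash.h"
#include "utils/hsearch.h"
#include "utils/rel.h"

/* Hypothetical serialized form of a PartitionDesc, stored in DSA. */
typedef struct SerializedPartitionDesc SerializedPartitionDesc;
extern PartitionDesc deserialize_partdesc(SerializedPartitionDesc *sdesc);

typedef struct PartitionDescDirectoryEntry
{
    Oid             reloid;     /* hash key */
    PartitionDesc   pdesc;      /* locally cached descriptor */
} PartitionDescDirectoryEntry;

typedef struct PartitionDescDirectory
{
    HTAB           *local;      /* OID -> PartitionDesc, backend-local */
    dshash_table   *shared;     /* OID -> serialized desc in DSA, or NULL */
} PartitionDescDirectory;

/*
 * Look up the PartitionDesc to be used for 'rel' for the rest of this
 * query.  The local HTAB is assumed to have been created with OID keys
 * and PartitionDescDirectoryEntry-sized entries (creation not shown).
 */
static PartitionDesc
PartitionDescDirectoryLookup(PartitionDescDirectory *pdir, Relation rel)
{
    Oid         reloid = RelationGetRelid(rel);
    bool        found;
    PartitionDescDirectoryEntry *entry;

    /* Check the backend-local table first. */
    entry = hash_search(pdir->local, &reloid, HASH_ENTER, &found);
    if (found)
        return entry->pdesc;

    /* Not cached locally; consult the shared table, if we have one. */
    if (pdir->shared != NULL)
    {
        SerializedPartitionDesc *sdesc;

        sdesc = dshash_find(pdir->shared, &reloid, false);
        if (sdesc != NULL)
        {
            entry->pdesc = deserialize_partdesc(sdesc);
            dshash_release_lock(pdir->shared, sdesc);
            return entry->pdesc;
        }
    }

    /* First lookup anywhere: fall back to the relcache's descriptor. */
    entry->pdesc = RelationGetPartitionDesc(rel);
    return entry->pdesc;
}

The creation/attach side isn't shown; the idea would be that the leader
populates the shared table with serialized descriptors so that parallel
workers end up deserializing exactly the same thing the leader used.
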
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
0002-Ensure-that-RelationBuildPartitionDesc-sees-a-consis.patch
0001-Reduce-unnecessary-list-construction-in-RelationBuil.patch