On Wed, Aug 17, 2016 at 2:21 AM, Amit Langote
<langote_amit...@lab.ntt.co.jp> wrote:
> I am slightly tempted to eliminate the pg_partition catalog and associated
> syscache altogether and add a column to pg_class as Robert suggested.
> That way, all relid_is_partition() calls will be replaced by
> rel->rd_partbound != NULL check.  But one potential problem with that
> approach is that now whenever a parent relation is opened, all the
> partition relations must be opened to get the partbound value (to form the
> PartitionDesc to be stored in parent relation's rd_partdesc).  Whereas
> currently, we just look up the pg_partition catalog (or the associated
> cache) for every partition and that gets us the partbound.

Well, you could just look up the pg_class row without opening the
relation, too.  There is a system cache on pg_class.oid, after all.  I
think the issue is whether it's safe to read either one of those
things without a lock on the child relation.  If altering the
partitioning information for a relation requires holding only
AccessExclusiveLock on that relation, and no lock on the parent, then
you really can't read the information for any child relation without
taking at least AccessShareLock.  Otherwise, it might change under
you, and that would be bad.

I'm inclined to think that changing the partitioning information for a
child is going to require AccessExclusiveLock on both the child and
the parent.  That seems unfortunate from a concurrency point of view,
but we may be stuck with it: suppose you require only
ShareUpdateExclusiveLock on the parent.  Well, then a concurrent read
transaction might see the partition boundaries change when it does a
relcache rebuild, which would cause it to suddenly start expecting the
data to be in a different plan in mid-transaction, perhaps even in
mid-scan.  Maybe that's survivable with really careful coding, but it
seems like it's probably a bad thing.  For example, it would mean that
the executor would be unable to rely on the partitioning information
in the relcache remaining stable underneath it.  Moreover, the
relcache is always going to be scanned with the most recent possible
MVCC snapshot, but the transaction snapshot may be older, so such a
system creates all sorts of nasty possibilities for there to be skew
between the snapshot being used to via the data and the snapshot being
used to read the metadata that says where the data is.

This may need some more thought, but if we go with that approach of
requiring an AccessExclusiveLock on both parent and child, then it
seems to me that maybe we should consider the partitioning information
to be a property of the parent rather than the child.  Just take all
the partitioning information for all children and put it in one big
node tree and store it in the pg_class or pg_partition_root entry for
the parent as one big ol' varlena.  Now you can open the parent and
get all of the partitioning information for all of the children
without needing any lock on any child, and that's *really* good,
because it means that some day we might be able to do partition
elimination before locking any of the children!  That would be

Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to