On Fri, Aug 29, 2014 at 11:56 AM, Alvaro Herrera <alvhe...@2ndquadrant.com> wrote: > In this design, partitions are first-class objects, not normal tables in > inheritance hierarchies. There are no pg_inherits entries involved at all.
Whoa. I always assumed that table inheritance was a stepping-stone to real partitioning, and that real partitioning would be built on top of table inheritance. In particular, I assume that (as Itagaki Takahiro's patch did those many years ago) we'd add some metadata somewhere to allow fast tuple routing (for both pruning and inserts/updates). What's the benefit of inventing something new instead? I'm skeptical about your claim that there will be no pg_inherits entries involved at all. You need some way to know which partitions go with which parent table. You can store that many-to-one mapping someplace other than pg_inherits, but it seems to me that that doesn't buy you anything; they're just pg_inherits entries under some other name. Why reinvent that? > Each partition is assigned an Expression that receives a tuple and > returns boolean. This expression returns true if a given tuple belongs > into it, false otherwise. If a tuple for a partitioned relation is run > through expressions of all partitions, exactly one should return true. > If none returns true, it might be because the partition has not been > created yet. A user-facing error is raised in this case (Rationale: if > user creates a partitioned rel and there is no partition that accepts > some given tuple, it's the user's fault.) > > Additionally, each partitioned relation may have a master expression. > This receives a tuple and returns an integer, which corresponds to the > number of the partition it belongs into. I agree with Tom: this is a bad design. In particular, if we want to scale to large numbers of partitions (a principal weakness of the present system) we need the operation of routing a tuple to a partition to be as efficient as possible. Range partitioning can be O(lg n) where n is the number of partitions: store a list of the boundaries and binary-search it. List partitioning can be O(lg k) where k is the number of values (which may be more than the number of partitions) via a similar technique. Hash partitioning can be O(1). I'm not sure what other kind of partitioning anybody would want to do, but it's likely that they *won't* want it to be O(1) in the number of partitions. So I'd say have *only* the master expression. But, really, I don't think an expression is the right way to store this; evaluating that repeatedly will, I think, still be too slow. Think about what happens in PL/pgsql: minimizing the number of times that you enter and exit the executor helps performance enormously, even if the expressions are simple enough not to need planning. I think the representation should be more like an array of partition boundaries and the pg_proc OID of a comparator. > Per-partition expressions are formed as each partition is created, and > are based on the user-supplied partitioning criterion. Master > expressions are formed at relation creation time. (XXX Can we change > the master expression later, as a result of some ALTER command? > Presumably this would mean that all partitions might need to be > rewritten.) This is another really important point. If you store an opaque expression mapping partitioning keys to partition numbers, you can't do things like this efficiently. With a more transparent representation, like a sorted array of partition boundaries for range partitioning, or a sorted array of hash values for consistent hashing, you can do things like split and merge partitions efficiently, minimizing rewriting. > Planner ------- > > A partitioned relation behaves just like a regular relation for purposes > of planner. XXX do we need special considerations regarding relation > size estimation? > > For scan plans, we need to prepare Append lists which are used to scan > for tuples in a partitioned relation. We can setup fake constraint > expressions based on the partitioning expressions, which let the planner > discard unnecessary partitions by way of constraint exclusion. So if we're going to do all this, why bother making the partitions anything other than inheritance children? There might be some benefit in having the partitions be some kind of stripped-down object if we could avoid some of these planner gymnastics and get, e.g. efficient run-time partition pruning. But if you're going to generate Append plans and switch ResultRelInfos and stuff just as you would for an inheritance hierarchy, why not just make it an inheritance hierarchy? It seems pretty clear to me that we need partitioned tables to have the same tuple descriptor throughout the relation, for efficient tuple routing and so on. But the other restrictions you're proposing to impose on partitions have no obvious value that I can see. We could have a rule that when you inherit from a partition root, you can only inherit from that one table (no multiple inheritance) and your tuple descriptor must match precisely (down to dropped columns and column ordering) and that would give you everything I think you really need here. There's no gain to be had in forbidding partitions from having different owners, or being selected from directly, or having user-visible names. The first of those is arguably useless, but it's not really causing us any problems, and the latter two are extremely useful features. Unless you are going to implement partition pruning is so good that it will never fail to realize a situation where only one partition needs to be scanned, letting users target the partition directly is a very important escape hatch. > (In the future we might be interested in creating specialized plan and > execution nodes that know more about partitioned relations, to avoid > creating useless Append trees only to prune them later.) Good idea. > pg_dump is able to dump a partitioned relation as a CREATE > TABLE/PARTITION command and a series of ALTER TABLE/CREATE PARTITION > commands. The data of all partitions is considered a single COPY > operation. > > XXX this limits the ability to restore in parallel. To fix we might consider > using one COPY for each partition. It's not clear what relation should be > mentioned in such a COPY command, though -- my instinct is that it > should reference the parent table only, not the individual partition. Targeting the individual partitions seems considerably better. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (firstname.lastname@example.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers