On Mon, May 23, 2016 at 6:31 PM, Alvaro Herrera
<alvhe...@2ndquadrant.com> wrote:
> Tom Lane wrote:
>> My feeling about it is that we need to provide a partitioning feature
>> that doesn't rely on the current notion of inheritance at all.  We've
>> heard from multiple users who want to use large numbers of partitions,
>> enough that simply having a separate relcache entry for each partition
>> would be a performance problem, never mind the current approach to
>> planning queries over inheritance trees.  So the partitions need to be
>> objects much simpler than full-fledged tables.
> Sorry to hijack the thread, but I agree on this, and I'm worried that
> the patch being floated for partitioning may paint us on a corner from
> which it may be difficult to get out.

This seems like a serious concern that merits further discussion.
What would a partition that is "much simpler than a full-fledged
table" look like?

>From my point of view, I don't see a real alternative to the direction
that Amit's partitioning patch takes.  But I'd be happy to be
convinced that I've got it all wrong, because I don't think it's going
to come close to fixing all of the problems that we have (though I
doubt that any one patch will do that).  However, let me lay out my
concerns so that, perhaps, somebody can suggest a better line of

1. We can't assume that all partitions are the same for purposes of
query planning.  One of the big problems with the status quo at least
in my view is that, as Tom says, we end up with separate paths for
every partition, separate RelOptInfo structures for every partition,
separate relcache and catcache entries for each partition.  Planning
is slow and memory usage is high because all of the work that normally
needs to be done for each table now has to be done for each partition.
Worse, while a query isn't likely to involve more than a dozen or so
tables, and often less, each of those tables could have tens,
hundreds, or even thousands of partitions, so the constant multipliers
are quite high.  It would be far better if we could consider only the
parent for planning purposes, or at least during early stages of
planning, and all of the partitions only later.  However, I see no way
to do that without generating terrible query plans in some cases where
it's clearly possible to do far better, because they need not be the
same size or the same data distribution.  The optimal strategy for
scanning partition A may be different than for partition B.  Of course
it will *often* be true that the right strategy is the same for every
partition, but I think exceptions will be fairly common.

More importantly, there are lots of really important query planner
optimizations that simply can't be done without considering each
partition separately.  For example, if you are joining two tables on
their partitioning key, you can do a partition-wise join rather than
joining the whole thing.  Every one of those joins may pick its own
strategy; e.g. the small partitions may use hash joins while the large
ones are forced to use merge joins due to memory constraints.  Some
partitions may be eliminated by constraints.  Or, to take an
optimization that we already have, getting sorted data from a family
of partitions is much more sensibly done with Merge Append than any
other strategy.  In either case, you can't devise or cost an
intelligent query plan without some understanding of the partition
structure.  At a minimum, you need to know the sizes of the various
partitioning and what sort of indexing is available.  In practice, you
probably want per-partition statistics and information about
constraints, too.  It's probably a bad idea to assume that every
partition has an equivalent data distribution.

2. TIDs are 6 bytes, and we have deeply ingrained assumptions about
what they mean.  So, you can't really have an index of the type that
we support today which slices across multiple partitions unless all of
those partitions share the same set of block numbers.  But what is the
physical layout of the underlying data?  If all of those block numbers
are in fact stored in a single relfilenode, you haven't really
partitioned the data and won't gain the anticipated benefits from the
ability to, say, drop a partition.  If they are stored in different
files, code that iterates over the range of available block numbers
will be performing random I/O rather than sequential I/O.  If it
weren't for those problems, one might imagine allowing cross-partition
indexes that all point into the same TID space, but as it is it seems
that would be require a lot of wrangling to make it work.  Also, it
doubles down on the 32TB table size limit, which is not as irrelevant
as it used to be.  I think if we want cross-partition indexes we
should approach that problem directly by having indexes that contain
8-byte ETIDs instead of 6-byte TIDs, which the extra 2 bytes storing a
partition designator - or something of that sort.

But I can't really see holding up the further development of
partitioning until somebody creates something like this.  Progress is
already far too slow.  So I think what we ought to do instead is
accept that partitions might have their own indexes in addition to any
sort of cross-partition index that we might someday allow.  Aside from
questions of implementation difficulty, there are good reasons to
think that single-partition indexes will often be superior.  It's
advantageous - whenever possible - to have indexes partitioned in the
same manner as the underlying table.  For example, VACUUM of a
single-partition index is likely to be a lot cheaper than VACUUM of a
cross-partition index.  When you remove TIDs from any one partition,
you have to scan the whole index, so it's faster if that index only
covers the one partition you care about.

And if you accept the idea that partitions ought to be allowed to have
single-partition indexes, then it seems to me that those indexes are
going to need pg_index and pg_class entries just like any other
indexes, and so are the partitions which are their parents.  Now, at
this point, I do not see how we end up with something that's much
simpler than a full-fledged table.

There's a lot to disagree with (or not) in the above paragraphs, but
let me mention a couple of things that seem like they might work.  It
seems possible that we might be able to set things up so that we don't
actually create relcache entries for all of the children unless they
are accessed directly.  For example, maybe we come up with some
compact representation of the minimum amount of data that's needed for
query planning, maybe involving folding together similar index
definitions across all children, and store all of this in the parent's
relcache entry.  We drive planning off that entry, which will be
larger than normal but not as large as keeping a complete relcache
entry for each child.  Similarly, in the planner, instead of creating
a RelOptInfo for every child straight off, we could try to have enough
information in the parent that we can do some partition elimination
before performing inheritance expansion.  Possibly the RelOptInfo
could be split into a part that is needed only for the parent and a
part that is needed for each child.  Maybe there are other ideas.

But to me, it seems like trying to minimize the representations of the
data we already have or do without them in simple cases is more
promising than trying to make a partition an altogether different and
simpler object than a table.  If it has or can have its own indexes -
and I don't see any good alternatives - than realistically it must
have pg_class entries and its indexes pg_index entries.  And even if
it doesn't, it will have to be considered to some degree a separate
object for planning purposes; or else one of the principle benefits of
partitioning, namely planner optimizations, won't happen.  In other
words, I don't see how a partition can be radically simpler than a
table without losing a lot of the benefits of partitioning.

But I'm open to suggestions.

Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to