Re: [HACKERS] Declarative partitioning - another take

Tomas Vondra Mon, 12 Dec 2016 07:18:52 -0800

On 12/12/2016 07:37 AM, Amit Langote wrote:


Hi Tomas,

On 2016/12/12 10:02, Tomas Vondra wrote:


2) I'm wondering whether having 'table' in the catalog name (and also in
the new relkind) is too limiting. I assume we'll have partitioned indexes
one day, for example - do we expect to use the same catalogs?


I am not sure I understand your idea of partitioned indexes, but I doubt
it would require entries in the catalog under consideration.  Could you
perhaps elaborate more?

OK, let me elaborate. Let's say we have a partitioned table, and I wantto create an index. The index may be either "global" i.e. creating asingle relation for data from all the partitions, or "local" (i.e.partitioned the same way as the table).

Local indexes are easier to implement (it's essentially what we havenow, except that we need to create the indexes manually for eachpartition), and don't work particularly well for some use cases (e.g.unique constraints). This is what I mean by "partitioned indexes".

If the index is partitioned just like the table, we probably won't needto copy the partition key info (so, nothing in pg_partitioned_table).I'm not sure it makes sense to partition the index differently than thetable - I don't see a case where that would be useful.

The global indexes would work better for the unique constraint use case,but it clearly contradicts our idea of TID (no information about whichpartition that references).

So maybe the catalog really only needs to track info about tables? Notsure. I'm just saying it'd be unfortunate to have _table in the name,and end up using it for indexes too.

4) I see GetIndexOpClass() got renamed to ResolveOpClass(). I find the new
name rather strange - all other similar functions start with "Get", so I
guess "GetOpClass()" would be better. But I wonder if the rename was even
necessary, considering that it still deals with index operator classes
(although now also in context of partition keys). If the rename really is
needed, isn't that a sign that the function does not belong to indexcmds.c
anymore?


The fact that both index keys and partition keys have very similar
specification syntax means there would be some common code handling the
same, of which this is one (maybe, only) example.  I didn't see much point
in finding a new place for the function, although I can see why it may be
a bit confusing to future readers of the code.

About the new name - seeing the top line in the old header comment, what
the function does is resolve a possibly explicitly specified operator
class name to an operator class OID.  I hadn't changed the name in my
original patch, but Robert suggested the rename, which I thought made sense.


OK, I don't particularly like the name but I can live with that.

5) Half the error messages use 'child table' while the other half uses
'partition'. I think we should be using 'partition' as child tables really
have little meaning outside inheritance (which is kinda hidden behind the
new partitioning stuff).


One way to go about it may be to check all sites that can possibly report
an error involving child tables (aka "partitions") whether they know from
the context which name to use.  I think it's possible, because we have
access to the parent relation in all such sites and looking at the relkind
can tell whether to call child tables "partitions".

Clearly, this is a consequence of building the partitioning on top ofinheritance (not objecting to that approach, merely stating a fact).

I'm fine with whatever makes the error messages more consistent, if itdoes not make the code significantly more complex. It's a bit confusingwhen some use 'child tables' and others 'partitions'. I suspect even asingle DML command may return a mix of those, depending on where exactlyit fails (old vs. new code).

6) The psql tab-completion seems a bit broken, because it only offers the
partitions, not the parent table. Which is usually exactly the opposite of
what the user wants.


Actually, I have not implemented any support for tab-completion for the
new DDL.  I will work on that.

Aha! So the problem probably is that psql does not recognize the new'partitioned table' relkind, and only looks at regular tables.

testing
-------

2) When doing a SELECT COUNT(*) from the partitioned table, I get a plan
like this:

                                  QUERY PLAN
-------------------------------------------------------------------
 Finalize Aggregate  (cost=124523.64..124523.65 rows=1 width=8)
   ->  Gather  (cost=124523.53..124523.64 rows=1 width=8)
         Workers Planned: 1
         ->  Partial Aggregate  (cost=123523.53..123523.54 rows=1 ...)
               ->  Append  (cost=0.00..108823.53 rows=5880001 width=0)
                     ->  Parallel Seq Scan on parent ...
                     ->  Parallel Seq Scan on partition_1 ...
                     ->  Parallel Seq Scan on partition_2 ...
                     ->  Parallel Seq Scan on partition_3 ...
                     ->  Parallel Seq Scan on partition_4 ...
                     ->  Parallel Seq Scan on partition_5 ...
                     ->  Parallel Seq Scan on partition_6 ...
                     ... and the rest of the 10k partitions

So if I understand the plan correctly, we first do a parallel scan of the
parent, then partition_1, partition_2 etc. But AFAIK we scan the tables in
Append sequentially, and each partition only has 1000 rows each, making
the parallel execution rather inefficient. Unless we scan the partitions
concurrently.

In any case, as this happens even with plain inheritance, it's probably
more about the parallel query than about the new partitioning patch.


Yes, I have seen some discussion [2] about a Parallel Append, which would
result in plans you're probably thinking of.

Actually, I think that thread is about allowing partial paths even ifonly some appendrel members support it. My point is that in this caseit's a bit silly to even build the partial paths, when the partitionsonly have 1000 rows each.

I kinda suspect we only look at the total appendrel rowcount estimate,but haven't checked.

3) The last issue I noticed is that while

    EXPLAIN SELECT * FROM partitioned_table WHERE id = 1292323;

works just fine (it takes fair amount of time to plan with 10k partitions,
but that's expected), this

    EXPLAIN UPDATE partitioned_table SET x = 1 WHERE id = 1292323;

allocates a lot of memory (~8GB on my laptop, before it gets killed by OOM
killer). Again, the same thing happens with plain inheritance-based
partitioning, so it's probably not a bug in the partitioning patch.

I'm mentioning it here because I think the new partitioning will hopefully
get more efficient and handle large partition counts more efficiently (the
inheritance only really works for ~100 partitions, which is probably why
no one complained about OOM during UPDATEs). Of course, 10k partitions is
a bit extreme (good for testing, though).


Plans with the inheritance parents as target tables (update/delete)
go through inheritance_planner() in the optimizer, which currently
has a design that is based on certain assumptions about traditional
inheritance. It's possible hopefully to make it less expensive for
the partitioned  tables.

Yes, I know. I wasn't really reporting it as a bug in the partitioningpatch, but more as a rather surprising difference between plain SELECTand UPDATE.

Am I right that one of the ambitions of the new partitioning is toimprove behavior with large number of partitions?

At first I thought it's somewhat related to the FDW sharding (each nodebeing a partition and having local subpartitions), but I realize theplanner will only deal with the node partitions I guess.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Declarative partitioning - another take

Reply via email to