Hi everyone,

I've been trialling different inheritance schemes for partitioning to a large 
number of tables.  I am looking at ~1e9 records, totaling ~200GB.

I've found that a k-ary table inheritance tree works quite well to reduce the 
O(n) CHECK constraint overhead [1] in the query planner when enabling partition 
constraint exclusion.

I've played with binary (k=2) trees, and have found that query planning time is 
shorter for shallow trees where k>>2. (It appears that "more work" spent 
checking CHECK constraints is faster than to recur down the inheritance tree.  
Is this because fewer table locks are involved?)

A given tree structure (e.g. k=16) has a good query-plan time for SELECT 
queries in my case.  The query-plan times, however, for UPDATE and DELETE are 
unfortunately quite quite bad.  (I was surprised that query-planning time was 
not similar across all three queries?)

My machine swaps wildly when PostgreSQL plans an UPDATE or DELETE.  It does not 
swap for the SELECT query planning at all.  There is no noticeable memory 
growth by the postgres process for the SELECT plans.  There is huge memory 
usage growth when running a query-plan for UPDATE or DELETE.  The difference is 
something like going from 50MB to over 10GB of the process' virtual memory.

I'm trialling PostgreSQL on a MacBook Pro having 8GB physical RAM.


Here's an example, where the DDL for the inheritance tree [2] is generated by a 
Python script [3].

1. Query planning time for a SELECT query

> $ echo "explain select * from ptest where id = 34324234; \q" | time -p psql 
> ptest
>                                  QUERY PLAN
> -------------------------------------------------------------------------------
> Result  (cost=0.00..160.00 rows=48 width=4)
>   ->  Append  (cost=0.00..160.00 rows=48 width=4)
>         ->  Seq Scan on ptest  (cost=0.00..40.00 rows=12 width=4)
>               Filter: (id = 34324234)
>         ->  Seq Scan on ptest_0 ptest  (cost=0.00..40.00 rows=12 width=4)
>               Filter: (id = 34324234)
>         ->  Seq Scan on ptest_0_4 ptest  (cost=0.00..40.00 rows=12 width=4)
>               Filter: (id = 34324234)
>         ->  Seq Scan on ptest_0_4_1 ptest  (cost=0.00..40.00 rows=12 width=4)
>               Filter: (id = 34324234)
> (10 rows)
> 
> real         0.99
> user         0.00
> sys          0.00
> $

2. Query planning time for a DELETE query

> $ echo "explain delete from ptest where id = 34324234; \q" | time -p psql 
> ptest
>                               QUERY PLAN
> -------------------------------------------------------------------------
> Delete  (cost=0.00..160.00 rows=48 width=6)
>   ->  Seq Scan on ptest  (cost=0.00..40.00 rows=12 width=6)
>         Filter: (id = 34324234)
>   ->  Seq Scan on ptest_0 ptest  (cost=0.00..40.00 rows=12 width=6)
>         Filter: (id = 34324234)
>   ->  Seq Scan on ptest_0_4 ptest  (cost=0.00..40.00 rows=12 width=6)
>         Filter: (id = 34324234)
>   ->  Seq Scan on ptest_0_4_1 ptest  (cost=0.00..40.00 rows=12 width=6)
>         Filter: (id = 34324234)
> (9 rows)
> 
> real       317.14
> user         0.00
> sys          0.00
> $

3. Query planning time for an UPDATE query

> $ echo "explain update ptest set id = 34324235 where id = 34324234;
> \q" | time -p psql ptest
>                               QUERY PLAN
> -------------------------------------------------------------------------
> Update  (cost=0.00..160.00 rows=48 width=6)
>   ->  Seq Scan on ptest  (cost=0.00..40.00 rows=12 width=6)
>         Filter: (id = 34324234)
>   ->  Seq Scan on ptest_0 ptest  (cost=0.00..40.00 rows=12 width=6)
>         Filter: (id = 34324234)
>   ->  Seq Scan on ptest_0_4 ptest  (cost=0.00..40.00 rows=12 width=6)
>         Filter: (id = 34324234)
>   ->  Seq Scan on ptest_0_4_1 ptest  (cost=0.00..40.00 rows=12 width=6)
>         Filter: (id = 34324234)
> (9 rows)
> 
> real       331.72
> user         0.00
> sys          0.00
> $


Query planning on the leaf nodes works properly for all query-types:

> $ echo "explain delete from ptest_0_4_1 where id = 34324234; \q" | time -p 
> psql ptest
>                             QUERY PLAN                             
> -------------------------------------------------------------------
>  Delete  (cost=0.00..40.00 rows=12 width=6)
>    ->  Seq Scan on ptest_0_4_1  (cost=0.00..40.00 rows=12 width=6)
>          Filter: (id = 34324234)
> (3 rows)
> 
> real         0.01
> user         0.00
> sys          0.00
> 
> $ echo "explain update ptest_0_4_1 set id = 34324235 where id = 34324234; \q" 
> | time -p psql ptest
>                             QUERY PLAN                             
> -------------------------------------------------------------------
>  Update  (cost=0.00..40.00 rows=12 width=6)
>    ->  Seq Scan on ptest_0_4_1  (cost=0.00..40.00 rows=12 width=6)
>          Filter: (id = 34324234)
> (3 rows)
> 
> real         0.01
> user         0.00
> sys          0.00
> $ 


With SELECT constraint exclusion working, I can define plpgsql functions to 
UPDATE or DELETE the leaf tables directly, but using such an interface isn't 
terribly elegant.

I therefore tried writing the plpgsql functions for UPDATE and DELETE anyway, 
with the idea of linking to a TRIGGER on the parent ptest table.  This didn't 
work as expected either, unless I polluted my application's SQL queries with 
the "ONLY" keyword to make sure the trigger fired [4].


Is the query-planning times and memory use as demonstrated above normal?  I am 
hoping this is just a defect in the query-planner that we might be able to fix 
so that PostgreSQL can manage my large data set with more ease.

Any advice appreciated,

John


[1] 
http://wiki.postgresql.org/wiki/Table_partitioning#SELECT.2C_UPDATE.2C_DELETE
[2] http://jpap.org/files/partition-test.txt
[3] http://jpap.org/files/partition-test.py
[4] http://archives.postgresql.org/pgsql-hackers/2008-11/msg01883.php


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Reply via email to