Here is a query on a partitioned schema that produces a very bad query plan.
The tables are fully vacuumed and analyzed with a statistics target of 40, and
have no bloat (they were built with pure inserts, no updates or deletes).
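
For reference, the per-partition prep is roughly the following sketch (the
real scripts may differ; the target could equally be set per column with
ALTER TABLE ... SET STATISTICS):

  set default_statistics_target = 40;
  vacuum analyze p_pp_logs_026_2008_12_17;
  vacuum analyze p_s_logs_026_2008_12_17;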

I already know of at least three query planner bugs involving partitions;
they are listed at the end of this message.

Below are EXPLAIN ANALYZE examples on just one partition.  Expanding the date
range to cover multiple partitions only makes things worse.

First, the query written directly against the partitions, which is fast and
gets the right plan:
(although, wouldn't it use less memory if it hashed the small table and
scanned the larger one against that, rather than vice versa?  Or am I reading
the EXPLAIN wrong?)

rr=> explain analyze select pl.id, sl.terms from p_pp_logs_026_2008_12_17 pl, p_s_logs_026_2008_12_17 sl where sl.session = pl.session and sl.terms <> 'null' and sl.terms is not null;
                     QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=31470.89..37199.26 rows=38597 width=27) (actual time=743.782..772.705 rows=29907 loops=1)
   Hash Cond: ((pl.session)::text = (sl.session)::text)
   ->  Seq Scan on pp_logs_026_2008_12_17 pl  (cost=0.00..357.49 rows=8449 width=46) (actual time=0.007..3.064 rows=8449 loops=1)
   ->  Hash  (cost=24822.35..24822.35 rows=531883 width=57) (actual time=743.597..743.597 rows=531453 loops=1)
         ->  Seq Scan on s_logs_026_2008_12_17 sl  (cost=0.00..24822.35 rows=531883 width=57) (actual time=0.011..392.242 rows=532067 loops=1)
               Filter: ((terms IS NOT NULL) AND ((terms)::text <> 'null'::text))
 Total runtime: 779.431 ms

In the above, the estimated and actual row counts are about right, off only a
bit after the join.

We partition by s_id and date; the general form of the query, used for
scanning more than one combination of those values, fails.
In the query below, which restricts the data to the same tables (plus the
parent tables, which hold one irrelevant row of dummy data so that ANALYZE
works on them), the planner is extremely wrong:

explain analyze select pl.id, sl.terms from pp_logs pl, s_logs sl where pl.s_id = 26 and sl.s_id = 26 and sl.session = pl.session and sl.terms <> 'null' and sl.terms is not null and pl.date = '2008-12-17' and sl.date = '2008-12-17';
                                                                          QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=79036.43..416160.16 rows=22472099 width=27) (actual time=7089.961..7839.665 rows=29907 loops=1)
   Merge Cond: ((pl.session)::text = (sl.session)::text)
   ->  Sort  (cost=968.98..990.10 rows=8450 width=46) (actual time=193.778..196.690 rows=8449 loops=1)
         Sort Key: pl.session
         Sort Method:  quicksort  Memory: 1043kB
         ->  Append  (cost=0.00..417.84 rows=8450 width=46) (actual time=36.824..133.157 rows=8449 loops=1)
               ->  Seq Scan on pp_logs pl  (cost=0.00..18.10 rows=1 width=21) (actual time=0.056..0.056 rows=0 loops=1)
                     Filter: ((s_id = 26) AND (date = '2008-12-17'::date))
               ->  Seq Scan on pp_logs_026_2008_12_17 pl  (cost=0.00..399.74 rows=8449 width=46) (actual time=36.766..129.317 rows=8449 loops=1)
                     Filter: ((s_id = 26) AND (date = '2008-12-17'::date))
   ->  Sort  (cost=78067.45..79397.16 rows=531884 width=57) (actual time=6895.648..7094.701 rows=552412 loops=1)
         Sort Key: sl.session
         Sort Method:  quicksort  Memory: 92276kB
         ->  Append  (cost=0.00..27483.28 rows=531884 width=57) (actual time=33.759..1146.936 rows=532067 loops=1)
               ->  Seq Scan on s_logs sl  (cost=0.00..1.19 rows=1 width=22) (actual time=0.036..0.036 rows=0 loops=1)
                     Filter: ((terms IS NOT NULL) AND ((terms)::text <> 'null'::text) AND (s_id = 26) AND (date = '2008-12-17'::date))
               ->  Seq Scan on s_logs_026_2008_12_17 sl  (cost=0.00..27482.09 rows=531883 width=57) (actual time=33.721..903.607 rows=532067 loops=1)
                     Filter: ((terms IS NOT NULL) AND ((terms)::text <> 'null'::text) AND (s_id = 26) AND (date = '2008-12-17'::date))
 Total runtime: 7861.723 ms

Some partitions have about 10x as much data as the above, resulting in ~30x
longer sort times but only ~10x longer hash times.  The sort also uses far,
far more memory than the hash would if the planner hashed the small table and
hash-joined the large one against it.
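
(Rough arithmetic, ignoring per-tuple overhead: a hash of the small arm is on
the order of 8449 rows * 46 bytes, roughly 380kB, versus the 92276kB the
quicksort above reports.)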

As you can see in the above example, this inner join, which the planner knows
has two arms of ~530K and ~8.5K rows, is predicted to produce 22M output
rows.  What?
Did it lose its ability to even roughly know n_distinct?  Can't it use
n_distinct from each table and combine them into a (weighted) worst-case
estimate and arrive at something less bizarre?
If I scan across more tables, the actual output grows roughly linearly, but
the estimate looks like A*ln(A*B), where A is the size of the larger arm and
B is the size of the smaller one.  Having an inner join produce more rows
than one of its arms is hard to do with n_distinct values like these, as the
direct-to-table query's estimate shows.
Perhaps the parent table (with one row) is getting in the way?
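
Doing the arithmetic, the defaults line up suspiciously well: if I read
selfuncs.h right, the planner's fallback equi-join selectivity when it has no
usable stats is DEFAULT_EQ_SEL = 0.005, and

  8450 * 531884 * 0.005 = 22,472,099

which is exactly the rows=22472099 estimate above.  So it looks like default
statistics are leaking in from somewhere, presumably via the parent tables.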

Any suggestions on how to get a good plan when accessing this via the parent 
table to scan multiple dates?
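
The only workaround I know of is to expand the date range by hand into a
UNION ALL over the direct partitions, along these lines (the second pair of
table names is just illustrative):

  select pl.id, sl.terms
    from p_pp_logs_026_2008_12_17 pl, p_s_logs_026_2008_12_17 sl
   where sl.session = pl.session and sl.terms <> 'null' and sl.terms is not null
  union all
  select pl.id, sl.terms
    from p_pp_logs_026_2008_12_18 pl, p_s_logs_026_2008_12_18 sl
   where sl.session = pl.session and sl.terms <> 'null' and sl.terms is not null;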


Known (to me) partition bugs related to the query planner / stats:
* Row width estimates for scans across multiple tables are wrong: they use
just max(each table's width) rather than a weighted average (this could have
been fixed in 8.3.4+ without my noticing, though).
* Column selectivity is not combined rationally into aggregate statistics
when an arm scans multiple tables.  For example, two partitions may each have
an int column with only 30 distinct values over the 2M rows in each
partition.  The planner will decide that the selectivity is 10%, rather than
the worst case, (30+30)/(total rows), or the best case, 30/(total rows).
There are many variations on this theme, not just the above type of error.
* An empty partition (usually the parent table) will propagate default
statistics through the errors above, leading to even stranger query plans.
Empty-table default statistics are generally poor, but they are worse in the
cases above since they override good stats.  Furthermore, ANALYZE won't write
stats for an empty table, so unless you put dummy rows in the table, you are
guaranteed to get bad plans.

To help with the above, I have put a single dummy row in each parent table,
with minimum column width on the text columns, and analyzed it.  All
partitions are fully vacuumed with no bloat and fully analyzed.  After a day,
they become read-only.
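
That amounts to roughly the following per parent table (the column lists and
values here are guesses from the queries above, not the real schema):

  -- dummy rows so ANALYZE writes stats; out-of-range key values so they
  -- never match real queries; types/values are assumptions
  insert into pp_logs (id, session, s_id, date) values (0, '', 0, '1900-01-01');
  insert into s_logs (session, terms, s_id, date) values ('', '', 0, '1900-01-01');
  analyze pp_logs;
  analyze s_logs;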

The only time I wish I had hints is when I'm working on partitioned tables;
most other times, minor query modifications plus good statistics get a good
plan (if perhaps a less natural query syntax).  The planner is simply not
adequate on partitioned tables.
