Thanks Tom and Heikki! I really appreciate your help. I went ahead and loaded all the data. In the icream table, I now have ~175 partitions, each with 4041866 records.
The data finished loading 12 hours ago, and I then ran the same query I gave you guys, and it took 25 seconds since it used the wrong execution plan as expected. HashAggregate (cost=27680.90..28045.81 rows=11228 width=41) (actual time=24769.190..24817.618 rows=11028 loops=1) -> Hash Join (cost=1304.04..18901.88 rows=292634 width=41) (actual time=3938.965..24688.718 rows=11028 loops=1) Hash Cond: (rc.widget_id = r.widget_id) -> Append (cost=0.00..12110.95 rows=292634 width=24) (actual time=2854.925..22887.638 rows=309579 loops=1) -> Seq Scan on icecream rc (cost=0.00..25.60 rows=1 width=24) (actual time=0.003..0.003 rows=0 loops=1) Filter: ((widgetset_id = 5) AND (dataset_id = 283)) -> Index Scan using icecream_part_283_widgetset_id_idx on icecream_part_283 rc (cost=0.00..12085.35 rows=292633 width=24) (actual time=2854.915..2 1784.769 rows=309579 loops=1) Index Cond: (widgetset_id = 5) Filter: (dataset_id = 283) -> Hash (cost=1163.69..1163.69 rows=11228 width=21) (actual time=1083.704..1083.704 rows=11028 loops=1) Buckets: 2048 Batches: 1 Memory Usage: 604kB -> Append (cost=4.28..1163.69 rows=11228 width=21) (actual time=528.216..1066.659 rows=11028 loops=1) -> Bitmap Heap Scan on widget r (cost=4.28..12.75 rows=1 width=48) (actual time=528.017..528.017 rows=0 loops=1) Recheck Cond: (widgetset_id = 5) Filter: (widget_type_id = 4) -> Bitmap Index Scan on widget_widgetset_id_idx (cost=0.00..4.28 rows=4 width=0) (actual time=527.995..527.995 rows=0 loops=1) Index Cond: (widgetset_id = 5) -> Index Scan using widget_part_5_widget_widget_type_id_idx on widget_part_5 r (cost=0.00..1150.94 rows=11227 width=21) (actual time=0.191..512.847 rows=1 1028 loops=1) Index Cond: (widget_type_id = 4) Filter: (widgetset_id = 5) Total runtime: 24844.016 ms (21 rows) I then changed my enable* options to force it to use a nested loop: set enable_mergejoin=off; set enable_hashjoin=off; set enable_nestloop = on; HashAggregate (cost=460004.79..460369.70 rows=11228 width=41) (actual time=298.014..341.822 rows=11028 loops=1) -> Nested Loop (cost=4.28..338747.85 rows=4041898 width=41) (actual time=0.175..248.529 rows=11028 loops=1) Join Filter: (r.widget_id = rc.widget_id) -> Append (cost=4.28..1163.69 rows=11228 width=21) (actual time=0.053..42.532 rows=11028 loops=1) -> Bitmap Heap Scan on widget r (cost=4.28..12.75 rows=1 width=48) (actual time=0.014..0.014 rows=0 loops=1) Recheck Cond: (widgetset_id = 5) Filter: (widget_type_id = 4) -> Bitmap Index Scan on widget_widgetset_id_idx (cost=0.00..4.28 rows=4 width=0) (actual time=0.008..0.008 rows=0 loops=1) Index Cond: (widgetset_id = 5) -> Index Scan using widget_part_5_widget_widget_type_id_idx on widget_part_5 r (cost=0.00..1150.94 rows=11227 width=21) (actual time=0.032..18.410 rows=11028 lo ops=1) Index Cond: (widget_type_id = 4) Filter: (widgetset_id = 5) -> Append (cost=0.00..29.99 rows=6 width=24) (actual time=0.009..0.012 rows=1 loops=11028) -> Seq Scan on icecream rc (cost=0.00..23.00 rows=5 width=24) (actual time=0.001..0.001 rows=0 loops=11028) Filter: (rc.dataset_id = 283) -> Index Scan using icecream_part_283_widget_id_idx on icecream_part_283 rc (cost=0.00..6.99 rows=1 width=24) (actual time=0.004..0.006 rows=1 loo ps=11028) Index Cond: (rc.widget_id = r.widget_id) Filter: (rc.dataset_id = 283) Total runtime: 361.180 ms (19 rows) The query was nice and fast as expected. I then restored the enable* options back to default. The query was slow again and taking around ~19 seconds. So, this gives us some information about whether autoanalyze is running in the background: -I don't think waiting is making a difference. I loaded the data 12 hours ago, and then ran the query and the query was very slow. I then ran the query again but for a different dataset and it was also slow. It was only once I changed the enable* parameters could I get my expected performance. -I would have tried select * from ice-cream again, but with 176 partitions this query is no longer feasible. But, I do agree that select * from icecream is causing the statistics to be updated. So, then I followed Tom's suggestion to defeat the empty-table heuristic. select relpages from pg_class where relname = 'icecream'; relpages ---------- 0 (1 row) Ok, so the planner thinks that the parent table is empty. I then ran: update pg_class set relpages = 1 where relname = 'icecream'; HashAggregate (cost=201199.27..201564.18 rows=11228 width=41) (actual time=277.195..304.620 rows=11028 loops=1) -> Nested Loop (cost=4.28..79942.45 rows=4041894 width=41) (actual time=0.227..231.181 rows=11028 loops=1) Join Filter: (r.widget_id = rc.widget_id) -> Append (cost=4.28..1163.69 rows=11228 width=21) (actual time=0.125..40.834 rows=11028 loops=1) -> Bitmap Heap Scan on widget r (cost=4.28..12.75 rows=1 width=48) (actual time=0.022..0.022 rows=0 loops=1) Recheck Cond: (widgetset_id = 5) Filter: (widget_type_id = 4) -> Bitmap Index Scan on widget_widgetset_id_idx (cost=0.00..4.28 rows=4 width=0) (actual time=0.019..0.019 rows=0 loops=1) Index Cond: (widgetset_id = 5) -> Index Scan using widget_part_5_widget_widget_type_id_idx on widget_part_5 r (cost=0.00..1150.94 rows=11227 width=21) (actual time=0.100..18.964 rows=11028 lo ops=1) Index Cond: (widget_type_id = 4) Filter: (widgetset_id = 5) -> Append (cost=0.00..6.99 rows=2 width=24) (actual time=0.008..0.012 rows=1 loops=11028) -> Seq Scan on icecream rc (cost=0.00..0.00 rows=1 width=24) (actual time=0.001..0.001 rows=0 loops=11028) Filter: (rc.dataset_id = 283) -> Index Scan using icecream_part_283_widget_id_idx on icecream_part_283 rc (cost=0.00..6.99 rows=1 width=24) (actual time=0.004..0.006 rows=1 loo ps=11028) Index Cond: (rc.widget_id = r.widget_id) Filter: (rc.dataset_id = 283) Total runtime: 318.634 ms (19 rows) Wow! that fixes it. Thanks you so much!!!! I've been struggling with this issue for 2-3 days. (Also, in the past, I've seen inconsistent performance with this query, which may be the result of the planner sometimes choosing the wrong plan, but I'll chase that down later). Tom said: But maybe we should reconsider the heuristic for tables that are members of inheritance trees --- particularly parents of inheritance trees. I agree. I think postgres should get updated to take this into account. I shouldn't have to set the relpages to 1 for all the empty parent tables that I have partitioned. Should I file this as a bug/enhancement? Also, do I need to worry about about autoanalyze/autovacuum setting back relpages to zero for the parent icecream table? thanks!!! Anish On Sat, Jul 9, 2011 at 10:43 AM, Tom Lane <t...@sss.pgh.pa.us> wrote: > Heikki Linnakangas <heikki.linnakan...@enterprisedb.com> writes: > > On 09.07.2011 00:36, Anish Kejariwal wrote: > >> My guess as to what happened: > >> -because the icecream parent table has zero records, the query optimizer > >> chooses the incorrect execution plan > >> -when I do select * from icecream, the optimizer now knows how many > records > >> are really in the icecream table, by knowing that the icecream table has > >> partitions. > > > "select * from icecream" won't have any direct effect on the > > optimization of subsequent queries. What probably happened is that > > autoanalyze ran in the background while you ran that select, and > > analyzed some of the partitions. Simply waiting a while would've had the > > same effect. > > Yeah. Also, the reason that a manual vacuum on icecream changes things > yet again is that in 9.0 and up, we have a notion of summary stats > across the whole inheritance tree, but autoanalyze hasn't been taught to > gather those. The manual command on the parent table does gather them, > though. > > So what's happening here is that we suddenly have an accurate idea of > the size of the join product as a result of having inheritance summary > stats to estimate with, and that drives the estimated cost of the merge > or hash join down out of the stratosphere. The estimated cost of the > nestloop goes down a lot too, but not as much. > > I experimented with a similar case here, and it seems like a lot of the > remaining error in the nestloop estimate comes from this: > > >> -> Seq Scan on icecream rc (cost=0.00..23.00 rows=5 width=24) (actual > time=0.002..0.002 rows=0 loops=1) > >> Filter: (dataset_id = 281) > > The indexscan on the nonempty child partition is estimated at less than > 10 cost units, so this is a *large* fraction of what the planner sees as > the per-outer-row cost of a nestloop. And with more than 11000 rows on > the other side of the join, that discourages it from using the nestloop. > In reality of course this takes negligible time compared to examining > the child partition. > > Now why is the seqscan cost estimate so large, when actually the parent > icecream table is totally empty? It's because the planner has been > taught to never believe that an empty table is empty. If memory serves, > it's really estimating on an assumption that the table contains 10 pages > and some corresponding number of rows. This is a reasonable defensive > posture when dealing with ordinary tables, I think, since most likely > if the catalogs say the table is empty that's just a leftover from when > it was created. But maybe we should reconsider the heuristic for tables > that are members of inheritance trees --- particularly parents of > inheritance trees. > > I was able to defeat the empty-table heuristic here by doing > > update pg_class set relpages = 1 where relname = 'icecream'; > > and then I started getting much more realistic estimates in my test > case. (It still wanted to use a merge join initially, but after > knocking down random_page_cost it went to the nestloop.) It would > be interesting to see what sorts of results Anish gets with that. > > regards, tom lane >