On Jun 25, 2013, at 4:36 PM, Tom Lane wrote:
>> That seems intuitive, but some of the estimates need to be made
>> before all such information is available. Maybe we can do
>> something about that some day....
>> Maybe someone else will jump in here with more details than I can
>> provide (at least without hours digging in the source code).
>
> It does not attempt to match up query WHERE clauses with indexes during
> selectivity estimation, so the existence of a multi-column unique
> constraint wouldn't help it improve the estimate.
thanks tom, that answered my question.
> In the case at hand, I doubt that a better result rowcount estimate
> would have changed the planner's opinion of how to do the join. The OP
> seems to be imagining that 2 million index probes into a large table
> would be cheap, but that's hardly true. It's quite likely that the
> mergejoin actually is the best way to do the query. If it isn't really
> best on his hardware, I would think that indicates a need for some
> tuning of the cost parameters. Another thing that might be helpful for
> working with such large tables is increasing work_mem, to make hashes
> and sorts run faster.
i apologize if i seemed to presume to know the best query plan. i fully
understand that the query planner sometimes makes unintuitive decisions which
turn out to be for the best, having experienced that firsthand many times.
since i nudged my company toward postgresql (instead of mysql/sqlite), we've
been very happy with it. also, having tried my hand (and failed) at writing
good gist selectivity estimators, i think i have a not-completely-ignorant
10,000 ft view of the trade-offs the planner weighs, like when sequential
scans beat repeated index lookups, et cetera. i'm writing because i found an
example which shows yet another thing i don't understand about the query
planner, and i am trying to understand it better. you've already answered my
main question (whether unique constraints are used to help row estimation).
there are a couple more issues i don't quite understand:
1) when i hint to the query planner (via a limit) not to expect more than
number-of-rows-in-jointable rows, it switches to a nested loop + index scan,
but with the same row estimates. here is the plan from my first email:
 Limit  (cost=0.00..178452647.11 rows=2500000 width=28)
   ->  Nested Loop  (cost=0.00..1306127545.35 rows=18297957 width=28)
         ->  Seq Scan on jointable  (cost=0.00..35867.11 rows=2066911 width=24)
         ->  Index Scan using bigtable_pkey on bigtable  (cost=0.00..631.88 rows=1 width=28)
               Index Cond: ((id1 = jointable.id1) AND (id2 = jointable.id2) AND (id3 = jointable.id3) AND (id4 = jointable.id4) AND (id5 = jointable.id5))
(5 rows)
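(for context, the query had roughly this shape -- i'm reconstructing it from
the plan here, so treat the details as approximate; the exact statement is in
my first email:)

select bigtable.*
from jointable
join bigtable using (id1, id2, id3, id4, id5)
limit 2500000;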
before, i was misreading this plan as saying the planner would execute the
nested loop fully (rows=18 million) and then limit the results. i now read it
as saying the nested loop will be short-circuited once it has produced enough
rows. if that's true, it seems to imply that in a plan with deeply nested
loops, one should take the inner row estimates with a grain of salt, since
there may be limits (potentially many levels further out) which can
short-circuit them. am i wrong about this?
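(the costs do seem consistent with that reading: the limit node's total cost
looks like the nested loop's total cost scaled by the fraction of rows
consumed, 1306127545.35 * (2500000 / 18297957) ~= 178452647, which matches
the cost=0.00..178452647.11 shown above.)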
2) when doing the sort + merge join, it chooses to sort bigtable rather than
use an index scan. i've tried to give hints by requesting the results in
primary key order, but it keeps sorting by a permutation of the primary key
and then re-sorting the join results at the end. so evidently the planner
thinks the random seek cost of an index scan outweighs the sequential read +
sort (which i find surprising, but again i am happy to be surprised.) that
seems fine for a query which touches the whole table, but i can't come up
with a query which would ever favor an index scan. for example this:
explain select * from bigtable order by (id1, id2, id3, id4, id5) limit 1;
                                          QUERY PLAN
--------------------------------------------------------------------------------------------
 Limit  (cost=91923132.04..91923132.04 rows=1 width=28)
   ->  Sort  (cost=91923132.04..99842047.88 rows=3167566336 width=28)
         Sort Key: (ROW(id1, id2, id3, id4, id5))
         ->  Seq Scan on bigtable  (cost=0.00..76085300.36 rows=3167566336 width=28)
(4 rows)
(apologies, bigtable has grown since i first started this thread.) shouldn't
an index scan definitely be fastest here? you don't need to touch the whole
table or the whole index. or is there something i have misconfigured?
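one thing i notice on re-reading the plan: the sort key is printed as
ROW(id1, id2, id3, id4, id5), so maybe the parentheses make the planner treat
my order by as a single row-valued expression rather than five separate sort
columns, in which case it can't match it against bigtable_pkey. if so (and
this is just a guess on my part), the unparenthesized form might be what i
actually want:

explain select * from bigtable order by id1, id2, id3, id4, id5 limit 1;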
best regards, ben