On Jun 25, 2013, at 4:36 PM, Tom Lane wrote:
>> That seems intuitive, but some of the estimates need to be made
>> before all such information is available. Maybe we can do
>> something about that some day....
>> Maybe someone else will jump in here with more details than I can
>> provide (at least without hours digging in the source code).
>
> It does not attempt to match up query WHERE clauses with indexes during
> selectivity estimation, so the existence of a multi-column unique
> constraint wouldn't help it improve the estimate.
thanks tom, that answered my question.
> In the case at hand, I doubt that a better result rowcount estimate
> would have changed the planner's opinion of how to do the join. The OP
> seems to be imagining that 2 million index probes into a large table
> would be cheap, but that's hardly true. It's quite likely that the
> mergejoin actually is the best way to do the query. If it isn't really
> best on his hardware, I would think that indicates a need for some
> tuning of the cost parameters. Another thing that might be helpful for
> working with such large tables is increasing work_mem, to make hashes
> and sorts run faster.
i apologize if i seemed to presume to know the best query plan. i fully
understand that the query planner sometimes makes unintuitive decisions which
turn out to be for the best, having experienced that firsthand many times.
since i nudged my company toward postgresql (instead of mysql/sqlite), we've
been very happy with it. also, having tried my hand (and failed) at writing
good gist selectivity estimators, i think i have a not-completely-ignorant
10,000 ft view of the trade-offs the planner weighs, like when sequential
scans beat repeated index lookups, et cetera. i'm writing because i found an
example which shows yet another thing i don't understand about the query
planner, and i am trying to understand it better. you've already answered my
main question (whether unique constraints are used to help row estimation).
there are a couple more issues i don't quite understand:
1) when i hint to the query planner (via a limit) not to expect more than
number-of-rows-in-jointable rows, it switches to a nested loop + index scan,
but with the same row estimates. here is the plan from my first email:
 Limit  (cost=0.00..178452647.11 rows=2500000 width=28)
   ->  Nested Loop  (cost=0.00..1306127545.35 rows=18297957 width=28)
         ->  Seq Scan on jointable  (cost=0.00..35867.11 rows=2066911 width=24)
         ->  Index Scan using bigtable_pkey on bigtable  (cost=0.00..631.88 rows=1 width=28)
               Index Cond: ((id1 = jointable.id1) AND (id2 = jointable.id2) AND (id3 = jointable.id3) AND (id4 = jointable.id4) AND (id5 = jointable.id5))
(5 rows)
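(for context, the query had roughly this shape -- i'm reconstructing it from
the plan here, so treat the details as approximate; the exact statement is in
my first email:)

select bigtable.*
from jointable
join bigtable using (id1, id2, id3, id4, id5)
limit 2500000;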
before, i was misreading this plan as saying the planner would execute the
nested loop fully (rows=18 million) and then limit the results. i now read it
as saying the nested loop will be short-circuited once it has produced enough
rows. if that's true, it seems to imply that in a plan with deeply nested
loops, one should take the inner row estimates with a grain of salt, since
there may be limits (potentially many levels further out) which can
short-circuit them. am i wrong about this?
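(the costs do seem consistent with that reading: the limit node's total cost
looks like the nested loop's total cost scaled by the fraction of rows
consumed, 1306127545.35 * (2500000 / 18297957) ~= 178452647, which matches
the cost=0.00..178452647.11 shown above.)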
2) when doing the sort + merge join, it chooses to sort bigtable rather than
use an index scan. i've tried to give hints by requesting the results in
primary key order, but it keeps sorting by a permutation of the primary key
and then re-sorting the join results at the end. so evidently the planner
thinks the random seek cost of an index scan outweighs the sequential read +
sort (which i find surprising, but again i am happy to be surprised.) that
seems fine for a query which touches the whole table, but i can't come up
with a query which would ever favor an index scan. for example this:
explain select * from bigtable order by (id1, id2, id3, id4, id5) limit 1;
                                          QUERY PLAN
--------------------------------------------------------------------------------------------
 Limit  (cost=91923132.04..91923132.04 rows=1 width=28)
   ->  Sort  (cost=91923132.04..99842047.88 rows=3167566336 width=28)
         Sort Key: (ROW(id1, id2, id3, id4, id5))
         ->  Seq Scan on bigtable  (cost=0.00..76085300.36 rows=3167566336 width=28)
(4 rows)
(apologies, bigtable has grown since i first started this thread.) shouldn't
an index scan definitely be fastest here? you don't need to touch the whole
table or the whole index. or is there something i have misconfigured?
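one thing i notice on re-reading the plan: the sort key is printed as
ROW(id1, id2, id3, id4, id5), so maybe the parentheses make the planner treat
my order by as a single row-valued expression rather than five separate sort
columns, in which case it can't match it against bigtable_pkey. if so (and
this is just a guess on my part), the unparenthesized form might be what i
actually want:

explain select * from bigtable order by id1, id2, id3, id4, id5 limit 1;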
best regards, ben