Hi, I'm the lead developer on the project this concerns (forgive my newbiness on this list).

We tried a couple of scenarios with effective_cache_size=60000, cpu_index_tuple_cost=0.0001 and random_page_cost=2 with no change in the plan.
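
For anyone wanting to reproduce this, the same settings can be tried per-session with SET rather than editing postgresql.conf (values as above; using the transactions_posted query from below as the test case):

-- Per-session overrides with the values from the scenarios above;
-- effective_cache_size is in 8K disk pages here, so 60000 is roughly 470MB.
SET effective_cache_size = 60000;
SET cpu_index_tuple_cost = 0.0001;
SET random_page_cost = 2;

-- Then re-run the query to see whether the plan changes:
EXPLAIN ANALYSE
SELECT * FROM transactions_posted
 WHERE transaction_date >= '2003-9-1'
   AND transaction_date <  '2003-10-1';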

explain analyse select * from tablename where transaction_date >= '2003-9-1' and transaction_date < '2003-10-1';
------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on tablename (cost=0.00..348199.14 rows=1180724 width=91) (actual time=7727.668..36286.898 rows=579238 loops=1)
Filter: ((transaction_date >= '2003-09-01 00:00:00+00'::timestamp with time zone) AND (transaction_date < '2003-10-01 00:00:00+00'::timestamp with time zone))
Total runtime: 36625.351 ms


explain analyse select * from transactions_posted where product_id = 2;
------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on transactions_posted (cost=0.00..319767.95 rows=6785237 width=91) (actual time=0.091..35596.328 rows=5713877 loops=1)
Filter: (product_id = 2)
Total runtime: 38685.373 ms


The product_id alone gives a difference of about a million rows between estimate and actual (6,785,237 vs. 5,713,877, a factor of roughly 1.2), vs. the factor of 2 on the transaction_date (1,180,724 vs. 579,238).

Dan Manley

Tom Lane writes:

Andrew Sullivan <[EMAIL PROTECTED]> writes:


The statistics on transaction_date and product_id are set to 1000,
and everything is analysed nicely.  But I'm getting a poor plan
because the estimate of the number of rows to be returned is about
double what it actually is:

explain analyse select * from transactions_posted where
transaction_date >= '2003-9-1' and transaction_date < '2003-10-1' and
product_id = 2;

Are the estimates accurate for queries on the two columns individually, ie

    ... where transaction_date >= '2003-9-1' and transaction_date < '2003-10-1'
    ... where product_id = 2

If so, the problem is that there's a correlation between
transaction_date and product_id, which the system cannot model because
it has no multi-column statistics.
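
One way to see that correlation directly is to compare the combined row count with what independent selectivities would predict; a rough sketch using the table and column names from the queries above (these are full sequential scans, so run it off-hours):

-- Rough check of the independence assumption.
SELECT
    (SELECT count(*) FROM transactions_posted)                AS total_rows,
    (SELECT count(*) FROM transactions_posted
      WHERE transaction_date >= '2003-9-1'
        AND transaction_date <  '2003-10-1')                  AS september_rows,
    (SELECT count(*) FROM transactions_posted
      WHERE product_id = 2)                                   AS product_2_rows,
    (SELECT count(*) FROM transactions_posted
      WHERE transaction_date >= '2003-9-1'
        AND transaction_date <  '2003-10-1'
        AND product_id = 2)                                   AS both_rows;
-- Without multi-column statistics the planner effectively estimates
-- both_rows as september_rows * product_2_rows / total_rows; if the
-- real both_rows is far from that, the two columns are correlated.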

However, given that the estimate is only off by about a factor of 2,
you'd still be getting the wrong plan even if the estimate were perfect,
because the estimated costs differ by nearly a factor of 3.

Given the actual runtimes, I'm thinking maybe you want to reduce
random_page_cost.  What are you using for that now?
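
For a quick experiment it can be overridden for just the current session; the 1.5 below is only an example value to try, not a recommendation:

SHOW random_page_cost;          -- current value

SET random_page_cost = 1.5;     -- example value, session-local
EXPLAIN ANALYSE
SELECT * FROM transactions_posted
 WHERE transaction_date >= '2003-9-1'
   AND transaction_date <  '2003-10-1';

RESET random_page_cost;         -- back to the configured default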

regards, tom lane
