On Fri, 8 Aug 2003 11:06:56 -0700, Sean Chittenden
<[EMAIL PROTECTED]> wrote:
>[...] it'd seem as though an avg depth of
>nodes in index * tuples_fetched * (random_io_cost * indexCorrelation)
>would be closer than where we are now...

Index depth does not belong here because we walk down the index only
once per index scan not once per tuple.  It might be part of the
startup cost.

The rest of your formula doesn't seem right either, because it yields
higher costs for better correlation.  Did you mean
        random_io_cost * (1 - abs(indexCorrelation))?

FWIW, for small effective_cache_size max_IO_cost is almost equal to
tuples_fetched * random_page_cost.  So your formula (with the
corrections presumed above) boils down to ignoring
effective_cache_size and linear interpolation between 0 and
max_IO_cost.

>It's very possible that cost_index() is wrong, but it seems as though
>after some testing as if PostgreSQL _overly_ _favors_ the use of
>indexes:

Was this an unpatched backend?  What were the values of
effective_cache_size and random_page_cost?

># SET enable_seqscan = true; SET enable_indexscan = true;
>SET
>SET
># EXPLAIN ANALYZE SELECT * FROM report_user_cat_count AS rucc WHERE utc_date > 
>'2002-10-01'::TIMESTAMP WITH TIME ZONE;
>INFO:  cost_seqscan: run_cost: 21472.687500
>        startup_cost: 0.000000
>
>INFO:  cost_index: run_cost: 21154.308116
>        startup_cost: 0.000000
>        indexCorrelation: 0.999729
>                                                                                    
> QUERY PLAN
>-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> Index Scan using report_user_cat_count_utc_date_id_idx on report_user_cat_count rucc 
>  (cost=0.00..21154.31 rows=705954 width=64) (actual time=91.36..6625.79 rows=704840 
> loops=1)
>   Index Cond: (utc_date > '2002-10-01 00:00:00-07'::timestamp with time zone)
> Total runtime: 11292.68 msec
>(3 rows)

"actual time=91.36..6625.79" but "Total runtime: 11292.68 msec"!
Where did those 4.7 seconds go?

># SET enable_seqscan = true; SET enable_indexscan = false;
>SET
>SET
># EXPLAIN ANALYZE SELECT * FROM report_user_cat_count AS rucc WHERE utc_date > 
>'2002-10-01'::TIMESTAMP WITH TIME ZONE;
>INFO:  cost_seqscan: run_cost: 21472.687500
>        startup_cost: 0.000000
>
>INFO:  cost_index: run_cost: 21154.308116
>        startup_cost: 100000000.000000
>        indexCorrelation: 0.999729
>                                                              QUERY PLAN
>---------------------------------------------------------------------------------------------------------------------------------------
> Seq Scan on report_user_cat_count rucc  (cost=0.00..21472.69 rows=705954 width=64) 
> (actual time=1091.45..7441.19 rows=704840 loops=1)
>   Filter: (utc_date > '2002-10-01 00:00:00-07'::timestamp with time zone)
> Total runtime: 10506.44 msec
>(3 rows)

Same here: "actual time=1091.45..7441.19" but "Total runtime: 10506.44
msec"  -  more than 3 seconds lost.

When we ignore total runtime and look at actual time we get

             seq    idx
estimated   21473  21154
actual       7441   6626

This doesn't look too bad, IMHO.

BTW, I believe that with your example (single-column index, almost
perfect correlation, index cond selects almost all tuples) all
interpolation methods give an index cost estimation that is very close
to seq scan cost, and the actual runtimes show that this is correct.

>Which I find surprising and humorous given the popular belief is, mine
>included, contrary to those results.

How many tuples are in report_user_cat_count?  What are the stats for
report_user_cat_count.utc_date?

>  I can say with pretty high
>confidence that the patch to use a geometric mean isn't correct after
>having done real world testing as its break even point is vastly
>incorrect and only uses an index when there are less than 9,000 rows
>to fetch, a far cry from the 490K break even I found while testing.

Could you elaborate, please?  The intention of my patch was to favour
index scans more than the current implementation does.  If it does
not, you have found a bug in my patch.  Did you test the other
interpolation methods?

>What I did find interesting, however, was that it does work better at
>determining the use of multi-column indexes,

Yes, because it computes the correlation for a two-column-index as
        correlation_of_first_index_column * 0.95
instead of
        correlation_of_first_index_column / 2

> but I think that's
>because the geometric mean pessimizes the value of indexCorrelation,
>which gets pretty skewed when using a multi-column index.

I don't understand this.

># CREATE INDEX report_user_cat_count_utc_date_user_id_idx ON report_user_cat_count 
>(user_id,utc_date);
># CLUSTER report_user_cat_count_utc_date_user_id_idx ON report_user_cat_count;
># ANALYZE report_user_cat_count;
># SET enable_seqscan = true; SET enable_indexscan = true;
>SET
>SET
># EXPLAIN ANALYZE SELECT * FROM report_user_cat_count AS rucc WHERE user_id < 1000 
>AND utc_date > '2002-01-01'::TIMESTAMP WITH TIME ZONE;
>INFO:  cost_seqscan: run_cost: 23685.025000
>        startup_cost: 0.000000
>
>INFO:  cost_index: run_cost: 366295.018684
>        startup_cost: 0.000000
>        indexCorrelation: 0.500000
                           ^^^
This is certainly not with my patch.  The current implementation gives
ca. 366000 for pages_fetched = 122000 and random_page_cost = 4, which
looks plausible for 133000 tuples and (too?) small
effective_cache_size.

>                                                             QUERY PLAN
>------------------------------------------------------------------------------------------------------------------------------------
> Seq Scan on report_user_cat_count rucc  (cost=0.00..23685.03 rows=133918 width=64) 
> (actual time=0.28..6100.85 rows=129941 loops=1)
>   Filter: ((user_id < 1000) AND (utc_date > '2002-01-01 00:00:00-08'::timestamp with 
> time zone))
> Total runtime: 6649.21 msec
>(3 rows)
>
># SET enable_seqscan = false; SET enable_indexscan = true;
>SET
>SET
># EXPLAIN ANALYZE SELECT * FROM report_user_cat_count AS rucc WHERE user_id < 1000 
>AND utc_date > '2002-01-01'::TIMESTAMP WITH TIME ZONE;
>INFO:  cost_seqscan: run_cost: 23685.025000
>        startup_cost: 100000000.000000
>
>INFO:  cost_index: run_cost: 366295.018684
>        startup_cost: 0.000000
>        indexCorrelation: 0.500000
>                                                                                      
>  QUERY PLAN
>-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> Index Scan using report_user_cat_count_utc_date_user_id_idx on report_user_cat_count 
> rucc  (cost=0.00..366295.02 rows=133918 width=64) (actual time=53.91..3110.42 
> rows=129941 loops=1)
>   Index Cond: ((user_id < 1000) AND (utc_date > '2002-01-01 00:00:00-08'::timestamp 
> with time zone))
> Total runtime: 3667.47 msec
>(3 rows)

Which shows that Postgres does not "_overly_ _favor_ the use of
indexes".

>If I manually set the indexCorrelation to 1.0, however, the planner
>chooses the right plan on its own

Ok, with indexCorrelation == 1.0 we don't have to discuss interpolation
methods, because they all return min_IO_cost.

>Which suggests to me that line 3964 in
>./src/backend/utils/adt/selfuncs.c isn't right for multi-column
>indexes, esp for indexes that are clustered.

Agreed.

>  I don't know how to
>address this though...

I guess there is no chance without index statistics.

>FWIW, this is an old data/schema from a defunct project that I can
>give people access to if they'd like. 

Is there a dump available for download?

Servus
 Manfred
