A previous posting mentioning clusters prompted me to revisit some earlier tests done on clustered and unclustered data. It appears that currently (7.1beta5) the optimizer is unaware of any clustering on a table. How important is that?

To answer this question I used my "pet" data warehouse tables:

Table "fact1" : 3000000 rows, ~350 MB

 Attribute |  Type   | Distribution
-----------+---------+----------------------------------------
 d0key     | integer | 3000 distinct values 0-9000, clustered
 d1key     | integer |
 val       | integer |
 filler    | text    |

Index: fact1_pk ( d0key,d0key )  (the cluster "key")

Table "fact2" : 3000000 rows, ~350 MB

 Attribute |  Type   | Distribution
-----------+---------+------------------------------------------------
 d0key     | integer | 3000 distinct values 0-9000, uniformly spread
 d1key     | integer |
 val       | integer |
 filler    | text    |

Index: fact2_pk ( d0key,d0key )

The sample queries used to shed some light on the nature of the difference are, firstly the index scan:

explain select count(*) from fact1 where d0key between 200 and 279;

Aggregate  (cost=58664.62..58664.62 rows=1 width=0)
  ->  Index Scan using fact1_pk on fact1  (cost=0.00..58598.72 rows=26360 width=0)

and the sequential scan:

explain select count(*) from fact1 where d0key between 200 and 280;

Aggregate  (cost=59020.73..59020.73 rows=1 width=0)
  ->  Seq Scan on fact1  (cost=0.00..58954.00 rows=26693 width=0)

and analogous versions for fact2 (with the same execution plans).

On the unclustered table fact2 the optimizer correctly assesses the point at which to switch from an index scan to a sequential scan: both queries take about 30 s. However, on the clustered table fact1 the (same) choice results in a jump from 1 s for the index scan to 30 s for the sequential scan. (This was the guts of the previous research ...
bear with me, those of you who read the last article.)

So how long should an index scan be used for? Some experimentation led me to adjust the "where" clause in my queries to:

where d0key between 0 and 4500

This produces a query plan of:

Aggregate  (cost=62692.75..62692.75 rows=1 width=0)
  ->  Seq Scan on fact1  (cost=0.00..58954.00 rows=1495498 width=0)

Coercing the optimizer with a brutal "set cpu_tuple_cost = 0.4" gives:

Aggregate  (cost=868673.82..868673.82 rows=1 width=0)
  ->  Index Scan using fact1_pk on fact1  (cost=0.00..864935.08 rows=1495498 width=0)

(Note that these scan 1500000 rows, i.e. half the data.)

Testing these queries on fact1 gives run times of about 35 s for both, so it is worthwhile to keep using index scans for up to 50% of the (clustered) table data. I found this result interesting, as I was thinking more like 15-20% of the table data would be the limit.

The answer to the original question (finally) is "it is pretty important", as knowledge of the clustering drastically changes the optimal access path.

So what to do if you know you have clustered data? (Either via cluster, or "it just happens to go in that way".) Tampering with the various *cost parameters to encourage index scans seems to be the only solution (other suggestions welcome here), but it tends to be too global in effect: for example, trying the previous query on the (unclustered) fact2 with cpu_tuple_cost = 0.4 takes more than 300 s - I got tired of waiting... So be careful out there...

Cheers

Mark
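One way to keep the cost tampering from being "too global" is to change the parameter only around the queries that hit the clustered table, then put it back. A minimal sketch (the 0.4 value is just the figure experimented with above, not a recommendation, and 0.01 is the built-in default for cpu_tuple_cost):

```sql
-- Encourage an index scan only for this session's query against
-- the clustered table fact1 ...
SET cpu_tuple_cost = 0.4;

select count(*) from fact1 where d0key between 0 and 4500;

-- ... then restore the built-in default so other queries
-- (e.g. against the unclustered fact2) are not penalised.
SET cpu_tuple_cost = 0.01;
```

SET only affects the current session, so concurrent backends keep the default costing either way.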
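For completeness, if the data does not "just happen to go in that way", the ordering can be imposed with the CLUSTER command; a sketch against the tables above (note the physical order is not maintained as rows are later inserted or updated, and, as discussed, the optimizer will not take the clustering into account when costing plans):

```sql
-- Physically reorder fact1 by the fact1_pk index.
-- The table is rewritten, so this takes an exclusive lock and
-- time proportional to the table size (~350 MB here).
CLUSTER fact1_pk ON fact1;
```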