Re: [PERFORM] index v. seqscan for certain values

Jeremy Dunn Mon, 12 Apr 2004 12:12:48 -0700

> "Jeremy Dunn" <[EMAIL PROTECTED]> writes:
> > The question: why does the planner consider a sequential scan to be 
> > better for these top 10 values?
> 
> At some point a seqscan *will* be better.  In the limit, if 
> the key being sought is common enough to occur on every page 
> of the table, it's certain that a seqscan will require less 
> I/O than an indexscan (because reading the index isn't 
> actually saving you any heap fetches). In practice the 
> breakeven point is less than that because Unix kernels are 
> better at handling sequential than random access.
> 
> Your gripe appears to be basically that the planner's idea of 
> the breakeven point is off a bit.  It looks to me like it's 
> within about a factor of 2 of being right, though, which is 
> not all that bad when it's using generic cost parameters.


Agreed.  However, given that count(*) is a question that can be answered
_solely_ using the index (without reference to the actual data blocks),
I'd expect that the break-even point would be considerably higher than
the < 3% (~38,000 / ~1.3M) I'm currently getting.  Does PG not use
solely the index in this situation??

> > A) alter table xxx alter column cid set statistics 500;    
> >     analyze xxx;
> > This does not affect the results.
> 
> It probably improved the accuracy of the row count estimates, 
> no? The estimate you show for cid=7191032 is off by more than 
> 25% (37765 vs 50792), which seems like a lot of error for one 
> of the most common values in the table.  (I hope that was 
> with default stats target and not 500.)  That leads directly 
> to a 25% overestimate of the cost of an indexscan, while 
> having IIRC no impact on the cost of a seqscan. Since the 
> cost ratio was more than 25%, this didn't change the selected 
> plan, but you want to fix that error as best you can before 
> you move on to tweaking cost parameters.

Actually it made them worse!  Yes, this was the default statistics (10).
When I just tried it again with a value of 300, analyze, then run the
query, I get a *worse* result for an estimate.  I don't understand this.


   alter table xxx alter column cid set statistics 300;
   analyze emailrcpts;
   set random_page_cost to 2;
   explain analyze select count(*) from xxx where cid=7191032;

   Aggregate  (cost=20563.28..20563.28 rows=1 width=0) (actual
time=7653.90..7653.90 rows=1 loops=1)
  ->  Index Scan using xxx_cid on xxx  (cost=0.00..20535.82 rows=10983
width=0) (actual time=72.24..7602.38 rows=37765 loops=1)
   Total runtime: 7654.14 msec

Now it estimates I have only 10,983 rows (~3x too low) instead of the
old estimate 50,792 (1.3x too high).  Why is that ??

Anyway, a workable solution seems to be using a lower value for
Random_Page_Cost.  Thanks to everyone who replied with this answer. 

> Also it is likely appropriate to increase 
> effective_cache_size, which is awfully small in the default 
> configuration.  I'd set that to something related to your 
> available RAM before trying to home in on a suitable random_page_cost.

We have ours set to the default value of 1000, which does seem low for a
system with 1GB of RAM.  We'll up this once we figure out what's
available.  Then tweak the Random_Page_Cost appropriately at that point.

I'd still like to understand the strangeness above, if anyone can shed
light.

- Jeremy


---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend

Re: [PERFORM] index v. seqscan for certain values

Reply via email to