> Sean Chittenden <[EMAIL PROTECTED]> writes:
> > Getting the planner to pick
> > using the index to filter out data inserted in the last 3 days over
> > doing a seq scan...  well, I don't know how you could do that without
> > changing the random_page_cost.
> This sounds a *whole* lot like a correlation issue.  If the data in
> question were scattered randomly in the table, it's likely that an
> indexscan would be a loser.  The recently-inserted data is probably
> clustered near the end of the table (especially if they're doing
> VACUUM FULL after data purges; are they?).  But the planner's
> correlation stats are much too crude to recognize that situation, if
> the rest of the table is not well-ordered.

Data isn't scattered randomly from what I can tell and is basically
already clustered just because the data is inserted linearly and
based off of time.  I don't think they're doing a VACUUM FULL after a
purge, but I'll double check on that on Monday when they get in.  Is
there an easy way of determining or setting a planner stat to suggest
that data is ordered around a column in a permanent way?  CLUSTER has
always been a one shot deal and its effects wear off quickly depending
on the way that data is inserted.  It seems as though that this would
be a circumstance in which preallocated disk space would be a win
(that way data wouldn't always be appended to the heap and could be
inserted in order, of most use for non-time related data: ex, some
non-unique ID).

> If their typical process involves a periodic data purge and then a
> VACUUM FULL, it might be worth experimenting with doing a CLUSTER on
> the timestamp index instead of the VACUUM FULL.  The CLUSTER would
> reclaim space as effectively as VACUUM FULL + REINDEX, and it would
> leave the table with an unmistakable 1.0 correlation ... which
> should tilt the planner towards an indexscan without needing a
> physically impossible random_page_cost to do it.  I think CLUSTER
> would probably be a little slower than VACUUM FULL but it's hard to
> be sure without trying.

Hrm, I understand what clustering does, I'm just not convinced that
it'll "fix" this performance problem unless CLUSTER sets some kind of
hint that ANALYZE uses to modify the way in which it collects
statistics.  Like I said, I'll let you know on Monday when they're
back in the shop, but I'm not holding my breath.  I know
random_page_cost is set to something physically impossible, but in
terms of performance, it's always been the biggest win for me to set
this puppy quite low.  Bug in the planner, or documentation
surrounding what this knob does, I'm not sure, but setting this to a
low value consistently yields good results for me.  Faster the drive,
the lower the random_page_cost value.  *shrug*

> That's one heck of a poor estimate for the number of rows returned.
> > ->  Seq Scan on mss_fwevent  (cost=0.00..223312.60 rows=168478 width=12) (actual 
> > time=24253.66..24319.87 rows=320 loops=1)

The stats for the columns are already set to 1000 to aid with
this... don't know what else I can do here.  Having the planner off by
as much as even half the actual size isn't uncommon in my experience.


Sean Chittenden

---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?


Reply via email to