On Thu, May 19, 2005 at 09:31:47AM -0700, Josh Berkus wrote:
> > In the
> > case of testing index scans, we need to be able to vary correlation,
> > which so far I've been doing by ordering by different columns. I suspect
> > it will also be important to test with different tuple sizes. There's
> > also the question of whether or not the cache should be flushed for each
> > run or not.
> > Does this sound like a good way to determine actual costs for index
> > scans (and hopefully other access methods in the future)? If so, what
> > would be a good way to implement this?
> Well, the problem is that what we need to index scans is a formula, rather
> than a graph. The usefulness of benchmarking index scan cost is so that we
True, but having a graphical representation of how different input
variables (such as correlation) affect runtime is a good way to derive
such a formula, or at least point you in the right direction.
> can test our formula for accuracy and precision. However, such a formula
> *does* need to take into account concurrent activity, updates, etc ... that
> is, it needs to approximately estimate the relative cost on a live database,
> not a test one.
Well, that raises an interesting issue, because AFAIK none of the cost
estimate functions currently do that. Heck, AFAIK even the piggyback seqscan
code doesn't take other seqscans into account.
Another issue is: what state should the buffers/disk cache be in? In the
thread that kicked all this off Tom noted that my results were skewed
because of caching, so I changed my tests to flush the disk cache as
effectively as I could (by running a program that would consume enough
available memory to just start the box swapping), but I don't think
that's necessarily realistic. Though at least it should preclude the
need to run tests multiple times on an otherwise idle box in order to
'pre-seed' the cache (not that that's any more realistic). If you don't
use one of these techniques you end up with results that depend on what
test was run before the current one...
> This is also going to be a moving target because Tom's in-memory-bitmapping
> changes relative cost equations.
I thought those all had seperate costing functions...? In any case, if
we have a cost estimation tool it will make it much easier to derive
cost estimation functions.
> I think a first step would be, in fact, to develop a tool that allows us to
> put EXPLAIN ANALYZE results in a database table. Without that, there is no
> possibility of statistical-scale analysis.
Rather than trying to parse all possible output, ISTM it would be much
better if there was a way to access the info directly. Would it be
difficult to have an option that produces output that is a set of
different fields? I'm thinking something like:
Level (basically how far something's indented)
Parent node (what node a child node is feeding)
node_id (some kind of identifier for each step)
other (something to hold index condition, filter, etc)
But ultimately, I'm not sure if this is really required or not, because
I don't see that we need to use explain when running queries. In fact,
it's possibly desireable that we don't, because of the overhead it
incurs. We would want to log an explain (maybe analyze) just to make
sure we knew what the optimizer was doing, but I think we shouldn't need
the info to produce cost estimates.
Jim C. Nasby, Database Consultant [EMAIL PROTECTED]
Give your computer some brain candy! www.distributed.net Team #1828
Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]