On Fri, September 1, 2006 22:14, Gregory Stark wrote:

> I think the slow part is trying to figure out whether to count the current
> call as a hit or a miss. How do you determine whether the plan you're
> running is the best plan without replanning the query?

The question of knowing which plan is best _based on what's in the actual
tables_ would remain just as unsolved as it always was.  The scheme addresses
only the opportunity to optimize for pseudo-constant parameters, and it
treats the existing planner as a black box.  If you find a solution to the
problem of inaccurate statistics, it'll probably be more or less orthogonal
to what I'm describing: you could have either one without the other, and
combining them shouldn't be much harder.

I don't think telling hits from misses would be all that hard.  Say a
prepared statement is being invoked, and you're evaluating a candidate plan
for it.  Each parameter falls into one of two sets: those "predicted" by the
plan to have certain values (let's call them P), and those "not predicted"
by the plan because their confidence counters were below the threshold (I'm
tempted to call this set NP, but let's make it Q instead).  Whether a
parameter is in P or in Q can be derived from its confidence counter; in my
previous example, you just take its most-significant bit.
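
To make that concrete, here's a minimal sketch in C (not actual backend
code; the names, the 4-bit counter width, and the textual value comparison
are all just placeholders for illustration):

    #include <stdbool.h>

    #define COUNTER_MAX     15      /* 4-bit saturating counter */
    #define CONFIDENCE_MSB  8       /* counter's MSB; set <=> above threshold */

    typedef struct ParamPredictor
    {
        char   *last_value;         /* textual form of the value seen last */
        int     counter;            /* 0 .. COUNTER_MAX */
    } ParamPredictor;

    /* True if the parameter is in P (predicted); false means it's in Q. */
    static bool
    param_in_P(const ParamPredictor *pred)
    {
        return (pred->counter & CONFIDENCE_MSB) != 0;
    }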

 * For any parameter in P, if the actual value does not match the plan's
prediction, you have a miss.  Can't use this plan.  Fall back to another
plan that applies (your regular old non-specialized plan always does), or if
you have none, write a new one!

If you get through this without finding a mismatch, congratulations: you
have a hit.  The plan you're looking at is applicable to your call.  But
now we see if we can do better:

 * For any parameter in Q, if its actual value matches what the predictor
remembers (i.e. it would have been predicted correctly had its counter been
above the confidence threshold), you increment the counter.  If that lifts
the counter above the threshold, you have room for improving on this plan:
there's a good chance you can re-plan for the case where this parameter is
also a pseudo-constant, without the effort being wasted.  Of course you
could also require a minimum number of invocations between re-plannings to
get a more long-term view (e.g. when different parameters get recognized as
pseudo-constants in subsequent calls, you may not want to re-plan for each
of those calls).  A rough sketch of this whole check follows below.
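
Continuing that sketch (evaluate_candidate(), CandidatePlan and friends are
hypothetical, and I've additionally assumed that a Q-parameter which changes
its value has its counter weakened and its remembered value replaced), the
check against one candidate plan could be:

    #include <string.h>

    typedef enum { PLAN_MISS, PLAN_HIT } PlanMatch;

    typedef struct CandidatePlan
    {
        int              nparams;
        ParamPredictor  *pred;      /* one predictor per parameter */
        /* ... the planner's output would hang off here ... */
    } CandidatePlan;

    static bool
    values_equal(const char *a, const char *b)
    {
        return strcmp(a, b) == 0;
    }

    /*
     * Check one candidate plan against the actual parameter values of the
     * current call.  Returns PLAN_MISS as soon as a P-parameter deviates
     * from its prediction.  Otherwise returns PLAN_HIT, and reports whether
     * any Q-parameter's counter just crossed the confidence threshold.
     */
    static PlanMatch
    evaluate_candidate(CandidatePlan *plan, char **actual,
                       bool *room_for_improvement)
    {
        *room_for_improvement = false;

        for (int i = 0; i < plan->nparams; i++)
        {
            ParamPredictor *p = &plan->pred[i];
            bool            match = values_equal(p->last_value, actual[i]);

            if (param_in_P(p))
            {
                if (!match)
                    return PLAN_MISS;   /* can't use this plan */
            }
            else if (match)
            {
                /* Q-parameter repeated its value: strengthen the counter. */
                if (p->counter < COUNTER_MAX)
                    p->counter++;
                if (p->counter & CONFIDENCE_MSB)
                    *room_for_improvement = true;   /* consider re-planning */
            }
            else
            {
                /* Q-parameter changed: weaken the counter, remember the new
                   value (a real implementation would copy it). */
                if (p->counter > 0)
                    p->counter--;
                p->last_value = actual[i];
            }
        }
        return PLAN_HIT;
    }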

So which plan do you execute if you have more than one applicable candidate?
We can see what works well in practice.  As a starting point I would pick
the one with the largest P (smallest Q), breaking ties in favour of the most
recently generated plan.  I'm assuming we only want one plan for any given
P.

We'd probably want to limit the number of candidate plans per statement to
some very small, fixed number--somewhere between one and four, I'd say; or
maybe one generalized plan plus up to two specialized ones.  With numbers
like that, none of this should be very expensive.  A simple linear match
against 1-4 candidates may be more effective than any attempt to be
clever.
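
In code, building on the same sketch (MAX_CANDIDATES and the oldest-first
ordering of candidates are further assumptions), the selection step could be
as dumb as this:

    #define MAX_CANDIDATES 4

    typedef struct StatementCache
    {
        int             ncandidates;            /* 1 .. MAX_CANDIDATES */
        CandidatePlan   plans[MAX_CANDIDATES];  /* oldest first; plans[0] could
                                                   be the generic plan */
    } StatementCache;

    /* Size of the plan's P set. */
    static int
    count_P(const CandidatePlan *plan)
    {
        int n = 0;

        for (int i = 0; i < plan->nparams; i++)
            if (param_in_P(&plan->pred[i]))
                n++;
        return n;
    }

    /*
     * Pick the applicable candidate with the largest P; ">=" breaks ties in
     * favour of the newest plan because candidates are stored oldest first.
     * Returns -1 if nothing applies, meaning we plan afresh.
     */
    static int
    choose_plan(StatementCache *stmt, char **actual)
    {
        int best = -1;
        int best_P = -1;

        for (int i = 0; i < stmt->ncandidates; i++)
        {
            bool improve;       /* could be used to trigger a re-plan */

            if (evaluate_candidate(&stmt->plans[i], actual, &improve)
                == PLAN_HIT)
            {
                int np = count_P(&stmt->plans[i]);

                if (np >= best_P)
                {
                    best_P = np;
                    best = i;
                }
            }
        }
        return best;
    }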

I must admit I haven't thought through all of the consequences of caching
more than one specialized plan per statement.  For example, we could give
every cached plan its own set of confidence counters, and match an
incoming invocation against each of those; or we could keep just one "most
likely" plan with its associated predictor state, and only consider
previously generated plans if we either miss or find room for improvement
in the predictor.
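
The sketches above effectively implement the first option, since each
CandidatePlan carries its own ParamPredictor array.  The second option could
look roughly like this instead (again purely illustrative):

    /* One shared predictor per statement; each cached plan only records
       which parameters it was specialized for. */
    typedef struct StatementCacheAlt
    {
        int              nparams;
        ParamPredictor  *pred;              /* shared predictor state */
        int              ncandidates;
        struct
        {
            unsigned int specialized_mask;  /* bit i set: plan assumes
                                               parameter i is pseudo-constant */
            /* ... planner output ... */
        } plans[MAX_CANDIDATES];
    } StatementCacheAlt;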


Jeroen


