Hi Brent,
Thanks for the extra information. And thanks to Army for the pointers to
extra documentation. Hopefully our optimizer guru, Jeff Lichtman, will
join the discussion, but he may be on vacation. Please bear with me as I
try to understand your approach. I am going to use some shorthand to
describe the nodes in the query plan and not use the actual class names
in trunk/java/engine/org/apache/derby/impl/sql/execute, which can be a
bit obscure. I hope this doesn't confuse the issue.
It sounds to me from the bug description that the Optimizer is choosing
the following plan
Plan 1:
TableScan
and that you would like to teach the Optimizer to choose this plan, instead:
Plan 2:
Join
/ \
IndexScan TableScan
I am not sure that Plan 2 will perform much better on the hard case in
the bug report. I'm afraid that 10,000 rows may still be scanned out of
the index before being filtered and passed up to the Join node. It's
quite possible that the Optimizer did, in fact, consider this plan and
rejected it because it actually was more expensive. It looks to me as
though the IndexScan only has start and end qualifiers and doesn't have
the concept of nextKey. What you want to end up with, I think, is a plan
that probes the index for a small number of key values rather than
scanning it end to end.
It would be interesting to build nextKey support into the IndexScan so
that the scan could be repositioned for IN lists. Somebody please
correct me if the support is already there and I'm just not seeing it.
Lacking that support, it might be simpler to build a plan out of the
existing tinker-toys. Something like the following might give you the
probing you need:
Plan 3:
Join
/ \
Union TableScan
/ \
IndexScan IndexScan
Each IndexScan would probe for a single key value. If you could just get
the Opimizer to consider this plan, then I'm cautiously optimistic that
the existing cost-based decisions would favor Plan 3 over Plans 1 and 2.
I hope this is helpful.
Regards,
-Rick
Brent Verner wrote:
Hiya Rick,
Thanks for you helpful reply! I spent some time over the weekend
snooping around (mostly) InListOperatorNode and OptimizerImpl, but
never found out _where_ the table scan was chosen over an index scan
for finding the IN matches...I really wanted to find it before begging
for someone to point out the obvious :-) Specifically, I got all
tangled up around the dynamically generated classes which implement
the chosen query plan. Also, I didn't ever find any QueryPlan-like
class, but I guess this what the dynamic class(es) implement.
What did I want to do once I got there? Pseudo-code says it best:
// ISTR that the bound IN values are ordered and unique. If not,
// this should be done...
if( column is indexed ){
if( index is unique ){
// this should be a guaranteed win. We know we'll only get
// one match per IN value
do index scan
}
else if( table size > SOME_MAGIC_SIZE ){
// SOME_MAGIC_SIZE would represent the size of (the) table
// at which using the index would be reasonably known to
// be more expensive than scanning the table.
do index scan
}
else {
do table scan
}
}
else {
do table scan
}
Is that a workable approach (excepting the SOME_MAGIC_SIZE bit)?
I'll probably have some time later this week to immerse myself in locating
the magic location to apply this logic. A _little_ pointer would be
helpful: maybe a brief overview of how a query plan is determined -- I
suspect his is somewhere around where trulyTheBestTableAccess (or whatever
it's really called :-)) is defined. I'd rather have to bang my head
against this wall a bit, to get a better understanding of things, instead
of being pointed immediately to the solution -- my own version of smelling
the roses, I guess ;-)
Cheers!
Brent
[2006-06-19 11:30] Rick Hillegas said:
| Hi Brent,
|
| Sounds like you're off to a good start. From the initial bug report, it
| looks like there's a good idea about which heuristic is being
| mis-applied here. Once you've studied the optimizer papers, I recommend
| that you post some high-level candidate solutions. Try to avoid
| optimizer jargon and concentrate on simple descriptions:
|
| o What query plan would you rather see?
| o What heuristic would the optimizer apply that would lead it to your
| preferred plan?
| o How would the optimizer decide to apply the new heuristic rather than
| the old one?
|
| I think you'll get some good feedback if your post contains the phrase
| "Attention, optimizer experts." I think that the optimizer enthusiasts
| on the list will give you good feedback:
|
| o Maybe they can think of a better query plan or heuristic.
| o Maybe they can see some awkward corner cases in your heuristic.
| o They can advise you on whether your heuristic will short-circuit other
| optimizer choices.
| o They can advise you on whether your heuristic will cause an explosion
| in the time that the optimizer takes.
|
| Thanks for wanting to scratch this itch!
|
| Regards,
| -Rick
|
| Brent Verner wrote:
|
| >Hi,
| >
| > I've recently found need for an embedded java db and only Derby seems
| >even close to handling the task, however the broken query planning for IN
| >clauses makes it unusable :-(. I've decided to eschew an embedded db
| >in favor of PG for now, but I'd really like to be able to use Derby in the
| >near(ish) future for deployment.
| >
| > I'd like to try to fix the query planning around the IN clause. I'm
| >reviewing the internals papers right now, but I'd appreciate any input
| >that might point me in the right direction :-).
| >
| >cheers!
| > Brent
| >
| >
| >