Hi Brent,

Thanks for the extra information. And thanks to Army for the pointers to extra documentation. Hopefully our optimizer guru, Jeff Lichtman, will join the discussion, but he may be on vacation. Please bear with me as I try to understand your approach. I am going to use some shorthand to describe the nodes in the query plan and not use the actual class names in trunk/java/engine/org/apache/derby/impl/sql/execute, which can be a bit obscure. I hope this doesn't confuse the issue.

It sounds to me from the bug description that the Optimizer is choosing the following plan

Plan 1:

 TableScan

and that you would like to teach the Optimizer to choose this plan, instead:

Plan 2:

               Join
               /     \
 IndexScan   TableScan

I am not sure that Plan 2 will perform much better on the hard case in the bug report. I'm afraid that 10,000 rows may still be scanned out of the index before being filtered and passed up to the Join node. It's quite possible that the Optimizer did, in fact, consider this plan and rejected it because it actually was more expensive. It looks to me as though the IndexScan only has start and end qualifiers and doesn't have the concept of nextKey. What you want to end up with, I think, is a plan that probes the index for a small number of key values rather than scanning it end to end.

It would be interesting to build nextKey support into the IndexScan so that the scan could be repositioned for IN lists. Somebody please correct me if the support is already there and I'm just not seeing it. Lacking that support, it might be simpler to build a plan out of the existing tinker-toys. Something like the following might give you the probing you need:

Plan 3:

                    Join
                     /   \
              Union  TableScan
              /     \
 IndexScan   IndexScan

Each IndexScan would probe for a single key value. If you could just get the Opimizer to consider this plan, then I'm cautiously optimistic that the existing cost-based decisions would favor Plan 3 over Plans 1 and 2.

I hope this is helpful.

Regards,
-Rick


Brent Verner wrote:

Hiya Rick,

  Thanks for you helpful reply!  I spent some time over the weekend
snooping around (mostly) InListOperatorNode and OptimizerImpl, but never found out _where_ the table scan was chosen over an index scan
for finding the IN matches...I really wanted to find it before begging
for someone to point out the obvious :-) Specifically, I got all tangled up around the dynamically generated classes which implement the chosen query plan. Also, I didn't ever find any QueryPlan-like class, but I guess this what the dynamic class(es) implement.


 What did I want to do once I got there?  Pseudo-code says it best:

   // ISTR that the bound IN values are ordered and unique.  If not,
   // this should be done...

   if( column is indexed ){
       if( index is unique ){
           // this should be a guaranteed win.  We know we'll only get
           // one match per IN value
           do index scan
       }
       else if( table size > SOME_MAGIC_SIZE ){
           // SOME_MAGIC_SIZE would represent the size of (the) table
           // at which using the index would be reasonably known to
           // be more expensive than scanning the table.
           do index scan
       }
       else {
           do table scan
       }
   }
   else {
       do table scan
   }


 Is that a workable approach (excepting the SOME_MAGIC_SIZE bit)?

I'll probably have some time later this week to immerse myself in locating the magic location to apply this logic. A _little_ pointer would be helpful: maybe a brief overview of how a query plan is determined -- I
suspect his is somewhere around where trulyTheBestTableAccess (or whatever
it's really called :-)) is defined. I'd rather have to bang my head against this wall a bit, to get a better understanding of things, instead
of being pointed immediately to the solution -- my own version of smelling
the roses, I guess ;-)

Cheers!
 Brent


[2006-06-19 11:30] Rick Hillegas said:
| Hi Brent,
| | Sounds like you're off to a good start. From the initial bug report, it | looks like there's a good idea about which heuristic is being | mis-applied here. Once you've studied the optimizer papers, I recommend | that you post some high-level candidate solutions. Try to avoid | optimizer jargon and concentrate on simple descriptions: | | o What query plan would you rather see? | o What heuristic would the optimizer apply that would lead it to your | preferred plan? | o How would the optimizer decide to apply the new heuristic rather than | the old one? | | I think you'll get some good feedback if your post contains the phrase | "Attention, optimizer experts." I think that the optimizer enthusiasts | on the list will give you good feedback: | | o Maybe they can think of a better query plan or heuristic.
| o Maybe they can see some awkward corner cases in your  heuristic.
| o They can advise you on whether your heuristic will short-circuit other | optimizer choices. | o They can advise you on whether your heuristic will cause an explosion | in the time that the optimizer takes. | | Thanks for wanting to scratch this itch! | | Regards,
| -Rick
| | Brent Verner wrote: | | >Hi,
| >
| >    I've recently found need for an embedded java db and only Derby seems
| >even close to handling the task, however the broken query planning for IN
| >clauses makes it unusable :-(.  I've decided to eschew an embedded db
| >in favor of PG for now, but I'd really like to be able to use Derby in the
| >near(ish) future for deployment.
| >
| >    I'd like to try to fix the query planning around the IN clause.  I'm
| >reviewing the internals papers right now, but I'd appreciate any input
| >that might point me in the right direction :-).
| >
| >cheers!
| > Brent
| >
| > | >

Reply via email to