While working on changes for DERBY-805 and DERBY-1007, I noticed that the "timeout" mechanism for the Derby optimizer is never reset for subqueries. As is usually required when talking about the optimizer, let me give an example of what I mean.

If we have a query such as:

select <...> from
  (select t1.i, t2.j from t1, t2 where <...>) X1,
   T3
where <...>

then we would have one "outer" query and one "subquery". The outer query would be "select <...> from X1, T3", and the subquery would be "select t1.i, t2.j from t1, t2".

In this case Derby will create two instances of OptimizerImpl: one for the outer query (call it OI_OQ) and one for the subquery (call it OI_SQ). Each OptimizerImpl has its own timeout "clock" that it initializes at creation time--but never resets. If timeout occurs, the OptimizerImpl will stop searching for "the" best plan and will just take the best plan found so far.
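To make those mechanics concrete, here is a rough sketch of the timeout bookkeeping, assuming illustrative names throughout except for timeOptimizationStarted and timeExceeded, which are the actual fields touched by the patch below:

```java
// Rough sketch of the per-OptimizerImpl timeout clock.  Only
// timeOptimizationStarted and timeExceeded are real field names;
// maxTime and checkTimeout() are illustrative stand-ins.
class TimeoutClockSketch {
    long timeOptimizationStarted = System.currentTimeMillis(); // set once, at creation
    boolean timeExceeded = false;
    final long maxTime; // in the real code this comes from cost estimates

    TimeoutClockSketch(long maxTime) { this.maxTime = maxTime; }

    boolean checkTimeout() {
        if (!timeExceeded) {
            timeExceeded =
                (System.currentTimeMillis() - timeOptimizationStarted) > maxTime;
        }
        // Nothing ever resets timeExceeded, so once it flips to true it
        // stays true for every later call into this OptimizerImpl.
        return timeExceeded;
    }
}
```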

With that in mind, for every permutation of the outer query a call will be made to optimize the subquery. To simplify things, let's assume there are only two permutations of the outer query: one with join order {X1, T3} and another with join order {T3, X1}.

Now let's say we're looking at the first permutation {X1, T3}. OI_OQ will make a call to optimize the subquery represented by OI_SQ. Let's further say that the subquery tries some permutation {T1, T2} and then times out. It then returns the plan information for {T1, T2} to the outer query. The outer query, which has *not* yet timed out, then decides to try its second permutation {T3, X1}. So it again makes a call to optimize the subquery. In this case, the subquery--which has already timed out--will *immediately* return without trying to optimize anything. The outer query will then make a decision about its second permutation based on the un-optimized subquery's plan results.
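The call sequence above can be modeled in toy form. The subquery optimizer (OI_SQ) does real work during the outer query's first permutation and times out; the second call then returns immediately with stale plan info. All class and method names here are illustrative, not Derby's actual API:

```java
// Toy model of the scenario described above.  The subquery optimizer
// tries {T1, T2}, times out, and then bails out of every later call.
class SubqueryOptimizerToy {
    boolean timeExceeded = false;
    String bestPlan = null;

    // Returns true if any real optimization happened on this call.
    boolean optimize(String outerPermutation) {
        if (timeExceeded) {
            return false;        // immediate return, keeping the old plan
        }
        bestPlan = "{T1, T2}";   // tries one join order...
        timeExceeded = true;     // ...then the clock runs out
        return true;
    }
}
```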

My question is: is this intentional behavior?

On the one hand, I can sort of see the logic in not trying to optimize the subquery after the first time, because if we've "timed out" then we just want to "wrap things up" as quickly as possible.

On the other hand, the outer query--which is the query representing the top-level statement that the user is executing--has *not* yet timed out, so it seems to me like the second call to optimize the subquery should get a "clean start" instead of timing out right away.

This hasn't really been an issue to date because the "best plan" chosen by the subquery is typically independent of the outer query's current permutation--with the exception of "outerCost", which is passed in from the outer query and is factored into the subquery's cost estimates. Because of this relative independence, the plan chosen by the subquery would rarely (if ever?) change with different permutations of the outer query, so if the subquery timed out once there was no point in trying to re-optimize it again later.
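A back-of-the-envelope way to see why that independence holds: if outerCost enters the subquery's estimate only as a uniform scaling factor (a simplifying assumption for illustration, not a claim about Derby's actual cost model), it scales every candidate subquery plan equally and so cannot change which plan wins:

```java
// Illustrative only: when the outer cost scales all candidate subquery
// plans by the same factor, the cheapest plan is the same for every
// outer permutation -- so re-optimizing after a timeout rarely paid off.
class CostSketch {
    static double totalCost(double perRowPlanCost, double outerRowCount) {
        // cost of running the subquery once per outer row
        return perRowPlanCost * outerRowCount;
    }

    // Returns 0 if plan A wins, 1 if plan B wins.
    static int cheaper(double planA, double planB, double outerRows) {
        return totalCost(planA, outerRows) <= totalCost(planB, outerRows) ? 0 : 1;
    }
}
```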

With my changes for DERBY-805, though, I'm introducing the notion of pushing predicates from outer queries down into subqueries--which means that the outer join order can have a very significant impact on the plan chosen by the subquery. But because the timeout mechanism is never reset, we could end up skipping the second optimization phase of the subquery, which means we never get a chance to see how much the outer predicates can help, and thus we could end up skipping over some plans that have the potential to give us significant performance improvement.

It's hard to come up with a concrete example of this because it depends on optimizer timeout, which varies with different machines. But in my work with DERBY-805 I've seen cases where the optimizer ignores pushed predicates because of subquery timeout and thus ends up choosing a (hugely) sub-optimal plan.

All of that said, in an attempt to resolve this issue I added the following two lines to the "prepForNextRound()" method in OptimizerImpl (that method was only recently added as part of a change for DERBY-805):

+       /* Reset timeout state since we want to timeout on a per-round
+        * basis.  Otherwise if we timed out during a previous round and
+        * then we get here for another round, we'll immediately
+        * "timeout" again before optimizing any of the Optimizables--
+        * which means we could be missing out on some helpful
+        * optimizations that weren't available for the previous
+        * round (esp. use of indexes/qualifiers that are made possible
+        * by predicates pushed down from the outer query).  So by
+        * resetting the timeout state here, we give ourselves a chance
+        * to do some legitimate optimization before (potentially) timing
+        * out again.  This could allow us to find some access plans that
+        * are significantly better than anything we found in previous
+        * rounds.
+        */
+       timeOptimizationStarted = System.currentTimeMillis();
+       timeExceeded = false;
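The effect of that reset, again in toy form. Only prepForNextRound, timeOptimizationStarted, and timeExceeded are real names from the patch; the rest is illustrative:

```java
// Toy showing the behavior with the reset in place: calling
// prepForNextRound() between rounds clears the timeout state, so the
// subquery optimizer does real work again instead of bailing out.
class ResettableOptimizerToy {
    long timeOptimizationStarted = System.currentTimeMillis();
    boolean timeExceeded = false;

    void prepForNextRound() {
        // the two lines from the patch above
        timeOptimizationStarted = System.currentTimeMillis();
        timeExceeded = false;
    }

    // Returns true if any real optimization happened on this call.
    boolean optimize() {
        if (timeExceeded) return false;
        timeExceeded = true; // suppose we time out after one permutation
        return true;
    }
}
```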

I applied this change locally in combination with the DERBY-1007 patch and ran derbyall on Linux Red Hat with IBM 1.4.2 with no failures.

Does anyone out there know if this is a safe change? I would appreciate any feedback/advice here. If I don't hear any objections, I'll file a sub-task to DERBY-805 for this issue and post the above patch there.

Many thanks in advance to anyone who might have input/feedback here.
Army
