While working on changes for DERBY-805 and DERBY-1007, I noticed that the
"timeout" mechanism for the Derby optimizer is never reset for subqueries. As
is often necessary when talking about the optimizer, let me give an example of
what I mean.
If we have a query such as:
select <...> from
(select t1.i, t2.j from t1, t2 where <...>) X1,
T3
where <...>
then we would have one "outer" query and one "subquery". The outer query would
be "select <...> from X1, T3", the subquery would be "select t1.i, t2.j from t1,
t2".
In this case Derby will create two instances of OptimizerImpl: one for the outer
query (call it OI_OQ) and one for the subquery (call it OI_SQ). Each
OptimizerImpl has its own timeout "clock" that it initializes at creation
time--but never resets. If timeout occurs, the OptimizerImpl will stop
searching for "the" best plan and will just take the best plan found so far.
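To make the "clock" behavior concrete, here is a minimal sketch (not Derby's
actual code) of the mechanism described above. The field names mirror
OptimizerImpl's timeOptimizationStarted/timeExceeded; everything else
(the class, the time limit, the checkTimeout() method) is hypothetical:

```java
// Hypothetical sketch of a per-OptimizerImpl timeout clock: it starts
// at creation time and, once exceeded, stays exceeded because nothing
// ever resets it.
public class TimeoutClockSketch {

    private final long timeLimitMillis;      // illustrative limit
    private long timeOptimizationStarted;    // set once, at creation
    private boolean timeExceeded;

    public TimeoutClockSketch(long timeLimitMillis) {
        this.timeLimitMillis = timeLimitMillis;
        this.timeOptimizationStarted = System.currentTimeMillis();
        this.timeExceeded = false;
    }

    /** Checked before costing each permutation; once true, stays true. */
    public boolean checkTimeout() {
        if (!timeExceeded
                && (System.currentTimeMillis() - timeOptimizationStarted)
                        > timeLimitMillis) {
            timeExceeded = true;
        }
        return timeExceeded;
    }

    public static void main(String[] args) throws InterruptedException {
        TimeoutClockSketch clock = new TimeoutClockSketch(100);
        System.out.println(clock.checkTimeout()); // limit not yet reached
        Thread.sleep(250);                        // let the limit pass
        System.out.println(clock.checkTimeout()); // now exceeded
        System.out.println(clock.checkTimeout()); // still exceeded: no reset
    }
}
```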
That said, for every permutation of the outer query a call will be made to
optimize the subquery. To simplify things, let's assume there are only two
permutations of the outer query: one with join order {X1, T3} and another with
join order {T3, X1}.
Now let's say we're looking at the first permutation {X1, T3}. OI_OQ will make
a call to optimize the subquery represented by OI_SQ. Let's further say that
the subquery tries some permutation {T1, T2} and then times out. It then
returns the plan information for {T1, T2} to the outer query. The outer query,
which has *not* yet timed out, then decides to try its second permutation {T3,
X1}. So it again makes a call to optimize the subquery. In this case, the
subquery--which has already timed out--will *immediately* return without trying
to optimize anything. The outer query will then make a decision about its
second permutation based on the un-optimized subquery's plan results.
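The interaction above can be sketched as follows. All names here are
illustrative, not Derby's API; the point is only that once the subquery's
flag is set, later calls from the outer query get back a plan that was
costed under an earlier (and possibly irrelevant) outer permutation:

```java
import java.util.List;

// Hypothetical sketch of the outer-query/subquery interaction: the
// outer query's permutation loop calls into the subquery's optimizer,
// but once the subquery has timed out it returns its saved best plan
// immediately, ignoring the new outer context.
public class NestedTimeoutSketch {

    static class SubqueryOptimizer {
        boolean timeExceeded = false;
        String bestPlanSoFar = null;
        int realOptimizationPasses = 0;

        String optimize(String outerPermutation) {
            if (timeExceeded) {
                // Already timed out on an earlier round: return the old
                // plan without doing any new optimization work.
                return bestPlanSoFar;
            }
            realOptimizationPasses++;
            bestPlanSoFar = "{T1, T2} given " + outerPermutation;
            // Pretend the clock expired during this round.
            timeExceeded = true;
            return bestPlanSoFar;
        }
    }

    public static void main(String[] args) {
        SubqueryOptimizer sub = new SubqueryOptimizer();
        for (String perm : List.of("{X1, T3}", "{T3, X1}")) {
            System.out.println(perm + " -> " + sub.optimize(perm));
        }
        // Only the first round did real work; the second round reused a
        // plan costed under the first round's outer join order.
        System.out.println("passes=" + sub.realOptimizationPasses);
    }
}
```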
My question is: is this intentional behavior?
On the one hand, I can sort of see the logic in not trying to optimize the
subquery after the first time, because if we've "timed out" then we just want to
"wrap things up" as quickly as possible.
On the other hand, the outer query--which is the query representing the
top-level statement that the user is executing--has *not* yet timed out, so it
seems to me like the second call to optimize the subquery should get a "clean
start" instead of timing out right away.
This hasn't really been an issue to date because the "best plan" chosen by the
subquery is typically independent of the outer query's current permutation--with
the exception of "outerCost", which is passed in from the outer query and is
factored into the subquery's cost estimates. Because of this relative
independence, the plan chosen by the subquery would rarely (if ever?) change
with different permutations of the outer query, so if the subquery timed out
once, there was little point in re-optimizing it later.
With my changes for DERBY-805, though, I'm introducing the notion of pushing
predicates from outer queries down into subqueries--which means that the outer
join order can have a very significant impact on the plan chosen by the
subquery. But because the timeout mechanism is never reset, we could end up
skipping the second optimization phase of the subquery, which means we never get
a chance to see how much the outer predicates can help, and thus we could end up
skipping over some plans that have the potential to give us significant
performance improvement.
It's hard to come up with a concrete example of this because it depends on
optimizer timeout, which varies with different machines. But in my work with
DERBY-805 I've seen cases where the optimizer ignores pushed predicates because
of subquery timeout and thus ends up choosing a (hugely) sub-optimal plan.
All of that said, in an attempt to resolve this issue I added the following two
lines to the "prepForNextRound()" method in OptimizerImpl (that method was only
recently added as part of a change for DERBY-805):
+ /* Reset timeout state since we want to timeout on a per-round
+ * basis. Otherwise if we timed out during a previous round and
+ * then we get here for another round, we'll immediately
+ * "timeout" again before optimizing any of the Optimizables--
+ * which means we could be missing out on some helpful
+ * optimizations that weren't available for the previous
+ * round (esp. use of indexes/qualifiers that are made possible
+ * by predicates pushed down from the outer query). So by
+ * resetting the timeout state here, we give ourselves a chance
+ * to do some legitimate optimization before (potentially) timing
+ * out again. This could allow us to find some access plans that
+ * are significantly better than anything we found in previous
+ * rounds.
+ */
+ timeOptimizationStarted = System.currentTimeMillis();
+ timeExceeded = false;
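To illustrate the intended effect of those two lines, here is a small
standalone sketch (again with illustrative names, not Derby's code): with the
reset in prepForNextRound(), each round gets a fresh chance to do real
optimization work before (potentially) timing out again:

```java
// Hypothetical sketch of per-round timeout reset: without the reset,
// the second call to optimizeRound() would return immediately; with
// it, both rounds do real work.
public class ResetPerRoundSketch {

    long timeOptimizationStarted = System.currentTimeMillis();
    boolean timeExceeded = false;
    int realOptimizationPasses = 0;

    void prepForNextRound() {
        // The proposed two-line change: reset timeout state so that we
        // time out on a per-round basis.
        timeOptimizationStarted = System.currentTimeMillis();
        timeExceeded = false;
    }

    void optimizeRound() {
        if (timeExceeded) {
            return;                 // would skip the whole round
        }
        realOptimizationPasses++;   // real optimization work happens here
        timeExceeded = true;        // pretend this round then timed out
    }

    public static void main(String[] args) {
        ResetPerRoundSketch opt = new ResetPerRoundSketch();
        opt.optimizeRound();        // round 1: does real work, times out
        opt.prepForNextRound();     // reset before round 2
        opt.optimizeRound();        // round 2: does real work again
        System.out.println("passes=" + opt.realOptimizationPasses);
    }
}
```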
I applied this change locally in combination with the DERBY-1007 patch and ran
derbyall on Red Hat Linux with the IBM 1.4.2 JVM, with no failures.
Does anyone out there know if this is a safe change? I would appreciate any
feedback/advice here. If I don't hear any objections, I'll file a sub-task to
DERBY-805 for this issue and post the above patch there.
Many thanks in advance to anyone who might have input/feedback here.
Army