Hi Harshad,
Some comments inline...
Harshad wrote:
Rick Hillegas wrote:
Depending on your data, the two queries could return different results,
so Bryan's faster query is not a transformation which Derby would
perform automatically.
Oh yes! Although my data ensures that the subquery would return distinct
results, Derby wouldn't know about it. Is there a way to say that the
combination of two columns is unique?
When I tried to specify DISTINCT explicitly like this:
select name,version,release,time from PKG where PKG.id in (select DISTINCT
PROVIDES.id from PROVIDES where (PROVIDES.name = ?))
...I didn't get any change in the performance figures. That is probably because of the
"DISTINCT elimination in IN sub-queries" mentioned in "Tuning derby".
I would have thought that the DISTINCT is not redundant and actually allows for
the transformation into a normal join.
I think that, for instance, a problem arises when there are M tuples in
PROVIDES with name='Fred' but there are < M distinct ids in those
tuples. In that case, you can imagine data sets on which the original
query returns < M tuples but the transformed query returns M tuples.
Nevertheless, it looks to me as though your
original query almost qualifies for the EXISTS transformation documented
in the Derby Tuning Guide.
Why do you say "almost qualifies"; where is it not qualifying? And, will the
EXISTS transformation speed up the query as much as the transformation to a normal join?
I haven't looked under the hood here. The Tuning Guide doesn't
completely describe the optimizer's reasoning. A smart optimizer won't
unconditionally flatten subqueries.
Without having looked at the code (but knowing who wrote it), I suspect
that the EXISTS transformation may be a normal join combined with a
tuple filter which throws away duplicates on a sorted leg of the join.
It may be that Derby did apply the EXISTS transformation--and that
degraded rather than improved the performance. If you are interested in
seeing the plan which Derby selected, please consult the following wiki
pages:
http://wiki.apache.org/db-derby/PerformanceDiagnosisTips
http://wiki.apache.org/db-derby/StmtExecutionPlan
I realize that I'm raising more questions than I'm answering.
Hope this is a little useful,
-Rick
Thanks to all who have responded on this thread.
I have learnt not to trust the execution time of Statement.executeQuery(). It
looks like the actual table access is made lazily (as and when the result set
is accessed).