[
http://issues.apache.org/jira/browse/DERBY-407?page=comments#action_12316672 ]
A B commented on DERBY-407:
---------------------------
If you trim out the extra lines of the diff, the failure comes down to the
following difference in query plan (this is just one tiny part of a very large
query plan):
Master file --
Left result set = Table scan on CLASSIFICATION_VALUES (chosen by the
optimizer)
Right result set = Hash scan on REPOSITORYOBJECTRESOURCE
Actual (when the test fails) --
Left result set = Index scan on REPOSITORYOBJECTRESOURCE (chosen by the
optimizer)
Right result set = Hash scan on CLASSIFICATION_VALUES
I don't know enough about how the optimizer works to know if one of these plans
is inherently "better" than the other (I think it depends on how many rows are
in the underlying tables). However, I have confirmed (through discussion with
Mike Matrigali) that the plan chosen by the optimizer _can_ be affected by the
speed of the machine. Internally, the optimizer does a check to make sure that
it doesn't spend too much time trying to find the best query plan--and it will
do a "timeout" if it thinks it's taking too long. In particular, if the
optimizer has already spent more time choosing a query plan than it thinks the
best query plan so far is going to take to execute, the optimizer will stop
looking at plans and just use the best plan so far.
What this means is that, on a faster machine, the optimizer can potentially see
more query plans before timing out than it would on a slower machine--and thus
it might find a better plan on the faster machine.
So far as I can tell, that's what's happening with this predicatesIntoViews
failure. While trying to prove this point, I noticed that there is an
undocumented user property called "derby.optimizer.noTimeout" that keeps the
optimizer from timing out--i.e. the optimizer will process ALL query plans and
then pick the best one, regardless of how long it takes to do so. When I set
this property to true and ran the test, it passed 100% of the time, regardless
of the speed of my CPU.
That said, I think an initial work-around for this problem is to set the
derby.optimizer.noTimeout property for this test to true (using the harness).
That should allow the test to pass on a consistent basis until someone can look
further at the massive query plan to see what's happening and to determine if
this particular failure is an optimizer bug (see notes below) or is just
'working as designed'.
-----------
Other notes
-----------
While investigating this failure, I noticed a couple of behavioral oddities
that could potentially help to determine if this is an optimizer bug. I'm
including them here for tracking purposes. Note that these behaviors are only
relevant if the noTimeout property is NOT set (if it _is_ set, then the test
will always pass, regardless of what I describe below).
1) In an attempt to debug the failure, I added a bunch of "select count(*)"
queries to the test to see how large the tables used by the query plan were,
and also to see how large the result set of the query was. As it turned out,
adding such queries caused the test to consistently choose the second query
plan listed above--i.e. the "fail" plan--regardless of the speed of my CPU.
The specific queries I added are these (I added them just before execution of
the query that's seeing the diff):
SELECT COUNT(*) FROM XR.CLASSIFICATION_VALUES;
SELECT COUNT(*) FROM XR.REPOSITORYOBJECTRESOURCE;
SELECT COUNT(*) FROM XR.REPOSITORYOBJECTALLVERSIONVIEW;
SELECT count(*)
FROM xr.repositoryobjectallversionview rov
where (uname = UPPER('two') or uname = UPPER('my project'))
and (versionid in
(select versionid
from xr.versionlabel
where UPPER(label) = UPPER('Snapshot')))
and deletedate is null;
It would appear that executing these SELECT COUNT(*) queries causes some
internal state/row counts to be updated, which in turn results in a different
query plan than what the test usually expects (even on faster machines). But
if that's the case, then it seems (to me) like the updated state/row counts
should cause the optimizer to choose a _better_ plan than what it would
otherwise choose--so it's not clear to me what's going on there. Perhaps the
updated state/row counts are incorrect or are being handled incorrectly,
causing the optimizer to choose the wrong query plan...?
2) If I leave noTimeout set to false (its default) and do NOT add the "select
count(*)" queries mentioned above, but I add system shutdown/reconnect logic to
the test just before executing the query in question, then the test passes
regardless of how fast/slow my CPU is. Mike Matrigali suggested that this is
because system shutdown causes all state/row counts to be flushed to disk, so
reconnecting and then executing the query might allow the optimizer to use the
latest state/row counts to find the best plan. Of course, if that's true then
it seems like we'd have at least a slight contradiction to the behavior
described in #1 above, where updated stats/row counts are causing the test to
_fail_...
Note that if the "select count(*)" queries are included in the test, then the
test will fail regardless of whether or not we shutdown/reconnect.
The interaction between these behaviors and the noTimeout flag can be
summarized as follows.
Let "TO" correspond to a noTimeout property that is TRUE.
Then "!TO" is a noTimeout property that is FALSE (the
default).
Let "SC" correspond to the presence of "select count(*)"
queries in the test. Then "!SC" means that such queries
are NOT included.
Let "SH" correspond to the presence of shutdown/reconnect
logic in the test. Then "!SH" means that such logic is
is NOT included.
Then we have the following behavior (at least, on my own laptop):
!TO && !SC && !SH ==> Intermittent failure on slower machines.
!TO && !SC && SH ==> Test passes regardless of CPU speed.
!TO && SC && !SH ==> Test fails regardless of CPU speed.
!TO && SC && SH ==> Test fails regardless of CPU speed.
TO && !SC && !SH ==> Test passes regardless of CPU speed.
TO && !SC && SH ==> Test passes regardless of CPU speed.
TO && SC && !SH ==> Test passes regardless of CPU speed.
TO && SC && SH ==> Test passes regardless of CPU speed.
Again, I'm not sure how much of this information is relevant info, but since
these are all things I noticed while investigating the problem, I'm including
them for the sake of completeness/tracking, in case they'll save someone some
time in the future...
In the end, since setting noTimeout to TRUE seems like the simplest and most
reliable way to get the test to pass on a consistent basis, that is what I'm
proposing as a tempory workaround until further investigation can be carried
out. I will post a patch to do that shortly.
> predicatesIntoViews test failure on slow machine in Derby 10.1 branch version
> 201931
> ------------------------------------------------------------------------------------
>
> Key: DERBY-407
> URL: http://issues.apache.org/jira/browse/DERBY-407
> Project: Derby
> Type: Test
> Components: Test
> Versions: 10.1.1.0
> Environment: Java Version: 1.4.2_04
> Java Vendor: Hewlett-Packard Company
> Java home: /usr/tandem/java/jre
> Java classpath:
> /home/john/j2ee/derbyTests/derby-10.1-branch/jars/derby.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbytools.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbynet.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbyclient.jar:/home/john/j2ee/derbyTests/common-jars/db2jcc.jar:/home/john/j2ee/derbyTests/common-jars/db2jcc_license_c.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbyTesting.jar:/home/john/j2ee/derbyTests/common-jars/jakarta-oro-2.0.8.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbyLocale_de_DE.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbyLocale_es.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbyLocale_fr.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbyLocale_it.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbyLocale_ja_JP.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbyLocale_ko_KR.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbyLocale_pt_BR.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbyLocale_zh_CN.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbyLocale_zh_TW.jar:/usr/tandem/java/lib/classes.zip:/usr/tandem/javaextv20/lib/tdmext.jar:/usr/tandem/jdbcMp/current/lib/sqlmp.jar:/usr/tandem/jdbcMx/current/lib/jdbcMx.jar:.
> OS name: NONSTOP_KERNEL
> OS architecture: mips
> OS version: G06
> Java user name: SUPER.JOHN
> Java user home: /home/john
> Java user dir: /home/john/j2ee/derbyTests/derby-10.1-branch/testOutput
> java.specification.name: Java Platform API Specification
> java.specification.version: 1.4
> Reporter: John Sisson
> Attachments: derbylang_report.txt
>
> The predicatesIntoViews test (part of the derbylang suite) fails when running
> the tests on a busy/slow system, but they pass when I run them on my fast
> Windows box.
> It appears that the failure could be due to different optimizations being
> performed due to the difference in speed of the machines. This needs to be
> confirmed that it is the case of the test failure. I will attach the test
> summary to this issue.
> I also noticed that the test report said 0% fail even though one test failed.
> All derbylang tests passed on Windows XP.
> Thanks,
> John
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira