[ 
http://issues.apache.org/jira/browse/DERBY-407?page=comments#action_12316672 ] 

A B commented on DERBY-407:
---------------------------


If you trim out the extra lines of the diff, the failure comes down to the 
following difference in query plan (this is just one tiny part of a very large 
query plan):

Master file --

  Left result set  = Table scan on CLASSIFICATION_VALUES (chosen by the 
optimizer)
  Right result set = Hash scan on REPOSITORYOBJECTRESOURCE

Actual (when the test fails) --

  Left result set = Index scan on REPOSITORYOBJECTRESOURCE (chosen by the 
optimizer)
  Right result set = Hash scan on CLASSIFICATION_VALUES

I don't know enough about how the optimizer works to know if one of these plans 
is inherently "better" than the other (I think it depends on how many rows are 
in the underlying tables).  However, I have confirmed (through discussion with 
Mike Matrigali) that the plan chosen by the optimizer _can_ be affected by the 
speed of the machine.  Internally, the optimizer does a check to make sure that 
it doesn't spend too much time trying to find the best query plan--and it will 
do a "timeout" if it thinks it's taking too long.  In particular, if the 
optimizer has already spent more time choosing a query plan than it thinks the 
best query plan so far is going to take to execute, the optimizer will stop 
looking at plans and just use the best plan so far.

What this means is that, on a faster machine, the optimizer can potentially see 
more query plans before timing out than it would on a slower machine--and thus 
it might find a better plan on the faster machine.

So far as I can tell, that's what's happening with this predicatesIntoViews 
failure.  While trying to prove this point, I noticed that there is an 
undocumented user property called "derby.optimizer.noTimeout" that keeps the 
optimizer from timing out--i.e. the optimizer will process ALL query plans and 
then pick the best one, regardless of how long it takes to do so.  When I set 
this property to true and ran the test, it passed 100% of the time, regardless 
of the speed of my CPU.

That said, I think an initial work-around for this problem is to set the 
derby.optimizer.noTimeout property for this test to true (using the harness).  
That should allow the test to pass on a consistent basis until someone can look 
further at the massive query plan to see what's happening and to determine if 
this particular failure is an optimizer bug (see notes below) or is just 
'working as designed'.


-----------
Other notes
-----------

While investigating this failure, I noticed a couple of behavioral oddities 
that could potentially help to determine if this is an optimizer bug.  I'm 
including them here for tracking purposes.  Note that these behaviors are only 
relevant if the noTimeout property is NOT set (if it _is_ set, then the test 
will always pass, regardless of what I describe below).

1) In an attempt to debug the failure, I added a bunch of "select count(*)" 
queries to the test to see how large the tables used by the query plan were, 
and also to see how large the result set of the query was.  As it turned out, 
adding such queries caused the test to consistently choose the second query 
plan listed above--i.e. the "fail" plan--regardless of the speed of my CPU.  
The specific queries I added are these (I added them just before execution of 
the query that's seeing the diff):

  SELECT COUNT(*) FROM XR.CLASSIFICATION_VALUES;
  SELECT COUNT(*) FROM XR.REPOSITORYOBJECTRESOURCE;
  SELECT COUNT(*) FROM XR.REPOSITORYOBJECTALLVERSIONVIEW;
  SELECT count(*)
  FROM xr.repositoryobjectallversionview rov 
  where (uname = UPPER('two') or uname = UPPER('my project'))
  and (versionid in
      (select versionid
       from xr.versionlabel
       where UPPER(label) = UPPER('Snapshot')))
    and deletedate is null;

It would appear that executing these SELECT COUNT(*) queries causes some 
internal state/row counts to be updated, which in turn results in a different 
query plan than what the test usually expects (even on faster machines).  But 
if that's the case, then it seems (to me) like the updated state/row counts 
should cause the optimizer to choose a _better_ plan than what it would 
otherwise choose--so it's not clear to me what's going on there.  Perhaps the 
updated state/row counts are incorrect or are being handled incorrectly, 
causing the optimizer to choose the wrong query plan...?

2) If I leave noTimeout set to false (its default) and do NOT add the "select 
count(*)" queries mentioned above, but I add system shutdown/reconnect logic to 
the test just before executing the query in question, then the test passes 
regardless of how fast/slow my CPU is.  Mike Matrigali suggested that this is 
because system shutdown causes all state/row counts to be flushed to disk, so 
reconnecting and then executing the query might allow the optimizer to use the 
latest state/row counts to find the best plan.  Of course, if that's true then 
it seems like we'd have at least a slight contradiction to the behavior 
described in #1 above, where updated stats/row counts are causing the test to 
_fail_...

Note that if the "select count(*)" queries are included in the test, then the 
test will fail regardless of whether or not we shutdown/reconnect.

The interaction between these behaviors and the noTimeout flag can be 
summarized as follows.

Let "TO" correspond to a noTimeout property that is TRUE.
  Then "!TO" is a noTimeout property that is FALSE (the
  default).

Let "SC" correspond to the presence of "select count(*)"
 queries in the test.  Then "!SC" means that such queries
 are NOT included.

Let "SH" correspond to the presence of shutdown/reconnect
 logic in the test.  Then "!SH" means that such logic is
 is NOT included.

Then we have the following behavior (at least, on my own laptop):

!TO && !SC && !SH ==> Intermittent failure on slower machines.
!TO && !SC &&  SH ==> Test passes regardless of CPU speed.
!TO &&  SC && !SH ==> Test fails regardless of CPU speed.
!TO &&  SC &&  SH ==> Test fails regardless of CPU speed.
 TO && !SC && !SH ==> Test passes regardless of CPU speed.
 TO && !SC &&  SH ==> Test passes regardless of CPU speed.
 TO &&  SC && !SH ==> Test passes regardless of CPU speed.
 TO &&  SC &&  SH ==> Test passes regardless of CPU speed.

Again, I'm not sure how much of this information is relevant info, but since 
these are all things I noticed while investigating the problem, I'm including 
them for the sake of completeness/tracking, in case they'll save someone some 
time in the future...

In the end, since setting noTimeout to TRUE seems like the simplest and most 
reliable way to get the test to pass on a consistent basis, that is what I'm 
proposing as a tempory workaround until further investigation can be carried 
out.  I will post a patch to do that shortly.


> predicatesIntoViews test failure on slow machine in Derby 10.1 branch version 
> 201931
> ------------------------------------------------------------------------------------
>
>          Key: DERBY-407
>          URL: http://issues.apache.org/jira/browse/DERBY-407
>      Project: Derby
>         Type: Test
>   Components: Test
>     Versions: 10.1.1.0
>  Environment: Java Version:    1.4.2_04
> Java Vendor:     Hewlett-Packard Company
> Java home:       /usr/tandem/java/jre
> Java classpath:  
> /home/john/j2ee/derbyTests/derby-10.1-branch/jars/derby.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbytools.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbynet.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbyclient.jar:/home/john/j2ee/derbyTests/common-jars/db2jcc.jar:/home/john/j2ee/derbyTests/common-jars/db2jcc_license_c.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbyTesting.jar:/home/john/j2ee/derbyTests/common-jars/jakarta-oro-2.0.8.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbyLocale_de_DE.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbyLocale_es.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbyLocale_fr.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbyLocale_it.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbyLocale_ja_JP.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbyLocale_ko_KR.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbyLocale_pt_BR.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbyLocale_zh_CN.jar:/home/john/j2ee/derbyTests/derby-10.1-branch/jars/derbyLocale_zh_TW.jar:/usr/tandem/java/lib/classes.zip:/usr/tandem/javaextv20/lib/tdmext.jar:/usr/tandem/jdbcMp/current/lib/sqlmp.jar:/usr/tandem/jdbcMx/current/lib/jdbcMx.jar:.
> OS name:         NONSTOP_KERNEL
> OS architecture: mips
> OS version:      G06
> Java user name:  SUPER.JOHN
> Java user home:  /home/john
> Java user dir:   /home/john/j2ee/derbyTests/derby-10.1-branch/testOutput
> java.specification.name: Java Platform API Specification
> java.specification.version: 1.4
>     Reporter: John Sisson
>  Attachments: derbylang_report.txt
>
> The predicatesIntoViews test (part of the derbylang suite) fails when running 
> the tests on a busy/slow system, but they pass when I run them on my fast 
> Windows box.
> It appears that the failure could be due to different optimizations being 
> performed due to the difference in speed of the machines.  This needs to be 
> confirmed that it is the case of the test failure.  I will attach the test 
> summary to this issue.
> I also noticed that the test report said 0% fail even though one test failed. 
> All derbylang tests passed on Windows XP. 
> Thanks, 
> John 

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

Reply via email to