[jira] Updated: (DERBY-1777) Regression: query works in 10.1.2.1 but fails with NullPointerException in 10.2.1.1

A B (JIRA) Mon, 11 Sep 2006 15:46:49 -0700

     [ http://issues.apache.org/jira/browse/DERBY-1777?page=all ]


A B updated DERBY-1777:
-----------------------

    Attachment: d1777_v1.patch

I ran the ViewerInit program attached to this issue and I hit two different 
NPE's.  For more details, see the "Details" section below.

The short story is that I was able to determine the cause of the NPEs and have 
a patch, d1777_v1.patch, to resolve them.  There is, however, another issue 
that prevents the ViewerInit program from running to completion (more on that 
below).  Nonetheless, d1777_v1.patch is at least a step in the right direction 
as it corrects the two compile-time NPEs described below.

I ran derbyall on Red Hat Linux with ibm142 and sane jars, and I saw the 
following failure in jdbcapi/secureUsers1.sql:

33a34,37
> do_ypcall: clnt_call: RPC: Unable to receive; errno = Connection refused
> YPBINDPROC_DOMAIN: Domain not bound
> do_ypcall: clnt_call: RPC: Unable to receive; errno = Connection refused
> YPBINDPROC_DOMAIN: Domain not bound
Test Failed.

When I ran the test standalone it passes against all frameworks, so I'm not 
sure what happened.  But in any event this does not appear to be related to my 
changes.

So I'm posting d1777_v1.patch for review.  Despite my efforts I haven't been 
able to come up with a test case that can go into derbyall, but I'm still 
trying.  In the meantime, any comments/feedback would be appreciated.

--------
Details
--------

The first NPE came from BinaryRelationalOperatorNode.getScopedOperand() and was 
caused by the fact that, when scoping a predicate for pushing, Derby couldn't 
find the target result column to which the scoped predicate was supposed to 
point.  I confirmed this by running in SANE mode, where instead of an NPE I saw 
the following ASSERT FAILURE:

ERROR XJ001: Java exception: 'ASSERT FAILED Failed to locate scope target 
result column when trying to scope operand 'ENTITY_TO_PORT.PORT_ID'.: 
org.apache.derby.shared.common.sanity.AssertFailure'.

An example query that leads to this NPE/assertion failure is as follows:

  SELECT DISTINCT

     ZONE.ZONE_ID ZONE_ID,
     PORT.PORT_ID PORT_ID,
     ENTITY_TO_PORT.TYPE,
     ENTITY_TO_PORT.PREFIX_ID,
     ENTITY_TO_PORT.ENTITY_ID,
     ENTITY_TO_PORT.DISPLAY_NAME,
     ENTITY_TO_PORT.PORT_DISPLAY_NAME,
     PORT2ZONE.MEMBER_NAME,
     PORT2ZONE.ZONE_MEMBER_ID,
     PORT.PORT_NUMBER

  FROM

     T_RES_ZONE ZONE
       left outer join
           T_VIEW_PORT2ZONE PORT2ZONE
       on
           ZONE.ZONE_ID = PORT2ZONE.ZONE_ID
     left outer join
           T_RES_PORT PORT
       on
           PORT2ZONE.PORT_ID = PORT.PORT_ID
     left outer join
           T_VIEW_ENTITY_TO_PORT ENTITY_TO_PORT
       on
           PORT2ZONE.PORT_ID = ENTITY_TO_PORT.PORT_ID
           and PORT2ZONE.ZONE_ID = ENTITY_TO_PORT.ZONE_ID,
     T_RES_FABRIC FABRIC

  WHERE

     PORT2ZONE.ZONE_ID = ZONE.ZONE_ID
     and ZONE.FABRIC_WWN = FABRIC.FABRIC_WWN
     and FABRIC.FABRIC_ID = ?

When scoping predicates for this query, we run into a situation where the 
target result column corresponds to a subquery that has been flattened.  Since 
the process of flattening a query leads to the creation of "redundant" result 
columns, we have to correctly handle the redundant result columns in order to 
find the scope target column.  That said, the logic for redundant result 
columns is in ColumnReference.getSourceResultSet(int[]):

        rcExpr = rc.getExpression();
        colNum[0] = getColumnNumber();

        while ((rcExpr != null) && (rcExpr instanceof ColumnReference))
        {
            colNum[0] = ((ColumnReference)rcExpr).getColumnNumber();
            rc = ((ColumnReference)rcExpr).getSource();

            /* If "rc" is redundant then that means ...
            ...
        }

The thing to note here is that the logic for handling redundant rc's is inside 
the "while" loop.  This leads to an edge case that the above code won't catch: 
namely, if the original "rc" as it exists BEFORE we enter the "while" loop is 
redundant, we'll only execute the redundancy logic IF rcExpr is an instance of 
ColumnReference.  But there's no guarantee that rcExpr will actually be a 
ColumnReference--and if it's not, we'll incorrectly skip the logic for handling 
the redundant rc.  That in turn means we'll be unable to find the actual source 
result set, and thus the method will return null, leading to the 
above-mentioned assertion failure/NPE.

To fix this, I made a small change to ensure that the redundancy logic always 
gets executed if rc is redundant:

-        while ((rcExpr != null) && (rcExpr instanceof ColumnReference))
+        /* We have to make sure we enter this loop if rc is redundant,
+         * so that we can navigate down to the actual source result
+         * set (DERBY-1777). If rc *is* redundant, then rcExpr is not
+         * guaranteed to be a ColumnReference, so we have to check
+         * for that case inside the loop.
+         */
+        while ((rcExpr != null) &&
+            (rc.isRedundant() || (rcExpr instanceof ColumnReference)))
         {
-            colNum[0] = ((ColumnReference)rcExpr).getColumnNumber();
-            rc = ((ColumnReference)rcExpr).getSource();
+            if (rcExpr instanceof ColumnReference)
+            {
+                colNum[0] = ((ColumnReference)rcExpr).getColumnNumber();
+                rc = ((ColumnReference)rcExpr).getSource();
+            }

             /* If "rc" is redundant then that means ...
             ...
         }
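
The control-flow change in the diff above can be tried out in isolation with a 
toy model.  The following is only a sketch and is not Derby code: the Column, 
Ref, Wrapper and Base classes, the "RS1" result set name and the findSource() 
method are all hypothetical stand-ins, and the inner "while (rc.redundant)" 
block is an assumed placeholder for the redundancy handling that is elided in 
the excerpt.  It only illustrates why the loop guard has to admit a redundant 
rc whose expression is not a column reference.

```java
class RedundantLoopSketch {

    /** Hypothetical expression node; stands in for Derby's value nodes. */
    interface Expr {}

    /** Toy result column: an expression plus a "redundant" flag. */
    static final class Column {
        final boolean redundant;
        final Expr expr;
        Column(boolean redundant, Expr expr) {
            this.redundant = redundant;
            this.expr = expr;
        }
    }

    /** Plays the role of a column reference: points at a source column. */
    static final class Ref implements Expr {
        final Column source;
        Ref(Column source) { this.source = source; }
    }

    /** Hypothetical non-reference expression that forwards to another column. */
    static final class Wrapper implements Expr {
        final Column target;
        Wrapper(Column target) { this.target = target; }
    }

    /** Terminal expression naming the result set that owns the column. */
    static final class Base implements Expr {
        final String resultSet;
        Base(String resultSet) { this.resultSet = resultSet; }
    }

    /** Returns the owning result set name, or null if it cannot be found. */
    static String findSource(Column rc, boolean patchedGuard) {
        Expr rcExpr = rc.expr;
        // patchedGuard == false mimics the original guard (reference only);
        // patchedGuard == true also enters the loop when rc is redundant.
        while (rcExpr != null
                && ((patchedGuard && rc.redundant) || rcExpr instanceof Ref)) {
            if (rcExpr instanceof Ref) {
                rc = ((Ref) rcExpr).source;
            }
            // ASSUMPTION: placeholder for the elided redundancy handling.
            // Skip over redundant columns until a non-redundant one is found.
            while (rc.redundant) {
                if (rc.expr instanceof Wrapper) {
                    rc = ((Wrapper) rc.expr).target;
                } else if (rc.expr instanceof Ref) {
                    rc = ((Ref) rc.expr).source;
                } else {
                    return null; // not pointing at any result set
                }
            }
            rcExpr = rc.expr;
        }
        return (rcExpr instanceof Base) ? ((Base) rcExpr).resultSet : null;
    }

    /** Edge case: the starting rc is redundant and its expression is not a Ref. */
    static String lookupThroughRedundantColumn(boolean patchedGuard) {
        Column base = new Column(false, new Base("RS1"));
        Column redundant = new Column(true, new Wrapper(base));
        return findSource(redundant, patchedGuard);
    }

    /** Ordinary case: the starting rc reaches the redundant column via a Ref. */
    static String lookupThroughReference(boolean patchedGuard) {
        Column base = new Column(false, new Base("RS1"));
        Column redundant = new Column(true, new Wrapper(base));
        Column start = new Column(false, new Ref(redundant));
        return findSource(start, patchedGuard);
    }

    public static void main(String[] args) {
        System.out.println("original guard: " + lookupThroughRedundantColumn(false));
        System.out.println("patched guard:  " + lookupThroughRedundantColumn(true));
        System.out.println("via reference:  " + lookupThroughReference(false));
    }
}
```

In this model the original guard never enters the loop for the redundant 
starting column and so returns null, while the patched guard reaches the 
non-redundant column.  When the starting column is reached through a reference, 
both guards behave the same, which matches the observation that only the 
"redundant before the loop" case was being missed.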
 
Once this change was made, the first NPE went away and the ViewerInit program 
ran a little longer, then failed with a second NPE.  As it turns out, the 
second NPE is intermittent and very time-sensitive.  When it happens, the 
failure occurs because the "outerCost" field that is passed to a query subtree 
from OptimizerImpl.costPermutation() is null:

        /*
        ** Get the cost of the outer plan so far.  This gives us the current
        ** estimated rows, ordering, etc.
        */
        CostEstimate outerCost;
        if (joinPosition == 0)
        {
            outerCost = outermostCostEstimate;
        }
        else
        {
            /*
            ** NOTE: This is somewhat problematic.  We assume here that the
            ** outer cost from the best access path for the outer table
            ** is OK to use even when costing the sort avoidance path for
            ** the inner table.  This is probably OK, since all we use
            ** from the outer cost is the row count.
            */
            outerCost =
                optimizableList.getOptimizable(
                    proposedJoinOrder[joinPosition - 1]).
                        getBestAccessPath().getCostEstimate();
        }

At this point we expect outerCost to be non-null, but it turns out that there's 
a bug elsewhere in the code that leads to a null outerCost here, which is then 
passed down the tree:

        /* Cost the optimizable at the current join position */
        optimizable.optimizeIt(this,
                               predicateList,
                               outerCost,
                               currentRowOrdering);

Any attempts to access outerCost further down the tree will then result in an 
NPE.

An example of a query that (intermittently) shows this NPE (against the "Aperi" 
database):

  SELECT DISTINCT

     ''server:'' || CAST(HOST2PORT.HOST_ID as CHAR) ENTITY_KEY,
     PORT2ZONE.ZONE_ID ZONE_ID

  FROM

     T_VIEW_VHOST2PORT HOST2PORT,
     T_VIEW_PORT2ZONE PORT2ZONE

  WHERE

     HOST2PORT.HOST_ID = ?
     and HOST2PORT.PORT_ID = PORT2ZONE.PORT_ID


The actual bug is in the getNextPermutation() method of the same class 
(OptimizerImpl):

    // If we were in the middle of a join order when this
    // happened, then reset the join order before jumping.
    // The call to rewindJoinOrder() here will put joinPosition
    // back to 0.  But that said, we'll then end up incrementing 
    // joinPosition before we start looking for the next join
    // order (see below), which means we need to set it to -1
    // here so that it gets incremented to "0" and then
    // processing can continue as normal from there.  Note:
    // we don't need to set reloadBestPlan to true here
    // because we only get here if we have *not* found a
    // best plan yet.
    if (joinPosition > 0)
    {
        rewindJoinOrder();
        joinPosition = -1;
    }

The problem with this code is that it only rewinds the join order if 
joinPosition is GREATER than 0--but a joinPosition that is EQUAL to zero 
indicates that we're "in the middle of a join order", as well, and thus we need 
to rewind in that case, too.  If we don't rewind, we can end up with an invalid 
join order and that indirectly leads to the NPE mentioned above.

As a brief example, assume we have an optimizable list with two Optimizables in 
it, O1 and O2.  Let's also assume that we've just finished optimizing the first 
one.  So the current join order will be [O1, -].

Then a timeout occurs, so we enter the block of code in which the above "if" 
statement sits.  At that point joinPosition will be "0" because we just found 
the best cost for the first optimizable and we haven't incremented joinPosition 
yet.  We'll then "jump" to what we think is going to be the best join order, 
which we call "firstLookOrder" (see the code for more details).  Let's assume 
firstLookOrder is [O2,O1].  Now, because joinPosition is 0 we won't enter the 
above "if" block and thus we will NOT rewind the join order.  So we'll then 
increment joinPosition to "1" and we'll choose the optimizable at 
firstLookOrder[joinPosition] as the next one in the current join order.  
firstLookOrder[1] returns optimizable "O1", which means that, since we didn't 
"rewind" the join order, our new current join order becomes [O1, O1]--which is 
not a valid join order.

The reason this leads to an NPE is that whenever an optimizable is placed, the 
best cost estimate for that optimizable is set to null.  Thus when we place O1 
at position "1" we set it's best access path's cost estimate to null.  Then 
later, when we get to the costPermutation() code shown above, we take the best 
cost of the optimizable at position "0" and use that as the "outerCost" for the 
optimizable at position "1".  But in this  those two optimizables are the 
SAME--namley, O1.  So we effectively nulled out O1's best cost, then we used 
that very same (null) cost as the "outerCost" for optimizing O1.  When that 
outerCost is eventually referenced later, we end up with the NPE.

All of that said, note the above "if" statement is only executed in situations 
where we have an optimizer timeout at a very particular point during 
optimization.  This is why the NPE is intermittent, and it also explains why 
it won't reproduce if optimizer timeout is disabled.

The fix for this NPE is a one-line change (plus relevant comment updates):

-    if (joinPosition > 0)
+    /* If we already assigned at least one position in the
+     * join order when this happened (i.e. if joinPosition
+     * is greater than *or equal* to zero; DERBY-1777), then 
+     * reset the join order before jumping.  The call to
+     * rewindJoinOrder() here will put joinPosition back
+     * to 0.  But that said, we'll then end up incrementing
+     * joinPosition before we start looking for the next
+     * join order (see below), which means we need to set
+     * it to -1 here so that it gets incremented to "0" and
+     * then processing can continue as normal from there.  
+     * Note: we don't need to set reloadBestPlan to true
+     * here because we only get here if we have *not* found
+     * a best plan yet.
+     */
+    if (joinPosition >= 0)
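
The two-optimizable scenario described above (O1 already placed, timeout, jump 
to firstLookOrder [O2,O1]) can be replayed with a small simulation.  This is a 
sketch under stated assumptions, not the OptimizerImpl code: the class, the 
Result type, the jumpAfterTimeout() method and all cost values are made up for 
illustration, optimizables are represented by their indexes (0 for O1, 1 for 
O2), and "-1" marks an unassigned position.  The rewindAtZero flag switches 
between the original "> 0" test and the patched ">= 0" test.

```java
import java.util.Arrays;

class JoinOrderJumpSketch {

    /** Outcome of one simulated jump. */
    static final class Result {
        final int[] joinOrder;          // optimizable index per join position
        final boolean sawNullOuterCost; // true where the real code would hit the NPE
        Result(int[] joinOrder, boolean sawNullOuterCost) {
            this.joinOrder = joinOrder;
            this.sawNullOuterCost = sawNullOuterCost;
        }
    }

    static Result jumpAfterTimeout(boolean rewindAtZero) {
        // State just before the timeout: O1 (index 0) has been placed and
        // costed at position 0, and joinPosition has not been incremented yet.
        int[] joinOrder = {0, -1};
        Double[] bestCost = {10.0, null};   // made-up cost for O1
        int joinPosition = 0;
        int[] firstLookOrder = {1, 0};      // [O2, O1]

        // The condition under discussion: "> 0" (original) or ">= 0" (patched).
        boolean mustRewind = rewindAtZero ? joinPosition >= 0 : joinPosition > 0;
        if (mustRewind) {
            Arrays.fill(joinOrder, -1);     // stand-in for rewinding the join order
            joinPosition = -1;              // incremented to 0 below
        }

        while (joinPosition + 1 < joinOrder.length) {
            joinPosition++;
            int next = firstLookOrder[joinPosition];
            joinOrder[joinPosition] = next;
            bestCost[next] = null;          // placing an optimizable resets its best cost

            // Outer cost: a fixed made-up value at position 0, otherwise the
            // best cost of the optimizable at the previous position.
            Double outerCost = (joinPosition == 0)
                    ? Double.valueOf(1.0)
                    : bestCost[joinOrder[joinPosition - 1]];
            if (outerCost == null) {
                return new Result(joinOrder, true);
            }
            bestCost[next] = outerCost * 2; // made-up costing of the placed optimizable
        }
        return new Result(joinOrder, false);
    }

    public static void main(String[] args) {
        for (boolean patched : new boolean[] {false, true}) {
            Result r = jumpAfterTimeout(patched);
            System.out.println((patched ? "joinPosition >= 0: " : "joinPosition >  0: ")
                    + Arrays.toString(r.joinOrder)
                    + " nullOuterCost=" + r.sawNullOuterCost);
        }
    }
}
```

With the original test the simulated join order ends up as [0, 0], i.e. 
[O1, O1], and the outer cost read for position 1 is the value that was just 
nulled out.  With the patched test the order becomes [1, 0], i.e. [O2, O1], and 
no null outer cost is seen.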

I'm attaching a patch, d1777_v1.patch, that makes these two changes to resolve 
the NPE's discussed here.  Note, though, that I still need to add an 
appropriate test case to derbyall.  This test case will only be for the first 
NPE; the second NPE is timing-dependent and will not reproduce with "noTimeout" 
set to true, so I don't think we'll have a test case for that one.

Also note: with d1777_v1.patch applied, the repro program attached to this Jira 
still will not run without error (sigh).  The two NPE's disappear and the test 
program gets to the "L2" queries, but at that point the queries take a very 
(very) long time to compile and then fail with an ASSERT failure at execution 
time:

org.apache.derby.shared.common.sanity.AssertFailure: ASSERT FAILED 
sourceResultSetNumber expected to be >= 0 for SWITCH.SWITCH_ID

That is (of course) with SANE jars; I don't know what that translates into for 
INSANE jars because I haven't had the time to re-run the queries with insane 
jars.  It could end up being an execution-time (as opposed to a compile-time) 
NPE but I don't know that for sure.  I'm still investigating.

> Regression: query works in 10.1.2.1 but fails with NullPointerException in 
> 10.2.1.1
> -----------------------------------------------------------------------------------
>
>                 Key: DERBY-1777
>                 URL: http://issues.apache.org/jira/browse/DERBY-1777
>             Project: Derby
>          Issue Type: Bug
>         Environment: WinXP SP2 dualcore 2.8 GHz 2 GBmemory
>            Reporter: Prasenjit Sarkar
>         Assigned To: A B
>             Fix For: 10.2.1.0
>
>         Attachments: Aperi.zip, d1777_v1.patch, Derby1777.zip
>
>
> However, here's a query that works in 10.1.2.1 but not in 10.2.1.1  -- 
> database can be assumed to be the same in Derby - 1205
> SELECT DISTINCT 
> ZONE.ZONE_ID ZONE_ID, 
> PORT.PORT_ID PORT_ID, 
> ENTITY_TO_PORT.TYPE, 
> ENTITY_TO_PORT.PREFIX_ID, 
> ENTITY_TO_PORT.ENTITY_ID, 
> ENTITY_TO_PORT.DISPLAY_NAME, 
> ENTITY_TO_PORT.PORT_DISPLAY_NAME, 
> PORT2ZONE.MEMBER_NAME, 
> PORT2ZONE.ZONE_MEMBER_ID, 
> PORT.PORT_NUMBER 
> FROM 
> T_RES_ZONE ZONE left outer join T_VIEW_PORT2ZONE PORT2ZONE on 
> ZONE.ZONE_ID = PORT2ZONE.ZONE_ID left outer join T_RES_PORT PORT on 
> PORT2ZONE.PORT_ID = PORT.PORT_ID left outer join T_VIEW_ENTITY_TO_PORT 
> ENTITY_TO_PORT on 
> PORT2ZONE.PORT_ID = ENTITY_TO_PORT.PORT_ID and 
> PORT2ZONE.ZONE_ID = ENTITY_TO_PORT.ZONE_ID, T_RES_FABRIC FABRIC 
> WHERE PORT2ZONE.ZONE_ID = ZONE.ZONE_ID and 
> ZONE.FABRIC_WWN = FABRIC.FABRIC_WWN and 
> FABRIC.FABRIC_ID = 1 
> Same db as before. 
> In 10.2.1.1 it gives the following error (should this be a new issue?) 
> java.sql.SQLException: DERBY SQL error: SQLCODE: -1, SQLSTATE: XJ001, 
> SQLERRMC: java.lang.NullPointerExceptionXJ001.U 
> at org.apache.derby.client.am.SQLExceptionFactory.getSQLException(Unknown 
> Source) 
> at org.apache.derby.client.am.SqlException.getSQLException(Unknown Source) 
> at org.apache.derby.client.am.Connection.prepareStatement(Unknown Source) 
> at 
> org.eclipse.aperi.server.guireq.topology.views.ViewerSanL1.init(ViewerSanL1.java:1828)
>  
> at 
> org.eclipse.aperi.server.guireq.topology.views.ViewerInit.init(ViewerInit.java:41)
>  
> at 
> org.eclipse.aperi.server.guireq.topology.views.ViewerInit.main(ViewerInit.java:69)
>  
> Caused by: org.apache.derby.client.am.SqlException: DERBY SQL error: SQLCODE: 
> -1, SQLSTATE: XJ001, SQLERRMC: java.lang.NullPointerExceptionXJ001.U 
> at org.apache.derby.client.am.Statement.completeSqlca(Unknown Source) 
> at org.apache.derby.client.net.NetStatementReply.parsePrepareError(Unknown 
> Source) 
> at org.apache.derby.client.net.NetStatementReply.parsePRPSQLSTTreply(Unknown 
> Source) 
> at 
> org.apache.derby.client.net.NetStatementReply.readPrepareDescribeOutput(Unknown
>  Source) 
> at 
> org.apache.derby.client.net.StatementReply.readPrepareDescribeOutput(Unknown 
> Source) 
> at 
> org.apache.derby.client.net.NetStatement.readPrepareDescribeOutput_(Unknown 
> Source) 
> at org.apache.derby.client.am.Statement.readPrepareDescribeOutput(Unknown 
> Source) 
> at 
> org.apache.derby.client.am.PreparedStatement.readPrepareDescribeInputOutput(Unknown
>  Source) 
> at 
> org.apache.derby.client.am.PreparedStatement.flowPrepareDescribeInputOutput(Unknown
>  Source) 
> at org.apache.derby.client.am.PreparedStatement.prepare(Unknown Source) 
> at org.apache.derby.client.am.Connection.prepareStatementX(Unknown Source) 
> ... 4 more 

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
