[ http://issues.apache.org/jira/browse/DERBY-1777?page=all ]
A B updated DERBY-1777:
-----------------------
Attachment: d1777_v1.patch
I ran the ViewerInit program attached to this issue and I hit two different
NPEs. For more details, see the "Details" section below.
The short story is that I was able to determine the cause of the NPEs and have
a patch, d1777_v1.patch, to resolve them. There is, however, another issue
that prevents the ViewerInit program from running to completion (more on that
below). Nonetheless, d1777_v1.patch is at least a step in the right direction
as it corrects the two compile-time NPEs described below.
I ran derbyall on Red Hat Linux with ibm142 and sane jars, and I saw the
following failure in jdbcapi/secureUsers1.sql:
33a34,37
> do_ypcall: clnt_call: RPC: Unable to receive; errno = Connection refused
> YPBINDPROC_DOMAIN: Domain not bound
> do_ypcall: clnt_call: RPC: Unable to receive; errno = Connection refused
> YPBINDPROC_DOMAIN: Domain not bound
Test Failed.
When I ran the test standalone it passed against all frameworks, so I'm not
sure what happened. In any event, this does not appear to be related to my
changes.
So I'm posting d1777_v1.patch for review. Despite my efforts I haven't been
able to come up with a test case that can go into derbyall, but I'm still
trying. In the meantime, any comments/feedback would be appreciated.
--------
Details
--------
The first NPE came from BinaryRelationalOperatorNode.getScopedOperand() and was
caused by the fact that, when scoping a predicate for pushing, Derby couldn't
find the target result column to which the scoped predicate was supposed to
point. I confirmed this by running in SANE mode, where instead of an NPE I saw
the following ASSERT FAILURE:
ERROR XJ001: Java exception: 'ASSERT FAILED Failed to locate scope target
result column when trying to scope operand 'ENTITY_TO_PORT.PORT_ID'.:
org.apache.derby.shared.common.sanity.AssertFailure'.
An example query that leads to this NPE/assertion failure is as follows:
    SELECT DISTINCT
        ZONE.ZONE_ID ZONE_ID,
        PORT.PORT_ID PORT_ID,
        ENTITY_TO_PORT.TYPE,
        ENTITY_TO_PORT.PREFIX_ID,
        ENTITY_TO_PORT.ENTITY_ID,
        ENTITY_TO_PORT.DISPLAY_NAME,
        ENTITY_TO_PORT.PORT_DISPLAY_NAME,
        PORT2ZONE.MEMBER_NAME,
        PORT2ZONE.ZONE_MEMBER_ID,
        PORT.PORT_NUMBER
    FROM
        T_RES_ZONE ZONE
        left outer join
            T_VIEW_PORT2ZONE PORT2ZONE
        on
            ZONE.ZONE_ID = PORT2ZONE.ZONE_ID
        left outer join
            T_RES_PORT PORT
        on
            PORT2ZONE.PORT_ID = PORT.PORT_ID
        left outer join
            T_VIEW_ENTITY_TO_PORT ENTITY_TO_PORT
        on
            PORT2ZONE.PORT_ID = ENTITY_TO_PORT.PORT_ID
            and PORT2ZONE.ZONE_ID = ENTITY_TO_PORT.ZONE_ID,
        T_RES_FABRIC FABRIC
    WHERE
        PORT2ZONE.ZONE_ID = ZONE.ZONE_ID
        and ZONE.FABRIC_WWN = FABRIC.FABRIC_WWN
        and FABRIC.FABRIC_ID = ?
When scoping predicates for this query, we run into a situation where the
target result column corresponds to a subquery that has been flattened. Since
the process of flattening a query leads to the creation of "redundant" result
columns, we have to correctly handle the redundant result columns in order to
find the scope target column. That said, the logic for redundant result
columns is in ColumnReference.getSourceResultSet(int[]):
    rcExpr = rc.getExpression();
    colNum[0] = getColumnNumber();
    while ((rcExpr != null) && (rcExpr instanceof ColumnReference))
    {
        colNum[0] = ((ColumnReference)rcExpr).getColumnNumber();
        rc = ((ColumnReference)rcExpr).getSource();
        /* If "rc" is redundant then that means ...
        ...
    }
The thing to note here is that the logic for handling redundant rc's is inside
the "while" loop. This leads to an edge case that the above code won't catch:
namely, if the original "rc" as it exists BEFORE we enter the "while" loop is
redundant, we'll only execute the redundancy logic IF rcExpr is an instance of
ColumnReference. But there's no guarantee that rcExpr will actually be a
ColumnReference--and if it's not, we'll incorrectly skip the logic for handling
the redundant rc. That in turn means we'll be unable to find the actual source
result set, and thus the method will return null, leading to the
above-mentioned assertion failure/NPE.
To fix this, I made a small change to ensure that the redundancy logic always
gets executed if rc is redundant:
-    while ((rcExpr != null) && (rcExpr instanceof ColumnReference))
+    /* We have to make sure we enter this loop if rc is redundant,
+     * so that we can navigate down to the actual source result
+     * set (DERBY-1777). If rc *is* redundant, then rcExpr is not
+     * guaranteed to be a ColumnReference, so we have to check
+     * for that case inside the loop.
+     */
+    while ((rcExpr != null) &&
+        (rc.isRedundant() || (rcExpr instanceof ColumnReference)))
     {
-        colNum[0] = ((ColumnReference)rcExpr).getColumnNumber();
-        rc = ((ColumnReference)rcExpr).getSource();
+        if (rcExpr instanceof ColumnReference)
+        {
+            colNum[0] = ((ColumnReference)rcExpr).getColumnNumber();
+            rc = ((ColumnReference)rcExpr).getSource();
+        }
         /* If "rc" is redundant then that means ...
         ...
     }
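
To make the difference between the two loop guards concrete, here is a small
standalone sketch. It is not Derby code: the Col and Ref classes are
hypothetical stand-ins for a result column and a column reference, and the
sketch only models whether the loop body (and therefore the redundancy logic
inside it) is reached for the initial "rc". What the redundancy logic itself
does is deliberately left out.

```java
class LoopGuardSketch {
    /** Hypothetical stand-in for a result column; NOT Derby's ResultColumn. */
    static final class Col {
        final boolean redundant;
        final Object expr; // a Ref, some other expression object, or null

        Col(boolean redundant, Object expr) {
            this.redundant = redundant;
            this.expr = expr;
        }
    }

    /** Hypothetical stand-in for a column reference that points at a source column. */
    static final class Ref {
        final Col source;

        Ref(Col source) {
            this.source = source;
        }
    }

    /** Original guard: the loop body is entered only if the expression is a Ref. */
    static boolean entersLoopOld(Col rc) {
        Object rcExpr = rc.expr;
        return (rcExpr != null) && (rcExpr instanceof Ref);
    }

    /** Patched guard: a redundant column also enters the loop body. */
    static boolean entersLoopNew(Col rc) {
        Object rcExpr = rc.expr;
        return (rcExpr != null) && (rc.redundant || (rcExpr instanceof Ref));
    }

    public static void main(String[] args) {
        // A redundant column whose expression is something other than a Ref:
        // this is the edge case described in the text.
        Col edgeCase = new Col(true, "some non-reference expression");
        System.out.println("old guard enters loop: " + entersLoopOld(edgeCase));
        System.out.println("new guard enters loop: " + entersLoopNew(edgeCase));
    }
}
```

For a non-redundant column the two guards agree, so under these assumptions
the patched condition only changes behavior for the redundant case.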
Once this change was made, the first NPE went away and the ViewerInit program
ran a little longer, then failed with a second NPE. As it turns out, the
second NPE is intermittent and very time-sensitive. When it happens, the
failure occurs because the "outerCost" field that is passed to a query subtree
from OptimizerImpl.costPermutation() is null:
    /*
    ** Get the cost of the outer plan so far. This gives us the current
    ** estimated rows, ordering, etc.
    */
    CostEstimate outerCost;
    if (joinPosition == 0)
    {
        outerCost = outermostCostEstimate;
    }
    else
    {
        /*
        ** NOTE: This is somewhat problematic. We assume here that the
        ** outer cost from the best access path for the outer table
        ** is OK to use even when costing the sort avoidance path for
        ** the inner table. This is probably OK, since all we use
        ** from the outer cost is the row count.
        */
        outerCost =
            optimizableList.getOptimizable(
                proposedJoinOrder[joinPosition - 1]).
                    getBestAccessPath().getCostEstimate();
    }
At this point we expect outerCost to be non-null, but it turns out that there's
a bug elsewhere in the code that leads to a null outerCost here, which is then
passed down the tree:
    /* Cost the optimizable at the current join position */
    optimizable.optimizeIt(this,
                           predicateList,
                           outerCost,
                           currentRowOrdering);
Any attempts to access outerCost further down the tree will then result in an
NPE.
An example of a query that (intermittently) shows this NPE (against the "Aperi"
database):
    SELECT DISTINCT
        ''server:'' || CAST(HOST2PORT.HOST_ID as CHAR) ENTITY_KEY,
        PORT2ZONE.ZONE_ID ZONE_ID
    FROM
        T_VIEW_VHOST2PORT HOST2PORT,
        T_VIEW_PORT2ZONE PORT2ZONE
    WHERE
        HOST2PORT.HOST_ID = ?
        and HOST2PORT.PORT_ID = PORT2ZONE.PORT_ID
The actual bug is in the getNextPermutation() method of the same class
(OptimizerImpl):
    // If we were in the middle of a join order when this
    // happened, then reset the join order before jumping.
    // The call to rewindJoinOrder() here will put joinPosition
    // back to 0. But that said, we'll then end up incrementing
    // joinPosition before we start looking for the next join
    // order (see below), which means we need to set it to -1
    // here so that it gets incremented to "0" and then
    // processing can continue as normal from there. Note:
    // we don't need to set reloadBestPlan to true here
    // because we only get here if we have *not* found a
    // best plan yet.
    if (joinPosition > 0)
    {
        rewindJoinOrder();
        joinPosition = -1;
    }
The problem with this code is that it only rewinds the join order if
joinPosition is GREATER than 0--but a joinPosition that is EQUAL to zero
indicates that we're "in the middle of a join order", as well, and thus we need
to rewind in that case, too. If we don't rewind, we can end up with an invalid
join order and that indirectly leads to the NPE mentioned above.
As a brief example, assume we have an optimizable list with two Optimizables in
it, O1 and O2. Let's also assume that we've just finished optimizing the first
one. So the current join order will be [O1, -].
Then a timeout occurs, so we enter the block of code in which the above "if"
statement sits. At that point joinPosition will be "0" because we just found
the best cost for the first optimizable and we haven't incremented joinPosition
yet. We'll then "jump" to what we think is going to be the best join order,
which we call "firstLookOrder" (see the code for more details). Let's assume
firstLookOrder is [O2,O1]. Now, because joinPosition is 0 we won't enter the
above "if" block and thus we will NOT rewind the join order. So we'll then
increment joinPosition to "1" and we'll choose the optimizable at
firstLookOrder[joinPosition] as the next one in the current join order.
firstLookOrder[1] returns optimizable "O1", which means that, since we didn't
"rewind" the join order, our new current join order becomes [O1, O1]--which is
not a valid join order.
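
The two-optimizable example above can be traced with a small standalone
sketch. This is not the OptimizerImpl code: the method name, the UNPLACED
marker, and the assumption that rewinding simply clears every placed position
are all hypothetical simplifications made for illustration. Optimizables are
identified by index (0 for O1, 1 for O2).

```java
import java.util.Arrays;

class JoinOrderJumpSketch {
    /** Hypothetical marker for a join position that has no optimizable yet. */
    static final int UNPLACED = -1;

    /**
     * Models the "jump" to firstLookOrder after a timeout and returns the
     * join order after the next optimizable has been placed.
     * rewindAtZero == false models the original "joinPosition > 0" check;
     * rewindAtZero == true models the patched "joinPosition >= 0" check.
     */
    static int[] placeNextAfterTimeout(int[] currentOrder, int joinPosition,
                                       int[] firstLookOrder, boolean rewindAtZero) {
        int[] order = currentOrder.clone();
        boolean rewind = rewindAtZero ? (joinPosition >= 0) : (joinPosition > 0);
        if (rewind) {
            // Assumed effect of rewinding: every placed position is cleared.
            Arrays.fill(order, UNPLACED);
            joinPosition = -1;
        }
        // joinPosition is incremented before the next optimizable is chosen.
        joinPosition++;
        order[joinPosition] = firstLookOrder[joinPosition];
        return order;
    }

    public static void main(String[] args) {
        int[] current = {0, UNPLACED}; // [O1, -]
        int[] firstLook = {1, 0};      // [O2, O1]
        // Original check: no rewind at joinPosition 0, so O1 is placed twice.
        System.out.println(Arrays.toString(
            placeNextAfterTimeout(current, 0, firstLook, false))); // [0, 0]
        // Patched check: rewind first, then start over with O2 at position 0.
        System.out.println(Arrays.toString(
            placeNextAfterTimeout(current, 0, firstLook, true)));  // [1, -1]
    }
}
```

With the original check the result is [0, 0], i.e. [O1, O1]; with the patched
check the order restarts as [O2, -] and can then be completed normally.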
The reason this leads to an NPE is that whenever an optimizable is placed, the
best cost estimate for that optimizable is set to null. Thus when we place O1
at position "1" we set its best access path's cost estimate to null. Then
later, when we get to the costPermutation() code shown above, we take the best
cost of the optimizable at position "0" and use that as the "outerCost" for the
optimizable at position "1". But in this case those two optimizables are the
SAME--namely, O1. So we effectively nulled out O1's best cost, then we used
that very same (null) cost as the "outerCost" for optimizing O1. When that
outerCost is eventually referenced later, we end up with the NPE.
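
The chain of events that produces the null outerCost can be sketched as
follows. Again, this is a simplified model and not Derby code: the Opt class,
the use of a Double for the cost, and the fixed cost value of 10.0 are
hypothetical. The only behavior taken from the description above is that
placing an optimizable clears its best cost, and that the outer cost for
position N is read from the optimizable at position N - 1.

```java
class OuterCostSketch {
    /** Hypothetical stand-in for an optimizable with a best cost estimate. */
    static final class Opt {
        final String name;
        Double bestCost; // null until the optimizable has been costed

        Opt(String name) {
            this.name = name;
        }
    }

    /** Placing an optimizable at a join position resets its best cost to null. */
    static void place(Opt[] opts, int[] order, int position) {
        opts[order[position]].bestCost = null;
    }

    /** The outer cost for a position is the best cost of the previous position. */
    static Double outerCostFor(Opt[] opts, int[] order, int position) {
        return opts[order[position - 1]].bestCost;
    }

    /** Places and costs position 0, places position 1, then reads the outer cost. */
    static Double outerCostAtPositionOne(int[] order) {
        Opt[] opts = { new Opt("O1"), new Opt("O2") };
        place(opts, order, 0);
        opts[order[0]].bestCost = 10.0; // position 0 has been costed (made-up value)
        place(opts, order, 1);          // clears the cost of whatever sits at position 1
        return outerCostFor(opts, order, 1);
    }

    public static void main(String[] args) {
        // Invalid order [O1, O1]: placing O1 again wipes the cost we then read.
        System.out.println(outerCostAtPositionOne(new int[] {0, 0})); // null
        // Valid order [O2, O1]: the outer cost comes from O2 and is intact.
        System.out.println(outerCostAtPositionOne(new int[] {1, 0})); // 10.0
    }
}
```

In this model any later dereference of the null result corresponds to the NPE
seen further down the tree.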
All of that said, note that the above "if" statement is only executed in
situations where we have an optimizer timeout at a very particular point during
optimization. This is why the NPE is intermittent, and it also explains why
it won't reproduce if optimizer timeout is disabled.
The fix for this NPE is a one-line change (plus relevant comment updates):
-    if (joinPosition > 0)
+    /* If we already assigned at least one position in the
+     * join order when this happened (i.e. if joinPosition
+     * is greater than *or equal* to zero; DERBY-1777), then
+     * reset the join order before jumping. The call to
+     * rewindJoinOrder() here will put joinPosition back
+     * to 0. But that said, we'll then end up incrementing
+     * joinPosition before we start looking for the next
+     * join order (see below), which means we need to set
+     * it to -1 here so that it gets incremented to "0" and
+     * then processing can continue as normal from there.
+     * Note: we don't need to set reloadBestPlan to true
+     * here because we only get here if we have *not* found
+     * a best plan yet.
+     */
+    if (joinPosition >= 0)
I'm attaching a patch, d1777_v1.patch, that makes these two changes to resolve
the NPE's discussed here. Note, though, that I still need to add an
appropriate test case to derbyall. This test case will only be for the first
NPE; the second NPE is timing-dependent and will not reproduce with "noTimeout"
set to true, so I don't think we'll have a test case for that one.
Also note: with d1777_v1.patch applied, the repro program attached to this Jira
still will not run without error (sigh). The two NPE's disappear and the test
program gets to the "L2" queries, but at that point the queries take a very
(very) long time to compile and then fail with an ASSERT failure at execution
time:
org.apache.derby.shared.common.sanity.AssertFailure: ASSERT FAILED
sourceResultSetNumber expected to be >= 0 for SWITCH.SWITCH_ID
That is (of course) with SANE jars; I don't know what that translates into for
INSANE jars because I haven't had the time to re-run the queries with insane
jars. It could end up being an execution-time (as opposed to a compile-time)
NPE but I don't know that for sure. I'm still investigating.
> Regression: query works in 10.1.2.1 but fails with NullPointerException in
> 10.2.1.1
> -----------------------------------------------------------------------------------
>
> Key: DERBY-1777
> URL: http://issues.apache.org/jira/browse/DERBY-1777
> Project: Derby
> Issue Type: Bug
> Environment: WinXP SP2 dualcore 2.8 GHz 2 GBmemory
> Reporter: Prasenjit Sarkar
> Assigned To: A B
> Fix For: 10.2.1.0
>
> Attachments: Aperi.zip, d1777_v1.patch, Derby1777.zip
>
>
> However, here's a query that works in 10.1.2.1 but not in 10.2.1.1 --
> database can be assumed to be the same in Derby - 1205
> SELECT DISTINCT
> ZONE.ZONE_ID ZONE_ID,
> PORT.PORT_ID PORT_ID,
> ENTITY_TO_PORT.TYPE,
> ENTITY_TO_PORT.PREFIX_ID,
> ENTITY_TO_PORT.ENTITY_ID,
> ENTITY_TO_PORT.DISPLAY_NAME,
> ENTITY_TO_PORT.PORT_DISPLAY_NAME,
> PORT2ZONE.MEMBER_NAME,
> PORT2ZONE.ZONE_MEMBER_ID,
> PORT.PORT_NUMBER
> FROM
> T_RES_ZONE ZONE left outer join T_VIEW_PORT2ZONE PORT2ZONE on
> ZONE.ZONE_ID = PORT2ZONE.ZONE_ID left outer join T_RES_PORT PORT on
> PORT2ZONE.PORT_ID = PORT.PORT_ID left outer join T_VIEW_ENTITY_TO_PORT
> ENTITY_TO_PORT on
> PORT2ZONE.PORT_ID = ENTITY_TO_PORT.PORT_ID and
> PORT2ZONE.ZONE_ID = ENTITY_TO_PORT.ZONE_ID, T_RES_FABRIC FABRIC
> WHERE PORT2ZONE.ZONE_ID = ZONE.ZONE_ID and
> ZONE.FABRIC_WWN = FABRIC.FABRIC_WWN and
> FABRIC.FABRIC_ID = 1
> Same db as before.
> In 10.2.1.1 it gives the following error (should this be a new issue?)
> java.sql.SQLException: DERBY SQL error: SQLCODE: -1, SQLSTATE: XJ001,
> SQLERRMC: java.lang.NullPointerExceptionXJ001.U
> at org.apache.derby.client.am.SQLExceptionFactory.getSQLException(Unknown
> Source)
> at org.apache.derby.client.am.SqlException.getSQLException(Unknown Source)
> at org.apache.derby.client.am.Connection.prepareStatement(Unknown Source)
> at
> org.eclipse.aperi.server.guireq.topology.views.ViewerSanL1.init(ViewerSanL1.java:1828)
>
> at
> org.eclipse.aperi.server.guireq.topology.views.ViewerInit.init(ViewerInit.java:41)
>
> at
> org.eclipse.aperi.server.guireq.topology.views.ViewerInit.main(ViewerInit.java:69)
>
> Caused by: org.apache.derby.client.am.SqlException: DERBY SQL error: SQLCODE:
> -1, SQLSTATE: XJ001, SQLERRMC: java.lang.NullPointerExceptionXJ001.U
> at org.apache.derby.client.am.Statement.completeSqlca(Unknown Source)
> at org.apache.derby.client.net.NetStatementReply.parsePrepareError(Unknown
> Source)
> at org.apache.derby.client.net.NetStatementReply.parsePRPSQLSTTreply(Unknown
> Source)
> at
> org.apache.derby.client.net.NetStatementReply.readPrepareDescribeOutput(Unknown
> Source)
> at
> org.apache.derby.client.net.StatementReply.readPrepareDescribeOutput(Unknown
> Source)
> at
> org.apache.derby.client.net.NetStatement.readPrepareDescribeOutput_(Unknown
> Source)
> at org.apache.derby.client.am.Statement.readPrepareDescribeOutput(Unknown
> Source)
> at
> org.apache.derby.client.am.PreparedStatement.readPrepareDescribeInputOutput(Unknown
> Source)
> at
> org.apache.derby.client.am.PreparedStatement.flowPrepareDescribeInputOutput(Unknown
> Source)
> at org.apache.derby.client.am.PreparedStatement.prepare(Unknown Source)
> at org.apache.derby.client.am.Connection.prepareStatementX(Unknown Source)
> ... 4 more