[jira] [Commented] (DERBY-5552) Derby threads hanging when using ClientXADataSource and a deadlock or lock timeout occurs

2011-12-29 Thread Brett Bergquist (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/DERBY-5552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13177292#comment-13177292
 ] 

Brett Bergquist commented on DERBY-5552:


I have found the cause of the problem.  When a lock timeout or deadlock is 
detected, the server calls XATransactionState.cleanupOnError.   This looks like:


public void cleanupOnError(Throwable t) {

if (t instanceof StandardException) {

StandardException se = (StandardException) t;

if (se.getSeverity() = ExceptionSeverity.SESSION_SEVERITY) {
popMe();
return;
}

if (se.getSeverity() == 
ExceptionSeverity.TRANSACTION_SEVERITY) {

synchronized (this) {
// disable use of the connection until 
it is cleaned up.
conn.setApplicationConnection(null);
notifyAll();
associationState = TRO_FAIL;
if 
(SQLState.DEADLOCK.equals(se.getMessageId()))
rollbackOnlyCode = 
XAException.XA_RBDEADLOCK;
else if 
(SQLState.LOCK_TIMEOUT.equals(se.getMessageId()))
rollbackOnlyCode = 
XAException.XA_RBTIMEOUT;
else
rollbackOnlyCode = 
XAException.XA_RBOTHER;
}
}
}
}

The problem is the line of code:

conn.setApplicationConnection(null);

The problem that occurs is on the client side, when the SQLException is 
received, it ends up calling Sqlca.getMessage() to retrieve the formatted 
exception message.  This makes a call back down to the server on the connection 
and ends up calling EmbedStatement.checkStatus() and now the EmbedConnection 
has a null applicationConnection and a noCurrentConnection is throw.   DRDA 
code that receives this exception in processing of Sqlca.getMessage() 
determines that there is a protocol error and disconnects from the server.

The XA transaction that was in process never has end called on it and the XA 
transaction on the client side is now lost.  Derby now has a XA transaction 
that will never end causing all kinds of havoc such as logging all new 
transactions in case the one lost ever does get rolled back.  The file system 
fill up with transaction logs, restarting the database engine takes days, etc.

I have commented out the above line and now the proper lock error is actually 
reported at the client.  I don't know if there are any ramifications of doing 
so at this point however.


 Derby threads hanging when using ClientXADataSource and a deadlock or lock 
 timeout occurs
 -

 Key: DERBY-5552
 URL: https://issues.apache.org/jira/browse/DERBY-5552
 Project: Derby
  Issue Type: Bug
  Components: Network Server
Affects Versions: 10.8.1.2
 Environment: Solaris 10, Glassfish V2.1.1,
Reporter: Brett Bergquist
Priority: Blocker
 Attachments: appserverstack.txt, client.tar.Z, derby.log, 
 derbystackatshutdown.txt, execute.patch, transactionsleft.txt


 The issue arrives when multiple XA transactions are done in parallel and 
 there is either a lock timeout or a lock deadlock detected.  When this 
 happens the connection is leaked in the Glassfish connection pool and the 
 client thread hangs in 
 org.apache.derby.client.netReply.fill(Reply.java:172).  
 Shutting down the app server fails because the thread has a lock in 
 org.apache.derby.client.net.NetConnection40 and another task is calling 
 org.apache.derby.client.ClientPooledConnection.close(ClientPooledConnection.java:214)
  which is waiting for the lock.
 Killing the appsever using kill and then attempting to shutdown Derby 
 network server causes the Network Server to hang.  One of the threads hangs 
 waiting for a lock at 
 org.apache.derby.impl.drda.NeworkServerControlImpl.removeFromSessionTable(NetworkServerControlImpl.java:1525)
  and the main thread has this locked at 
 org.apache.derby.impl.drda.NetworkServerControlImpl.executeWork(NetworkServerControlImpl.java:2242)
  and it itself is waiting for a lock which belongs to a thread that is stuck 
 at 
 org.apache.derby.impl.services.locks.ActiveLock.waitForGrant(ActiveLock.java:118)
  which is in the TIMED_WAITING state.
 Only by killing the Network Server using kill is possible at this point.

I see what might be a problem in the code handling lock timeout exceptions

2011-12-29 Thread Bergquist, Brett
In the code in various places there are calls such as:

// 2 kinds of errors here expected here.  Either container not
// found or could not obtain lock (LOCK_TIMEOUT or DEADLOCK).
//
// It is possible by the time this post commit work gets scheduled
// that the container has been dropped and that the open container
// call will return null - in this case just return assuming no
// work to be done.

if 
(se.getMessageId().equals(SQLState.LOCK_TIMEOUT) ||

se.getMessageId().equals(SQLState.DEADLOCK))


Or


// First try to do the work in the nested transaction. Fail if we can't
// get a lock immediately.
if ( nestedTransaction != null )
{
try {
return updateCurrentValueOnDisk( nestedTransaction, oldValue, 
newValue, false );
}
catch (StandardException se)
{
if ( !se.getMessageId().equals( SQLState.LOCK_TIMEOUT ) ) { 
throw se; }
}


Or

// exception might have occured either container got dropper or 
lock not granted.
// It is possible by the time this post commit work gets scheduled
// that the container has been dropped and that the open container
// call will return null - in this case just return assuming no
// work to be done.

//If this expcetion is because 
lock could not be obtained , work is requeued.
if 
(se.getMessageId().equals(SQLState.LOCK_TIMEOUT) ||

se.getMessageId().equals(SQLState.DEADLOCK))
{
requeue_work = 
true;
}

The problem that I see is that if the property derby.locks.deadlockTrace=true 
is set, then instead of a SQLState.LOCK_TIMEOUT, the code will see a 
SQLState.LOCK_TIMEOUT_LOG.  Note that this is not being checked for in the 
above tests and others as well, so now the code behavior is going to change 
basd on whether the lock tracing is enabled or not.

I think that 99% of the code that is testing for SQLState.LOCK_TIMEOUT should 
also be checking for SQLState.LOCK_TIMEOUT_LOG as well.I only see one place 
in DDLConstantAction where it is explicitly mentioned that 
SQLState.LOCK_TIMEOUT_LOG is not being looked at.




Regression Test Report - Daily 1225233 - Sun DBTG

2011-12-29 Thread Ole . Solberg
[Auto-generated mail]

*Daily* 1225233/2011-12-28 18:00:08 MET

Failed  TestsOK  Skip  Duration   Suite
---
*Jvm: 1.7*
 lin
01471014710 0   .% suitesAll
01515 0   .% jdbcapiAutoLoad
01414 0   .% JDBCDriversAll
01515 0   .% JDBCDriversClient
01414 0   .% JDBCDriversEmbedded
0179179 0   .% derbyall
022 0   .% compatibility
022 0   .% demoSuite
 sol
01471014710 0   .% suitesAll
01515 0   .% jdbcapiAutoLoad
01414 0   .% JDBCDriversAll
01515 0   .% JDBCDriversClient
01414 0   .% JDBCDriversEmbedded
0179179 0   .% derbyall
022 0   .% compatibility
022 0   .% demoSuite
 sol32
01471014710 0   .% suitesAll
01515 0   .% jdbcapiAutoLoad
01414 0   .% JDBCDriversAll
01515 0   .% JDBCDriversClient
01414 0   .% JDBCDriversEmbedded
0179179 0   .% derbyall
022 0   .% compatibility
022 0   .% demoSuite
 vista-64
01471014710 0   .% suitesAll
01515 0   .% jdbcapiAutoLoad
01414 0   .% JDBCDriversAll
01515 0   .% JDBCDriversClient
01414 0   .% JDBCDriversEmbedded
0179179 0   .% derbyall
   NA NA NANA   compatibility
022 0   .% demoSuite
  Details in  
http://dbtg.foundry.sun.com/derby/test/Daily/jvm1.7/testing/Limited/testSummary-1225233.html
 
  Attempted failure analysis in
  
http://dbtg.foundry.sun.com/derby/test/Daily/jvm1.7/FailReports/1225233_bySig.html
 

*Jvm: 1.6*
 lin
01460514605 0   1562.08% suitesAll
01515 0   .% jdbcapiAutoLoad
01414 0   .% JDBCDriversAll
01515 0   .% JDBCDriversClient
01414 0   .% JDBCDriversEmbedded
0179179 041.53% derbyall
022 0   135.03% compatibility
022 0   .% demoSuite
 sles
01460514605 0   977.40% suitesAll
01515 0   .% jdbcapiAutoLoad
01414 0   .% JDBCDriversAll
01515 0   .% JDBCDriversClient
01414 0   .% JDBCDriversEmbedded
0179179 039.33% derbyall
022 084.18% compatibility
022 0   .% demoSuite
 sol
01459314593 0   1128.54% suitesAll
01515 0   .% jdbcapiAutoLoad
01414 0   .% JDBCDriversAll
01515 0   .% JDBCDriversClient
01414 0   .% JDBCDriversEmbedded
0179179 033.38% derbyall
022 090.08% compatibility
022 0   .% demoSuite
 sol32
01459314593 082.72% suitesAll
01515 0   323.07% jdbcapiAutoLoad
01414 0   341.66% JDBCDriversAll
01515 0   315.38% JDBCDriversClient
01414 016.66% JDBCDriversEmbedded
0179179 0   1246.69% derbyall
022 0   19700.00% compatibility
022 0   700.00% demoSuite
 solN+1
01459314593 0   223.05% suitesAll
01515 0   .% jdbcapiAutoLoad
01414 0   .% JDBCDriversAll
01515 0   .% JDBCDriversClient
01414 0   .% JDBCDriversEmbedded
0179179 054.38% derbyall
022 097.89% compatibility
022 0   .% demoSuite
 sparc
01459314593 0   575.06% suitesAll
01515 0   .% jdbcapiAutoLoad
01414 0   .% JDBCDriversAll
01515 0   .% JDBCDriversClient
01414 0   .% JDBCDriversEmbedded
0179179 026.01% derbyall
022 082.33% compatibility
022 0   .% demoSuite
 vista
01460314603 0   188.36% suitesAll
01515 0   .% jdbcapiAutoLoad
01414 0   .% JDBCDriversAll
01515 0   .% JDBCDriversClient
01414 0   .% JDBCDriversEmbedded
0179179 048.89% derbyall
   NA NA NANA   compatibility
022 0   .% demoSuite
 vista-64
01460314603 0   275.06% suitesAll
01515 0   .% jdbcapiAutoLoad
01414 0   .%