[jira] [Commented] (DERBY-5552) Derby threads hanging when using ClientXADataSource and a deadlock or lock timeout occurs
[ https://issues.apache.org/jira/browse/DERBY-5552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13177292#comment-13177292 ] Brett Bergquist commented on DERBY-5552: I have found the cause of the problem. When a lock timeout or deadlock is detected, the server calls XATransactionState.cleanupOnError. This looks like: public void cleanupOnError(Throwable t) { if (t instanceof StandardException) { StandardException se = (StandardException) t; if (se.getSeverity() = ExceptionSeverity.SESSION_SEVERITY) { popMe(); return; } if (se.getSeverity() == ExceptionSeverity.TRANSACTION_SEVERITY) { synchronized (this) { // disable use of the connection until it is cleaned up. conn.setApplicationConnection(null); notifyAll(); associationState = TRO_FAIL; if (SQLState.DEADLOCK.equals(se.getMessageId())) rollbackOnlyCode = XAException.XA_RBDEADLOCK; else if (SQLState.LOCK_TIMEOUT.equals(se.getMessageId())) rollbackOnlyCode = XAException.XA_RBTIMEOUT; else rollbackOnlyCode = XAException.XA_RBOTHER; } } } } The problem is the line of code: conn.setApplicationConnection(null); The problem that occurs is on the client side, when the SQLException is received, it ends up calling Sqlca.getMessage() to retrieve the formatted exception message. This makes a call back down to the server on the connection and ends up calling EmbedStatement.checkStatus() and now the EmbedConnection has a null applicationConnection and a noCurrentConnection is throw. DRDA code that receives this exception in processing of Sqlca.getMessage() determines that there is a protocol error and disconnects from the server. The XA transaction that was in process never has end called on it and the XA transaction on the client side is now lost. Derby now has a XA transaction that will never end causing all kinds of havoc such as logging all new transactions in case the one lost ever does get rolled back. The file system fill up with transaction logs, restarting the database engine takes days, etc. I have commented out the above line and now the proper lock error is actually reported at the client. I don't know if there are any ramifications of doing so at this point however. Derby threads hanging when using ClientXADataSource and a deadlock or lock timeout occurs - Key: DERBY-5552 URL: https://issues.apache.org/jira/browse/DERBY-5552 Project: Derby Issue Type: Bug Components: Network Server Affects Versions: 10.8.1.2 Environment: Solaris 10, Glassfish V2.1.1, Reporter: Brett Bergquist Priority: Blocker Attachments: appserverstack.txt, client.tar.Z, derby.log, derbystackatshutdown.txt, execute.patch, transactionsleft.txt The issue arrives when multiple XA transactions are done in parallel and there is either a lock timeout or a lock deadlock detected. When this happens the connection is leaked in the Glassfish connection pool and the client thread hangs in org.apache.derby.client.netReply.fill(Reply.java:172). Shutting down the app server fails because the thread has a lock in org.apache.derby.client.net.NetConnection40 and another task is calling org.apache.derby.client.ClientPooledConnection.close(ClientPooledConnection.java:214) which is waiting for the lock. Killing the appsever using kill and then attempting to shutdown Derby network server causes the Network Server to hang. One of the threads hangs waiting for a lock at org.apache.derby.impl.drda.NeworkServerControlImpl.removeFromSessionTable(NetworkServerControlImpl.java:1525) and the main thread has this locked at org.apache.derby.impl.drda.NetworkServerControlImpl.executeWork(NetworkServerControlImpl.java:2242) and it itself is waiting for a lock which belongs to a thread that is stuck at org.apache.derby.impl.services.locks.ActiveLock.waitForGrant(ActiveLock.java:118) which is in the TIMED_WAITING state. Only by killing the Network Server using kill is possible at this point.
I see what might be a problem in the code handling lock timeout exceptions
In the code in various places there are calls such as: // 2 kinds of errors here expected here. Either container not // found or could not obtain lock (LOCK_TIMEOUT or DEADLOCK). // // It is possible by the time this post commit work gets scheduled // that the container has been dropped and that the open container // call will return null - in this case just return assuming no // work to be done. if (se.getMessageId().equals(SQLState.LOCK_TIMEOUT) || se.getMessageId().equals(SQLState.DEADLOCK)) Or // First try to do the work in the nested transaction. Fail if we can't // get a lock immediately. if ( nestedTransaction != null ) { try { return updateCurrentValueOnDisk( nestedTransaction, oldValue, newValue, false ); } catch (StandardException se) { if ( !se.getMessageId().equals( SQLState.LOCK_TIMEOUT ) ) { throw se; } } Or // exception might have occured either container got dropper or lock not granted. // It is possible by the time this post commit work gets scheduled // that the container has been dropped and that the open container // call will return null - in this case just return assuming no // work to be done. //If this expcetion is because lock could not be obtained , work is requeued. if (se.getMessageId().equals(SQLState.LOCK_TIMEOUT) || se.getMessageId().equals(SQLState.DEADLOCK)) { requeue_work = true; } The problem that I see is that if the property derby.locks.deadlockTrace=true is set, then instead of a SQLState.LOCK_TIMEOUT, the code will see a SQLState.LOCK_TIMEOUT_LOG. Note that this is not being checked for in the above tests and others as well, so now the code behavior is going to change basd on whether the lock tracing is enabled or not. I think that 99% of the code that is testing for SQLState.LOCK_TIMEOUT should also be checking for SQLState.LOCK_TIMEOUT_LOG as well.I only see one place in DDLConstantAction where it is explicitly mentioned that SQLState.LOCK_TIMEOUT_LOG is not being looked at.
Regression Test Report - Daily 1225233 - Sun DBTG
[Auto-generated mail] *Daily* 1225233/2011-12-28 18:00:08 MET Failed TestsOK Skip Duration Suite --- *Jvm: 1.7* lin 01471014710 0 .% suitesAll 01515 0 .% jdbcapiAutoLoad 01414 0 .% JDBCDriversAll 01515 0 .% JDBCDriversClient 01414 0 .% JDBCDriversEmbedded 0179179 0 .% derbyall 022 0 .% compatibility 022 0 .% demoSuite sol 01471014710 0 .% suitesAll 01515 0 .% jdbcapiAutoLoad 01414 0 .% JDBCDriversAll 01515 0 .% JDBCDriversClient 01414 0 .% JDBCDriversEmbedded 0179179 0 .% derbyall 022 0 .% compatibility 022 0 .% demoSuite sol32 01471014710 0 .% suitesAll 01515 0 .% jdbcapiAutoLoad 01414 0 .% JDBCDriversAll 01515 0 .% JDBCDriversClient 01414 0 .% JDBCDriversEmbedded 0179179 0 .% derbyall 022 0 .% compatibility 022 0 .% demoSuite vista-64 01471014710 0 .% suitesAll 01515 0 .% jdbcapiAutoLoad 01414 0 .% JDBCDriversAll 01515 0 .% JDBCDriversClient 01414 0 .% JDBCDriversEmbedded 0179179 0 .% derbyall NA NA NANA compatibility 022 0 .% demoSuite Details in http://dbtg.foundry.sun.com/derby/test/Daily/jvm1.7/testing/Limited/testSummary-1225233.html Attempted failure analysis in http://dbtg.foundry.sun.com/derby/test/Daily/jvm1.7/FailReports/1225233_bySig.html *Jvm: 1.6* lin 01460514605 0 1562.08% suitesAll 01515 0 .% jdbcapiAutoLoad 01414 0 .% JDBCDriversAll 01515 0 .% JDBCDriversClient 01414 0 .% JDBCDriversEmbedded 0179179 041.53% derbyall 022 0 135.03% compatibility 022 0 .% demoSuite sles 01460514605 0 977.40% suitesAll 01515 0 .% jdbcapiAutoLoad 01414 0 .% JDBCDriversAll 01515 0 .% JDBCDriversClient 01414 0 .% JDBCDriversEmbedded 0179179 039.33% derbyall 022 084.18% compatibility 022 0 .% demoSuite sol 01459314593 0 1128.54% suitesAll 01515 0 .% jdbcapiAutoLoad 01414 0 .% JDBCDriversAll 01515 0 .% JDBCDriversClient 01414 0 .% JDBCDriversEmbedded 0179179 033.38% derbyall 022 090.08% compatibility 022 0 .% demoSuite sol32 01459314593 082.72% suitesAll 01515 0 323.07% jdbcapiAutoLoad 01414 0 341.66% JDBCDriversAll 01515 0 315.38% JDBCDriversClient 01414 016.66% JDBCDriversEmbedded 0179179 0 1246.69% derbyall 022 0 19700.00% compatibility 022 0 700.00% demoSuite solN+1 01459314593 0 223.05% suitesAll 01515 0 .% jdbcapiAutoLoad 01414 0 .% JDBCDriversAll 01515 0 .% JDBCDriversClient 01414 0 .% JDBCDriversEmbedded 0179179 054.38% derbyall 022 097.89% compatibility 022 0 .% demoSuite sparc 01459314593 0 575.06% suitesAll 01515 0 .% jdbcapiAutoLoad 01414 0 .% JDBCDriversAll 01515 0 .% JDBCDriversClient 01414 0 .% JDBCDriversEmbedded 0179179 026.01% derbyall 022 082.33% compatibility 022 0 .% demoSuite vista 01460314603 0 188.36% suitesAll 01515 0 .% jdbcapiAutoLoad 01414 0 .% JDBCDriversAll 01515 0 .% JDBCDriversClient 01414 0 .% JDBCDriversEmbedded 0179179 048.89% derbyall NA NA NANA compatibility 022 0 .% demoSuite vista-64 01460314603 0 275.06% suitesAll 01515 0 .% jdbcapiAutoLoad 01414 0 .%