RE: Problem with a deadlock with Derby 10.8.1.2 and Glassfish V2.1.1
I opened DERBY-5552 https://issues.apache.org/jira/browse/DERBY-5552 I attached client side traces using traceLevel=2145 (TRACE_XA_CALLS|TRACE_PROTOCOL_FLOWS|TRACE_CONNECTS|TRACE_CONNECTION_CALL). I don't know if more is needed. I have sever side traces but it is 93Mb uncompressed and 13Mb compressed. Is there something I should look for in there to narrow down which trace files to include and upload? I have attached jstack traces of the application server and the Derby Network Server at shutdown showing the hung threads. I have attached a the output of select * from syscs_diag.transaction_table when there are no clients and no other database action showing transactions that are still present. I am trying to narrow down a test case better but have not been able to at this point.This is repeatable with my J2EE application every time however with the test setup that I have. Any further areas to look at with a debugger or outputting more tracing information will be greatly appreciated. From: Katherine Marsden [mailto:kmarsdende...@sbcglobal.net] Sent: Wednesday, December 21, 2011 7:25 PM To: derby-dev@db.apache.org Subject: Re: Problem with a deadlock with Derby 10.8.1.2 and Glassfish V2.1.1 On 12/21/2011 3:14 PM, Bergquist, Brett wrote: Will get to this tomorrow but I do see one comment in the code that I don't understand: In DRDAConnThread.java, I see: if (severity CodePoint.SVRCOD_ERROR) { // For a session ending error CodePoint.SRVCOD_ERROR you cannot // send a SQLERRRM. A CMDCHKRM is required. In XA if there is a // lock timeout it ends the whole session. I am not sure this // is the correct behaviour but if it occurs we have to send // a CMDCHKRM instead of SQLERRM writeCMDCHKRM(severity); } So what does the comment In XA if there is a lock timeout it ends the whole session refer to. Why would a lock timeout be any different than any other standard database error. It is like this is hinting at what is happening. This is a real XA transaction. What I see is that after the timeout is hit (I see it hit in Timeout.java) the error is propagated to the app server. The app server then attempts to get the error text (I don't have the code handy) which attempts to send a request back to the Derby. This then fails with a No Connection error being returned back from Derby. It is as if after this error, the connection between the app server and Derby is no longer once there this is hit. I agree that would not be the correct behavior if a lock timeout killed the session. As this is a server side comment it would imply that this is a problem with embedded as well as well, but hard to believe it would not have been exposed before now. Thanks for working on reproduction for this. I don't see the comment in the original code import but the annotation is not clear as it mentions the back out of another fix, so I am not sure who first noticed this behavior.
Re: Problem with a deadlock with Derby 10.8.1.2 and Glassfish V2.1.1
On 12/21/2011 6:14 AM, Bergquist, Brett wrote: I have a problem with a deadlock with Derby 10.8.1.2 and Glassfish V2.1.1 but at this point I cannot tell if it is a Derby problem or a Glassfish problem so I am looking for some guidance. A database pool is created using the org.apache.derby.jdbc.ClientXADataSource XADataSource. Connection validation is turned on and uses auto-commit to validate the connection and attempts to close all connections on any validation failure. The deadlock is triggered when I have a couple of transactions that cause a Derby locks deadlock. Deadlock detection and tracing is turned on in Derby and it correctly detects the locking deadlock and terminates one of the transactions. Soon after this, however, Glassfish becomes stuck. Here is the only reference to Derby in the jstack output of Glassfish: PMCollectionActivationManager.newPmTestResult daemon prio=3 tid=0x000103926800 nid=0xaf runnable [0xfffc218fa000] java.lang.Thread.State: RUNNABLE at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:129) at org.apache.derby.client.net.Reply.fill(Unknown Source) at org.apache.derby.client.net.Reply.ensureALayerDataInBuffer(Unknown Source) at org.apache.derby.client.net.Reply.readDssHeader(Unknown Source) at org.apache.derby.client.net.Reply.startSameIdChainParse(Unknown Source) at org.apache.derby.client.net.NetXAConnectionReply.readLocalXACommit(Unknown Source) at org.apache.derby.client.net.NetXAConnection.readLocalXACommit_(Unknown Source) at org.apache.derby.client.net.NetXAConnection.readCommit(Unknown Source) at org.apache.derby.client.net.NetConnection.readXACommit_(Unknown Source) at org.apache.derby.client.am.Connection.readCommit(Unknown Source) at org.apache.derby.client.am.Connection.readAutoCommit(Unknown Source) at org.apache.derby.client.am.Connection.flowClose(Unknown Source) at org.apache.derby.client.am.Connection.closeResourcesX(Unknown Source) at org.apache.derby.client.am.Connection.closeResources(Unknown Source) - locked 0xfffc91831e18 (a org.apache.derby.client.net.NetConnection40) at org.apache.derby.client.net.NetConnection.closeResources(Unknown Source) - locked 0xfffc91831e18 (a org.apache.derby.client.net.NetConnection40) at org.apache.derby.client.ClientPooledConnection.close(Unknown Source) - locked 0xfffc91831cf8 (a org.apache.derby.client.ClientXAConnection40) at org.apache.derby.client.ClientXAConnection.close(Unknown Source) - locked 0xfffc91831cf8 (a org.apache.derby.client.ClientXAConnection40) at com.sun.gjc.spi.ManagedConnection.destroy(ManagedConnection.java:276) First thing I would suggest is move up to 10.8.2.2 just to get running on the latest. I wonder is there indication in the derby.log what went wrong here? If not, I would suggest enabling client and server side DRDA tracing to get a clue as to what went wrong with the protocol here. http://wiki.apache.org/db-derby/ProtocolDebuggingTips Maybe there is some issue with the logic that dermines if autocommit will generate flow. // precondition: autoCommit_ is true void readAutoCommit() throws SqlException { if (willAutoCommitGenerateFlow()) { readCommit(); } } public boolean willAutoCommitGenerateFlow() throws org.apache.derby.client.am.SqlException { if (!autoCommit_) { return false; } if (! allowLocalCommitRollback_()) { return false; } return true; }
RE: Problem with a deadlock with Derby 10.8.1.2 and Glassfish V2.1.1
I cannot upgrade to 10.8.8.2 at this point as this is in production without upgrade possibility. A also at this point, I don't want to change to many variables and have the problem disappear. Right now I can get it to reproduce pretty readily. I was able to connect with a debug and indeed the thread is blocking on the java.net.SocketInputStream.read. Through the debugger if I force an interrupt, I returns from the socket call with an I/O error. So does this mean that the problem is in the derby server side since the socket is valid and not closed before I issue the interrupt? I am going to enable tracing and see what I find. From: Katherine Marsden [mailto:kmarsdende...@sbcglobal.net] Sent: Wednesday, December 21, 2011 11:14 AM To: derby-dev@db.apache.org Subject: Re: Problem with a deadlock with Derby 10.8.1.2 and Glassfish V2.1.1 On 12/21/2011 6:14 AM, Bergquist, Brett wrote: I have a problem with a deadlock with Derby 10.8.1.2 and Glassfish V2.1.1 but at this point I cannot tell if it is a Derby problem or a Glassfish problem so I am looking for some guidance. A database pool is created using the org.apache.derby.jdbc.ClientXADataSource XADataSource. Connection validation is turned on and uses auto-commit to validate the connection and attempts to close all connections on any validation failure. The deadlock is triggered when I have a couple of transactions that cause a Derby locks deadlock. Deadlock detection and tracing is turned on in Derby and it correctly detects the locking deadlock and terminates one of the transactions. Soon after this, however, Glassfish becomes stuck. Here is the only reference to Derby in the jstack output of Glassfish: PMCollectionActivationManager.newPmTestResult daemon prio=3 tid=0x000103926800 nid=0xaf runnable [0xfffc218fa000] java.lang.Thread.State: RUNNABLE at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:129) at org.apache.derby.client.net.Reply.fill(Unknown Source) at org.apache.derby.client.net.Reply.ensureALayerDataInBuffer(Unknown Source) at org.apache.derby.client.net.Reply.readDssHeader(Unknown Source) at org.apache.derby.client.net.Reply.startSameIdChainParse(Unknown Source) at org.apache.derby.client.net.NetXAConnectionReply.readLocalXACommit(Unknown Source) at org.apache.derby.client.net.NetXAConnection.readLocalXACommit_(Unknown Source) at org.apache.derby.client.net.NetXAConnection.readCommit(Unknown Source) at org.apache.derby.client.net.NetConnection.readXACommit_(Unknown Source) at org.apache.derby.client.am.Connection.readCommit(Unknown Source) at org.apache.derby.client.am.Connection.readAutoCommit(Unknown Source) at org.apache.derby.client.am.Connection.flowClose(Unknown Source) at org.apache.derby.client.am.Connection.closeResourcesX(Unknown Source) at org.apache.derby.client.am.Connection.closeResources(Unknown Source) - locked 0xfffc91831e18 (a org.apache.derby.client.net.NetConnection40) at org.apache.derby.client.net.NetConnection.closeResources(Unknown Source) - locked 0xfffc91831e18 (a org.apache.derby.client.net.NetConnection40) at org.apache.derby.client.ClientPooledConnection.close(Unknown Source) - locked 0xfffc91831cf8 (a org.apache.derby.client.ClientXAConnection40) at org.apache.derby.client.ClientXAConnection.close(Unknown Source) - locked 0xfffc91831cf8 (a org.apache.derby.client.ClientXAConnection40) at com.sun.gjc.spi.ManagedConnection.destroy(ManagedConnection.java:276) First thing I would suggest is move up to 10.8.2.2 just to get running on the latest. I wonder is there indication in the derby.log what went wrong here? If not, I would suggest enabling client and server side DRDA tracing to get a clue as to what went wrong with the protocol here. http://wiki.apache.org/db-derby/ProtocolDebuggingTips Maybe there is some issue with the logic that dermines if autocommit will generate flow. // precondition: autoCommit_ is true void readAutoCommit() throws SqlException { if (willAutoCommitGenerateFlow()) { readCommit(); } } public boolean willAutoCommitGenerateFlow() throws org.apache.derby.client.am.SqlException { if (!autoCommit_) { return false; } if (! allowLocalCommitRollback_()) { return false; } return true; }
RE: Problem with a deadlock with Derby 10.8.1.2 and Glassfish V2.1.1
I'm having some trouble getting client side tracing to work. The connections are managed by Glassfish connection pool so I don't know where to set the traceDirectory and traceLevel properties. Can these be specified as properties like the password, etc. Just a little more info as well. Once of the threads got this error: [#|2011-12-21T10:28:54.224-0800|INFO|sun-appserver2.1|javax.enterprise.system.container.ejb|_ThreadID=20;_ThreadName=httpSSLWorkerThread-8080-0;| javax.ejb.TransactionRolledbackLocalException: Exception thrown from bean; nested exception is: Exception [EclipseLink-4002] (Eclipse Persistence Services - 1.1.3.v20091002-r5404): org.eclipse.persistence.exceptions.DatabaseException Internal Exception: java.sql.SQLTransactionRollbackException: DERBY SQL error: SQLCODE: -1, SQLSTATE: 40XL2, SQLERRMC: There was a long time where a lock could not be obtained and then next thing reported by the thread is: [#|2011-12-21T10:28:54.269-0800|WARNING|sun-appserver2.1|javax.enterprise.system.core.transaction|_ThreadID=20;_ThreadName=httpSSLWorkerThread-8080-0;_RequestID=5e6bd3ac-516f-4d96-8018-baf3ce187f69;|JTS5064: Unexpected exception occurred while delisting the resource org.apache.derby.client.am.XaException: XAER_RMFAIL : No current connection. at org.apache.derby.client.net.NetXAResource.throwXAException(NetXAResource.java:766) at org.apache.derby.client.net.NetXAResource.throwXAException(NetXAResource.java:663) at org.apache.derby.client.net.NetXAResource.connectionClosedFailure(NetXAResource.java:933) at org.apache.derby.client.net.NetXAResource.end(NetXAResource.java:230) at com.sun.gjc.spi.XAResourceImpl.end(XAResourceImpl.java:100) at com.sun.jts.jta.TransactionState.rollback(TransactionState.java:199) at com.sun.jts.jtsxa.OTSResourceImpl.rollback(OTSResourceImpl.java:311) And then later: [#|2011-12-21T10:28:54.380-0800|SEVERE|sun-appserver2.1|javax.enterprise.system.core.transaction|_ThreadID=20;_ThreadName=httpSSLWorkerThread-8080-0;org.omg.CORBA.INTERNAL: vmcid: 0x0 minor code: 0 completed: Maybe;rollback;_RequestID=5e6bd3ac-516f-4d96-8018-baf3ce187f69;|JTS5031: Exception [org.omg.CORBA.INTERNAL: vmcid: 0x0 minor code: 0 completed: Maybe] on Resource [rollback] operation.|#] And then later: [#|2011-12-21T10:28:54.382-0800|WARNING|sun-appserver2.1|javax.enterprise.resource.resourceadapter|_ThreadID=20;_ThreadName=httpSSLWorkerThread-8080-0;_RequestID=5e6bd3ac-516f-4d96-8018-baf3ce187f69;|RAR7115: Unable to set ClientInfo for connection java.sql.SQLClientInfoException: No current connection. at org.apache.derby.client.net.NetConnection40.setClientInfo(NetConnection40.java:274) at org.apache.derby.client.am.LogicalConnection40.setClientInfo(LogicalConnection40.java:254) at com.sun.gjc.spi.jdbc40.ConnectionHolder40.setClientInfo(ConnectionHolder40.java:304) at com.sun.gjc.spi.jdbc40.ConnectionHolder40.close(ConnectionHolder40.java:513) at org.eclipse.persistence.internal.databaseaccess.DatabaseAccessor.closeDatasourceConnection(DatabaseAccessor.java:450) at org.eclipse.persistence.internal.databaseaccess.DatasourceAccessor.closeConnection(DatasourceAccessor.java:488) Another thread saw: [#|2011-12-21T10:28:54.315-0800|INFO|sun-appserver2.1|javax.enterprise.system.container.ejb|_ThreadID=75;_ThreadName=httpSSLWorkerThread-8080-3;| javax.ejb.TransactionRolledbackLocalException: Exception thrown from bean; nested exception is: Exception [EclipseLink-4002] (Eclipse Persistence Services - 1.1.3.v20091002-r5404): org.eclipse.persistence.exceptions.DatabaseException Internal Exception: java.sql.SQLTransactionRollbackException: DERBY SQL error: SQLCODE: -1, SQLSTATE: 40XL2, SQLERRMC: Wed Dec 21 10:28:54 PST 2011 And then saw: [#|2011-12-21T10:28:54.326-0800|WARNING|sun-appserver2.1|javax.enterprise.system.core.transaction|_ThreadID=75;_ThreadName=httpSSLWorkerThread-8080-3;_RequestID=b2c68a79-1eb1-497d-aab5-f3875f656ed4;|JTS5064: Unexpected exception occurred while delisting the resource java.lang.NullPointerException at org.apache.derby.client.net.Request.sendBytes(Request.java:1289) at org.apache.derby.client.net.Request.flush(Request.java:1283) at org.apache.derby.client.net.NetAgent.sendRequest(NetAgent.java:393) at org.apache.derby.client.net.NetAgent.flush_(NetAgent.java:273) at org.apache.derby.client.am.Agent.flowOutsideUOW(Agent.java:196) at org.apache.derby.client.net.NetXAResource.end(NetXAResource.java:242) at com.sun.gjc.spi.XAResourceImpl.end(XAResourceImpl.java:100) and then saw:
Re: Problem with a deadlock with Derby 10.8.1.2 and Glassfish V2.1.1
On 12/21/2011 8:34 AM, Bergquist, Brett wrote: I cannot upgrade to 10.8.8.2 at this point as this is in production without upgrade possibility. A also at this point, I don't want to change to many variables and have the problem disappear. Right now I can get it to reproduce pretty readily. I was able to connect with a debug and indeed the thread is blocking on the java.net.SocketInputStream.read. Through the debugger if I force an interrupt, I returns from the socket call with an I/O error. So does this mean that the problem is in the derby server side since the socket is valid and not closed before I issue the interrupt? If you are not seeing the expected SYNCCRD response from the server in your trace, I would expect it is either a) A problem on the server side that it is not being sent or b) a problem on the client side in terms of determining wither this should flow. Was there anything of interest in the derby.log? I would suggest you go ahead and file a Jira issue so you can attach the trace or trace excerpts and reproduction when you get one. Please put a pointer to this email thread in there. Thanks Kathey
Re: Problem with a deadlock with Derby 10.8.1.2 and Glassfish V2.1.1
On 12/21/2011 11:20 AM, Bergquist, Brett wrote: I'm having some trouble getting client side tracing to work. The connections are managed by Glassfish connection pool so I don't know where to set the traceDirectory and traceLevel properties. Can these be specified as properties like the password, etc. They can be set on the connection URL or with undocumented system properties, documented here #:) http://wiki.apache.org/db-derby/UndocumentedDerbyBehavior Looking at the info, again I am curious if there are corresponding server side traces in the derby.log. Also it would be interesting to see if there are at this point any XA Transactions in need of recovery in the database. Just exit your application and connect with ij and run: select * from SYSCS_DIAG.TRANSACTION_TABLE ;
RE: Problem with a deadlock with Derby 10.8.1.2 and Glassfish V2.1.1
Nothing in the Derby log other than it logging a deadlock with the statements and a lock timeout with its statements and it indicating that cleanup had started and completed. I will enable tracing with the documented (undocumented system property). Thanks for that information! I will check for the XA transactions the next time I reproduce this. Maybe you could point me into the correct area to look. This seems to be triggered either through a lock timeout or a deadlock. The connection that this is occurring through is an XA connection. I see the logging of this in the server log but I am trying to find out where that would be logged from. It seems after this occurs and because of the way connection pool is being validated and recreated on error by Glassfish (configured to do so), it gets into this state. What I don't understand is why this type of error would cause the connection to appear to be invalid and I am trying to work through both the Glassfish source and the Derby source to find out. The connection is correctly handling other errors such as a duplication trying to be inserted and this does not trigger the connection to appear to be invalid.So I am trying to understand why a lock timeout or deadlock detection might do so. This problem has only cropped up recently when they started performing multiple requests that I know have a deadlock path through them. I can fix that problem later but this is a system level problem that I need to resolve. I really do appreciate the help and guidance and am willing to try to work though this. I have to figure this out and either patch Glassfish or Derby in any case as my customer (think very very large wireless carrier) is getting pretty PO'ed. From: Katherine Marsden [mailto:kmarsdende...@sbcglobal.net] Sent: Wednesday, December 21, 2011 2:46 PM To: derby-dev@db.apache.org Subject: Re: Problem with a deadlock with Derby 10.8.1.2 and Glassfish V2.1.1 On 12/21/2011 11:20 AM, Bergquist, Brett wrote: I'm having some trouble getting client side tracing to work. The connections are managed by Glassfish connection pool so I don't know where to set the traceDirectory and traceLevel properties. Can these be specified as properties like the password, etc. They can be set on the connection URL or with undocumented system properties, documented here #:) http://wiki.apache.org/db-derby/UndocumentedDerbyBehavior Looking at the info, again I am curious if there are corresponding server side traces in the derby.log. Also it would be interesting to see if there are at this point any XA Transactions in need of recovery in the database. Just exit your application and connect with ij and run: select * from SYSCS_DIAG.TRANSACTION_TABLE ;
RE: Problem with a deadlock with Derby 10.8.1.2 and Glassfish V2.1.1
After shutting down the application server but leaving the database engine running, it still has transactions: 6789036870 (4871251,2b0425fa2e6273756e332d7369743233372c7365727665722c5033373030,73756e332d7369743233372c7365727665722c50333730302c00) CSEMUserTransaction ACTIVE (1735812,468154) null 6789064593 (4871251,151225fa2e6273756e332d7369743233372c7365727665722c5033373030,73756e332d7369743233372c7365727665722c50333730302c00) CSEMUserTransaction ACTIVE null select max(csid) from ( select min(cs.id) as csid, min(cs.configuration_number) as csnum, cbe.id as cbid from --DERBY-PROPERTIES joinOrder=FIXED/n core_v1.configurable_hardware ch join core_v1.configuration_set cs on cs.configurable_hardware_id = ch.id join core_v1.configurationset_configurationbundle cscb on cscb.configurationset_id = cs.id join core_v1.configuration_bundle cb on cb.id = cscb.configurationbundle_id join pkg_9145e10g.configuration_bundle_9145e10g cbe on cbe.id = cb.id where ch.id = ? and cb.bundle_name = ? group by cbe.id ) as lcs 6789041545 (4871251,230625fa2e6273756e332d7369743233372c7365727665722c5033373030,73756e332d7369743233372c7365727665722c50333730302c00) CSEMUserTransaction ACTIVE null select max(csid) from ( select min(cs.id) as csid, min(cs.configuration_number) as csnum, cbe.id as cbid from --DERBY-PROPERTIES joinOrder=FIXED/n core_v1.configurable_hardware ch join core_v1.configuration_set cs on cs.configurable_hardware_id = ch.id join core_v1.configurationset_configurationbundle cscb on cscb.configurationset_id = cs.id join core_v1.configuration_bundle cb on cb.id = cscb.configurationbundle_id join pkg_9145e10g.configuration_bundle_9145e10g cbe on cbe.id = cb.id where ch.id = ? and cb.bundle_name = ? group by cbe.id ) as lcs 6789042080 null CSEMUserTransaction ACTIVE null SELECT t0.ID, t0.DTYPE, t0.BUNDLE_NAME, t0.OPLOCK, t1.ID, t2.ID, t2.PM_END_DELAY, t2.PM_BETWEEN_TIME, t2.PM_DIS_OVER_TIME, t2.PM_DIS_END_DELAY, t2.PM_SCHEDULER_POLICY, t2.PM_SCHEDULER_STATE FROM CORE_V1.CONFIGURATIONSET_CONFIGURATIONBUNDLE t4, CORE_V1.CONFIGURATION_SET t3, PKG_9145E10G.PM_SCHEDULER_CONFIG_BUNDLE t2, PKG_9145E10G.CONFIGURATION_BUNDLE_9145E10G t1, CORE_V1.CONFIGURATION_BUNDLE t0 WHERE t3.ID = CAST (? AS INTEGER )) AND (t0.BUNDLE_NAME = CAST (? AS VARCHAR(32672) ))) AND (((t2.ID = t0.ID) AND (t1.ID = t0.ID)) AND (t0.DTYPE = 'PM_SCHEDULER_CONFIG_BUNDLE_9145E10G'))) AND ((t4.CONFIGURATIONBUNDLE_ID = t0.ID) AND (t3.ID = t4.CONFIGURATIONSET_ID))) From: Katherine Marsden [mailto:kmarsdende...@sbcglobal.net] Sent: Wednesday, December 21, 2011 2:46 PM To: derby-dev@db.apache.org Subject: Re: Problem with a deadlock with Derby 10.8.1.2 and Glassfish V2.1.1 On 12/21/2011 11:20 AM, Bergquist, Brett wrote: I'm having some trouble getting client side tracing to work. The connections are managed by Glassfish connection pool so I don't know where to set the traceDirectory and traceLevel properties. Can these be specified as properties like the password, etc. They can be set on the connection URL or with undocumented system properties, documented here #:) http://wiki.apache.org/db-derby/UndocumentedDerbyBehavior Looking at the info, again I am curious if there are corresponding server side traces in the derby.log. Also it would be interesting to see if there are at this point any XA Transactions in need of recovery in the database. Just exit your application and connect with ij and run: select * from SYSCS_DIAG.TRANSACTION_TABLE ;
RE: Problem with a deadlock with Derby 10.8.1.2 and Glassfish V2.1.1
This also triggers another problem that I am having in production. If these transactions are left in this state, Derby will build thousands of transaction files in the log database. If the server is not stopped and restarted soon after these start occurring, then it may take days for Derby to boot the database. I assume that these are XA transactions that did not complete. I have Glassfish to timeout transactions after 15 minutes but this appears not to be working. I would like to set the derby.jdbc.xaTransactionTimeout property but I am leery because of https://issues.apache.org/jira/browse/DERBY-4109 From: Bergquist, Brett [mailto:bbergqu...@canoga.com] Sent: Wednesday, December 21, 2011 3:33 PM To: derby-dev@db.apache.org Subject: RE: Problem with a deadlock with Derby 10.8.1.2 and Glassfish V2.1.1 After shutting down the application server but leaving the database engine running, it still has transactions: 6789036870 (4871251,2b0425fa2e6273756e332d7369743233372c7365727665722c5033373030,73756e332d7369743233372c7365727665722c50333730302c00) CSEMUserTransaction ACTIVE (1735812,468154) null 6789064593 (4871251,151225fa2e6273756e332d7369743233372c7365727665722c5033373030,73756e332d7369743233372c7365727665722c50333730302c00) CSEMUserTransaction ACTIVE null select max(csid) from ( select min(cs.id) as csid, min(cs.configuration_number) as csnum, cbe.id as cbid from --DERBY-PROPERTIES joinOrder=FIXED/n core_v1.configurable_hardware ch join core_v1.configuration_set cs on cs.configurable_hardware_id = ch.id join core_v1.configurationset_configurationbundle cscb on cscb.configurationset_id = cs.id join core_v1.configuration_bundle cb on cb.id = cscb.configurationbundle_id join pkg_9145e10g.configuration_bundle_9145e10g cbe on cbe.id = cb.id where ch.id = ? and cb.bundle_name = ? group by cbe.id ) as lcs 6789041545 (4871251,230625fa2e6273756e332d7369743233372c7365727665722c5033373030,73756e332d7369743233372c7365727665722c50333730302c00) CSEMUserTransaction ACTIVE null select max(csid) from ( select min(cs.id) as csid, min(cs.configuration_number) as csnum, cbe.id as cbid from --DERBY-PROPERTIES joinOrder=FIXED/n core_v1.configurable_hardware ch join core_v1.configuration_set cs on cs.configurable_hardware_id = ch.id join core_v1.configurationset_configurationbundle cscb on cscb.configurationset_id = cs.id join core_v1.configuration_bundle cb on cb.id = cscb.configurationbundle_id join pkg_9145e10g.configuration_bundle_9145e10g cbe on cbe.id = cb.id where ch.id = ? and cb.bundle_name = ? group by cbe.id ) as lcs 6789042080 null CSEMUserTransaction ACTIVE null SELECT t0.ID, t0.DTYPE, t0.BUNDLE_NAME, t0.OPLOCK, t1.ID, t2.ID, t2.PM_END_DELAY, t2.PM_BETWEEN_TIME, t2.PM_DIS_OVER_TIME, t2.PM_DIS_END_DELAY, t2.PM_SCHEDULER_POLICY, t2.PM_SCHEDULER_STATE FROM CORE_V1.CONFIGURATIONSET_CONFIGURATIONBUNDLE t4, CORE_V1.CONFIGURATION_SET t3, PKG_9145E10G.PM_SCHEDULER_CONFIG_BUNDLE t2, PKG_9145E10G.CONFIGURATION_BUNDLE_9145E10G t1, CORE_V1.CONFIGURATION_BUNDLE t0 WHERE t3.ID = CAST (? AS INTEGER )) AND (t0.BUNDLE_NAME = CAST (? AS VARCHAR(32672) ))) AND (((t2.ID = t0.ID) AND (t1.ID = t0.ID)) AND (t0.DTYPE = 'PM_SCHEDULER_CONFIG_BUNDLE_9145E10G'))) AND ((t4.CONFIGURATIONBUNDLE_ID = t0.ID) AND (t3.ID = t4.CONFIGURATIONSET_ID))) From: Katherine Marsden [mailto:kmarsdende...@sbcglobal.net] Sent: Wednesday, December 21, 2011 2:46 PM To: derby-dev@db.apache.org Subject: Re: Problem with a deadlock with Derby 10.8.1.2 and Glassfish V2.1.1 On 12/21/2011 11:20 AM, Bergquist, Brett wrote: I'm having some trouble getting client side tracing to work. The connections are managed by Glassfish connection pool so I don't know where to set the traceDirectory and traceLevel properties. Can these be specified as properties like the password, etc. They can be set on the connection URL or with undocumented system properties, documented here #:) http://wiki.apache.org/db-derby/UndocumentedDerbyBehavior Looking at the info, again I am curious if there are corresponding server side traces in the derby.log. Also it would be interesting to see if there are at this point any XA Transactions in need of recovery in the database. Just exit your application and connect with ij and run: select * from SYSCS_DIAG.TRANSACTION_TABLE ;
Re: Problem with a deadlock with Derby 10.8.1.2 and Glassfish V2.1.1
On 12/21/2011 12:04 PM, Bergquist, Brett wrote: Nothing in the Derby log other than it logging a deadlock with the statements and a lock timeout with its statements and it indicating that cleanup had started and completed. I will enable tracing with the documented (undocumented system property). Thanks for that information! I will check for the XA transactions the next time I reproduce this. Maybe you could point me into the correct area to look. This seems to be triggered either through a lock timeout or a deadlock. The connection that this is occurring through is an XA connection. I see the logging of this in the server log but I am trying to find out where that would be logged from. It seems after this occurs and because of the way connection pool is being validated and recreated on error by Glassfish (configured to do so), it gets into this state. What I don't understand is why this type of error would cause the connection to appear to be invalid and I am trying to work through both the Glassfish source and the Derby source to find out. The connection is correctly handling other errors such as a duplication trying to be inserted and this does not trigger the connection to appear to be invalid.So I am trying to understand why a lock timeout or deadlock detection might do so. This problem has only cropped up recently when they started performing multiple requests that I know have a deadlock path through them. I can fix that problem later but this is a system level problem that I need to resolve. I really do appreciate the help and guidance and am willing to try to work though this. I have to figure this out and either patch Glassfish or Derby in any case as my customer (think very very large wireless carrier) is getting pretty PO'ed. The one thing I think of specifically with a deadlock is that it will automatically rollback the victim transaction and that might throw off this client logic regarding the state of the server.But I would think if there were just a simple problem with deadlocks it would have showed up before now. That said I don't see any specific tests in our XA tests: org.apache.derbyTesting.functionTests.tests.jdbapi.XATest or org.apache.derbyTesting.functionTests.tests.jdbcapi.XATransactionTest that test XAConnections with deadlocks. Is this a local transaction on an XA connection or a real XA transaction with two phase commit? You might want to try to test and an XAConnection with a simple deadlock case locally to see if that pops a reproduction. org.apache.derbyTesting.functionTests.tests.lang.DeadlockDetectionTest and org.apache.derbyTesting.functionTests.tests.lang have some examples of deadlocks. HTH, Kathey
Re: Problem with a deadlock with Derby 10.8.1.2 and Glassfish V2.1.1
On 12/21/2011 12:40 PM, Bergquist, Brett wrote: This also triggers another problem that I am having in production. If these transactions are left in this state, Derby will build thousands of transaction files in the log database. If the server is not stopped and restarted soon after these start occurring, then it may take days for Derby to boot the database. Uncompleted XA transactions will persist even after the the server is stopped and have to be recovered XAResource.recover() and told explicitly whether to commit or rollback. An example of using XAResource.recover() is in XATest testInterleavingTransactions(). Once the core problem is determined, you should run an xa recovery operation to clean up the residual transactions. I assume that these are XA transactions that did not complete. I have Glassfish to timeout transactions after 15 minutes but this appears not to be working. I would like to set the derby.jdbc.xaTransactionTimeout property but I am leery because of https://issues.apache.org/jira/browse/DERBY-4109 I can't really say to much about DERBY-4109 as we don't have a reproduction and I don't understand the circumstances under which it occurs, but in your case, I tend to think extending timeout is not going to help as it seems you have hit a protocol exchange where the server will never respond to the client. Looking forward to seeing the excerpts from the client and server traces after the deadlock and hearing how your simple deadlock test goes. If you need a simple standalone template for XA to start from, I think ReproDerby1016.java attached to DERBY-1016 might work. Thanks Kathey
Re: Problem with a deadlock with Derby 10.8.1.2 and Glassfish V2.1.1
On 12/21/2011 1:31 PM, Katherine Marsden wrote: One more random bit of information for your test case. I recall when I needed to add an XA Specific test case adding the code below to SavepointJdbc30Test to get the existing test to run as well with XA. You might try doing that with the existing deadlock test cases to pop the issue. I don't think though that it tests two phase commit. // Repeat the embedded tests obtaining a connection from // an XA data source if it is supported. This is not supported // under JSR169. if (JDBC.vmSupportsJDBC3()) { embedded = new TestSuite( SavepointJdbc30_JSR169Test:embedded XADataSource); embedded.addTestSuite(SavepointJdbc30Test.class); embedded.addTest(getEmbeddedSuite(SavepointJdbc30_JSR169Test: + embedded only XADataSource)); suite.addTest(TestConfiguration.connectionXADecorator(embedded)); // Repeat the client tests obtaining a connection from // an XA data source if it is supported. This is not supported // under JSR169. client = new TestSuite(SavepointJdbc30_JSR169Test:client XADatasource); client.addTestSuite(SavepointJdbc30Test.class); suite.addTest(TestConfiguration.clientServerDecorator(TestConfiguration.connectionXADecorator(client))); } It would be neat to have modes in which all the tests could be run using different types of connections, etc. https://issues.apache.org/jira/browse/DERBY-1134 Kathey
RE: Problem with a deadlock with Derby 10.8.1.2 and Glassfish V2.1.1
Will get to this tomorrow but I do see one comment in the code that I don't understand: In DRDAConnThread.java, I see: if (severity CodePoint.SVRCOD_ERROR) { // For a session ending error CodePoint.SRVCOD_ERROR you cannot // send a SQLERRRM. A CMDCHKRM is required. In XA if there is a // lock timeout it ends the whole session. I am not sure this // is the correct behaviour but if it occurs we have to send // a CMDCHKRM instead of SQLERRM writeCMDCHKRM(severity); } So what does the comment In XA if there is a lock timeout it ends the whole session refer to. Why would a lock timeout be any different than any other standard database error. It is like this is hinting at what is happening. This is a real XA transaction. What I see is that after the timeout is hit (I see it hit in Timeout.java) the error is propagated to the app server. The app server then attempts to get the error text (I don't have the code handy) which attempts to send a request back to the Derby. This then fails with a No Connection error being returned back from Derby. It is as if after this error, the connection between the app server and Derby is no longer once there this is hit. I am going to go try to follow through the code and get a smaller reproducible example. From: Katherine Marsden [mailto:kmarsdende...@sbcglobal.net] Sent: Wednesday, December 21, 2011 3:41 PM To: derby-dev@db.apache.org Subject: Re: Problem with a deadlock with Derby 10.8.1.2 and Glassfish V2.1.1 On 12/21/2011 12:04 PM, Bergquist, Brett wrote: Nothing in the Derby log other than it logging a deadlock with the statements and a lock timeout with its statements and it indicating that cleanup had started and completed. I will enable tracing with the documented (undocumented system property). Thanks for that information! I will check for the XA transactions the next time I reproduce this. Maybe you could point me into the correct area to look. This seems to be triggered either through a lock timeout or a deadlock. The connection that this is occurring through is an XA connection. I see the logging of this in the server log but I am trying to find out where that would be logged from. It seems after this occurs and because of the way connection pool is being validated and recreated on error by Glassfish (configured to do so), it gets into this state. What I don't understand is why this type of error would cause the connection to appear to be invalid and I am trying to work through both the Glassfish source and the Derby source to find out. The connection is correctly handling other errors such as a duplication trying to be inserted and this does not trigger the connection to appear to be invalid.So I am trying to understand why a lock timeout or deadlock detection might do so. This problem has only cropped up recently when they started performing multiple requests that I know have a deadlock path through them. I can fix that problem later but this is a system level problem that I need to resolve. I really do appreciate the help and guidance and am willing to try to work though this. I have to figure this out and either patch Glassfish or Derby in any case as my customer (think very very large wireless carrier) is getting pretty PO'ed. The one thing I think of specifically with a deadlock is that it will automatically rollback the victim transaction and that might throw off this client logic regarding the state of the server.But I would think if there were just a simple problem with deadlocks it would have showed up before now. That said I don't see any specific tests in our XA tests: org.apache.derbyTesting.functionTests.tests.jdbapi.XATest or org.apache.derbyTesting.functionTests.tests.jdbcapi.XATransactionTest that test XAConnections with deadlocks. Is this a local transaction on an XA connection or a real XA transaction with two phase commit? You might want to try to test and an XAConnection with a simple deadlock case locally to see if that pops a reproduction. org.apache.derbyTesting.functionTests.tests.lang.DeadlockDetectionTest and org.apache.derbyTesting.functionTests.tests.lang have some examples of deadlocks. HTH, Kathey
Re: Problem with a deadlock with Derby 10.8.1.2 and Glassfish V2.1.1
On 12/21/2011 3:14 PM, Bergquist, Brett wrote: Will get to this tomorrow but I do see one comment in the code that I don't understand: In DRDAConnThread.java, I see: if (severity CodePoint.SVRCOD_ERROR) { // For a session ending error CodePoint.SRVCOD_ERROR you cannot // send a SQLERRRM. A CMDCHKRM is required. In XA if there is a // lock timeout it ends the whole session. I am not sure this // is the correct behaviour but if it occurs we have to send // a CMDCHKRM instead of SQLERRM writeCMDCHKRM(severity); } So what does the comment In XA if there is a lock timeout it ends the whole session refer to. Why would a lock timeout be any different than any other standard database error. It is like this is hinting at what is happening. This is a real XA transaction. What I see is that after the timeout is hit (I see it hit in Timeout.java) the error is propagated to the app server. The app server then attempts to get the error text (I don't have the code handy) which attempts to send a request back to the Derby. This then fails with a No Connection error being returned back from Derby. It is as if after this error, the connection between the app server and Derby is no longer once there this is hit. I agree that would not be the correct behavior if a lock timeout killed the session. As this is a server side comment it would imply that this is a problem with embedded as well as well, but hard to believe it would not have been exposed before now. Thanks for working on reproduction for this. I don't see the comment in the original code import but the annotation is not clear as it mentions the back out of another fix, so I am not sure who first noticed this behavior.