[jira] [Commented] (DERBY-6879) Engine deadlock between XA timeout handling and cleanupOnError

Brett Bergquist (JIRA) Sun, 17 Jul 2016 17:47:13 -0700

    [ https://issues.apache.org/jira/browse/DERBY-6879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15381594#comment-15381594 ]


Brett Bergquist commented on DERBY-6879:
----------------------------------------

I was able to reproduce this intermittently by putting my test system under 
load while running the tests over and over.

A timing window opens because 'cancel' does not hold synchronization on the 
XATransactionState while it calls "conn.rollback".

 I have taken a different approach for a fix.  

The problem occurs when a timeout triggers "cancel" at the same time that an 
error on the client's connection causes "cleanupOnError" to be called.  
Recognizing this, the patch makes the "cancel" method check whether 
"cleanupOnError" is being invoked and, if so, skip the cancel.  This makes 
sense because if "cleanupOnError" is being called, the transaction will end 
anyway as the error handling for the client's connection is processed, so it 
does not need to be cancelled.

The patch also adds a check in the "cleanupOnError" method to see whether 
"cancel" is being invoked.  If so, the cleanup of the XATransactionState by 
this method is skipped.  This makes sense because if the transaction is being 
cancelled, there is no need to mark the XATransactionState with the cleanup 
error information.

The patch also disassociates the XID from the ResourceAdapter earlier in 
"cancel".  The logic behind this is that once "cancel" starts processing, any 
client code that accesses the XA transaction really should not see it.  So if 
the client calls "XATransaction.end", for example, after the XA transaction 
has timed out and is being cancelled, it will receive an 
XAException.XAER_NOTA.  Note that the only change here is in the timing of 
when the XID is removed: previously this was done at the end of "cancel", and 
now it is done at the beginning.  This eliminates any possibility that client 
code can access an XA transaction after its cancellation has started.  Also 
note that this does not change the error a client would receive if the XA 
transaction were cancelled and the client accessed it seconds after the 
cancel completed; it would still receive an XAException.XAER_NOTA.
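
To illustrate the ordering described above, here is a minimal sketch. It is not 
the code from the patch: the class XidRegistrySketch, its methods, and the use 
of a String key in place of a real Xid are all hypothetical stand-ins for 
Derby's ResourceAdapter bookkeeping.  It only shows that removing the XID 
before the rollback runs makes a concurrent lookup fail with XAER_NOTA.

```java
import java.util.concurrent.ConcurrentHashMap;
import javax.transaction.xa.XAException;

// Hypothetical stand-in for the XID table kept by the resource adapter.
final class XidRegistrySketch {
    // Maps an XID (simplified to a String here) to its rollback action.
    private final ConcurrentHashMap<String, Runnable> rollbackByXid =
            new ConcurrentHashMap<>();

    void register(String xid, Runnable rollback) {
        rollbackByXid.put(xid, rollback);
    }

    // What a client-facing call such as "end" would do first:
    // fail with XAER_NOTA if the XID is no longer known.
    void requireKnown(String xid) throws XAException {
        if (!rollbackByXid.containsKey(xid)) {
            throw new XAException(XAException.XAER_NOTA);
        }
    }

    // Timeout path: disassociate the XID first, then roll back.
    void cancel(String xid) {
        Runnable rollback = rollbackByXid.remove(xid);
        if (rollback != null) {
            rollback.run();
        }
    }
}
```

With this ordering, a lookup made while the rollback is still running already 
sees the XID as unknown, which is the same error a client would get after the 
cancel had completed.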

The patch creates a new private static class that is used to track if "cancel" 
or "cleanupOnError" has been invoked.  The methods are synchronized so that 
there is no timing issue on checking and recording the state.
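
A rough sketch of what such a tracking class might look like follows.  The 
class and method names (CleanupOrCancelTracker, tryBeginCancel, 
tryBeginCleanup) are my own assumptions for illustration and are not taken 
from the attached diff.  The point is that the tracker's monitor is held only 
long enough to check and record a flag, never while waiting for the connection 
or the XATransactionState, so it cannot take part in a lock-ordering cycle.

```java
// Hypothetical tracker; the real patch may differ in names and structure.
final class CleanupOrCancelTracker {
    private boolean cancelStarted;
    private boolean cleanupStarted;

    // Called at the start of "cancel". Returns false if cleanupOnError
    // already started, in which case the caller should skip the cancel.
    synchronized boolean tryBeginCancel() {
        if (cleanupStarted) {
            return false;
        }
        cancelStarted = true;
        return true;
    }

    // Called at the start of "cleanupOnError". Returns false if cancel
    // already started, in which case the caller should skip its cleanup
    // of the XATransactionState.
    synchronized boolean tryBeginCleanup() {
        if (cancelStarted) {
            return false;
        }
        cleanupStarted = true;
        return true;
    }
}

// Small demonstration: whichever side records itself first proceeds.
final class TrackerDemo {
    public static void main(String[] args) {
        CleanupOrCancelTracker tracker = new CleanupOrCancelTracker();
        System.out.println("cancel proceeds:  " + tracker.tryBeginCancel());
        System.out.println("cleanup proceeds: " + tracker.tryBeginCleanup());
    }
}
```

Because both checks go through the same synchronized object, exactly one of 
the two paths proceeds when they race, and the other backs off.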



> Engine deadlock between XA timeout handling and cleanupOnError
> --------------------------------------------------------------
>
>                 Key: DERBY-6879
>                 URL: https://issues.apache.org/jira/browse/DERBY-6879
>             Project: Derby
>          Issue Type: Bug
>          Components: Services
>    Affects Versions: 10.10.2.0
>         Environment: Solaris 10.5 on Oracle M5000 
>            Reporter: Brett Bergquist
>         Attachments: derby-6879-2016-07-05.diff, derby-6879-2016-07-08.diff, 
> derby-6879-test.diff, svnstatus.txt, testFail.zip
>
>
> Deadlock between XA timer cleanup task and the ContextManager.cleanupOnError
> Found one Java-level deadlock:
> =============================
> "DRDAConnThread_34":
>   waiting to lock monitor 0x0000000104b14d18 (object 0xfffffffd9090f058, a org.apache.derby.jdbc.XATransactionState),
>   which is held by "Timer-0"
> "Timer-0":
>   waiting to lock monitor 0x00000001038b96e8 (object 0xfffffffd9090d8b0, a org.apache.derby.impl.jdbc.EmbedConnection40),
>   which is held by "DRDAConnThread_34"
>
> Java stack information for the threads listed above:
> ===================================================
> "DRDAConnThread_34":
>      at org.apache.derby.jdbc.XATransactionState.cleanupOnError(Unknown Source)
>      - waiting to lock <0xfffffffd9090f058> (a org.apache.derby.jdbc.XATransactionState)
>      at org.apache.derby.iapi.services.context.ContextManager.cleanupOnError(Unknown Source)
>      at org.apache.derby.impl.jdbc.TransactionResourceImpl.cleanupOnError(Unknown Source)
>      at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown Source)
>      at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown Source)
>      at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown Source)
>      at org.apache.derby.impl.jdbc.EmbedStatement.executeStatement(Unknown Source)
>      - locked <0xfffffffd9090d8b0> (a org.apache.derby.impl.jdbc.EmbedConnection40)
>      at org.apache.derby.impl.jdbc.EmbedPreparedStatement.executeStatement(Unknown Source)
>      at org.apache.derby.impl.jdbc.EmbedPreparedStatement.execute(Unknown Source)
>      at org.apache.derby.iapi.jdbc.BrokeredPreparedStatement.execute(Unknown Source)
>      at org.apache.derby.impl.drda.DRDAStatement.execute(Unknown Source)
>      at org.apache.derby.impl.drda.DRDAConnThread.parseEXCSQLSTTobjects(Unknown Source)
>      at org.apache.derby.impl.drda.DRDAConnThread.parseEXCSQLSTT(Unknown Source)
>      at org.apache.derby.impl.drda.DRDAConnThread.processCommands(Unknown Source)
>      at org.apache.derby.impl.drda.DRDAConnThread.run(Unknown Source)
> "Timer-0":
>      at org.apache.derby.impl.jdbc.EmbedConnection.xa_rollback(Unknown Source)
>      - waiting to lock <0xfffffffd9090d8b0> (a org.apache.derby.impl.jdbc.EmbedConnection40)
>      at org.apache.derby.jdbc.XATransactionState.cancel(Unknown Source)
>      - locked <0xfffffffd9090f058> (a org.apache.derby.jdbc.XATransactionState)
>      at org.apache.derby.jdbc.XATransactionState$CancelXATransactionTask.run(Unknown Source)
>      at java.util.TimerThread.mainLoop(Timer.java:555)
>      at java.util.TimerThread.run(Timer.java:505)
>  
> Found 1 deadlock.
> This deadlock caused Derby to create 18000 transaction recovery logs because 
> the XA transaction was not cleaned up on timeout.  Rebooting the system 
> would have meant a 50-hour boot time to process the transaction logs, so 
> recovery had to be done by going back to a database backup taken before the 
> issue occurred.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
