Re: Review Request: Address deadlock btw _messageDeliveryLock and _failoverMutex

Robbie Gemmell Sat, 27 Apr 2013 05:59:22 -0700


> On April 24, 2013, 4:24 p.m., Robbie Gemmell wrote:
> > I don't really agree that there is only a slim chance a delivery can begin 
> > before the session is marked closed, it actually seems fairly likely to 
> > occur (unless none of the queues used by the consumers on the session have 
> > any messages left, like most of our tests). This could lead to a variety of 
> > issues as a result, because important portions of the client depend on 
> > being able to guarantee no more messages are being put into the consumer 
> > receive queue or onMessage() callbacks fired while they are operating. I 
> > don't get the impression these have been investigated enough to prove their 
> > correctness in the face of this change, so it doesn't seem particularly 
> > acceptable a trade-off as it stands.
> > 
> > It seems you have already noticed that there would be issues with the 
> > consumer close, since it would no longer stop deliveries occurring while 
> > closing was in progress, making it one of the cases mentioned above.
> > 
> > Given that msgDeliveriesInProgress is per-session and only 
> > incremented/decremented by the Dispatcher thread in a single place, 
> > shouldn't it just be an AtomicBoolean?
> 
> rajith attapattu wrote:
>     For the above case, the following needs to happen.
>     
>     1. A delivery is started just before session close is called, but haven't 
> got to the point where the count is incremented, by the time the close method 
> calls waitForMsgDeliveryToFinish().
>     2. There are no outstanding deliveries at the time 
> waitForMsgDeliveryToFinish() looks at "_msgDeliveriesInProgress".
>     
>     That is why I think there is a "slim" chance. If there are messages left 
> on the queue, then there is reasonable chance that _msgDeliveriesInProgress 
> is not zero. So the waitForMsgDeliveryToFinish() will block anyways.
>     I'm currently working on plugging this gap. Provided this is taken care 
> of do you see any holes in this approach ?
>     
>     As for why I think this is an acceptable trade-off, most customers prefer 
> this loop hole over a client that deadlocks every now and then.
>     This deadlock has been around for a long time with no solution and there 
> are several customers who are experiencing this deadlock in their production 
> environment.
>     
>     "I don't get the impression these have been investigated enough to prove 
> their correctness in the face of this change, so it doesn't seem particularly 
> acceptable a trade-off as it stands." -- There is ample evidence that the 
> current setup is causing several deadlocks to the point of our client being 
> useless. This solution (baring the above issue) passes all the tests, 
> including a trial run at customer environments.
>     What would you deem acceptable (provided the current issue above is 
> resolved) for you to gain confidence for the proposed fix ?
>     
>     Do you have an alternative solution? Or a way to improve the current one. 
> I'm sure we can find a way if we put our heads together.
>     The bottom line is this issue needs to be resolved, or else people will 
> loose confidence in our client.
> 
> rajith attapattu wrote:
>     "It seems you have already noticed that there would be issues with the 
> consumer close, since it would no longer stop deliveries occurring while 
> closing was in progress, making it one of the cases mentioned above." -- This 
> was identified at the outset, but wasn't included in this to ensure this 
> aspect of the solution is highlighted and debated first.
>     
>     I'm currently working on a solution to this, which I'm hoping will also 
> serve as a solution to the problem that's used in AMQSession.
>     Again suggestions and ideas are most welcomed.
> 
> Robbie Gemmell wrote:
>     That they dont fail is good, but I think we should recognise that the 
> existing tests passing is no real guarantee that the change is sufficiently 
> baked. They largely don't cover the behaviours under discussion here, and 
> haven't shown up the problems prompting the change as a result. On that 
> note..are there new tests coming that do (even if only occasionally, given 
> these are often races), in order to validate the change is effective and the 
> properly maintained going forward? 
>     
>     It didn't take long just looking at the code to spot areas of concern. 
> I'd like to see solutions or feel that the changes in behaviour have been 
> investigated and their impact sufficiently reasoned about to ensure the 
> continued correctness of other areas in the client that might depended on the 
> previous behaviour...e.g the consumer close.
>     
>     It seems like compareAndSet operations on the delivery tracking object 
> could be used to extend the proposed change and impede forward progress of 
> the threads (without deadlock) while their desired condition is not met, 
> rather than just using the current value to make an independent decision 
> which would potentially allow the various threads to get into an undesirable 
> state.
>     
>     I forgot to say in my original reply, you mentioned not holding up 
> session close due to an in-progress onMessage delivery but that is actually 
> required by JMS: "This call will block until a receive call or message 
> listener in progress has completed."
>     
>     ..and with that, its now definitely after-work o'clock :)
> 
> rajith attapattu wrote:
>     I'm trying to see if I can add a test case to our test suite to verify 
> this. But as you said, these are hard to automate given their nature. The 
> testing we've done so far manual. We've got reproducers for all the deadlock 
> cases.
>     
>     "It didn't take long just looking at the code to spot areas of concern" 
> -- Other than the consumer close issue, are there any other concerns you had? 
>     As mentioned these issues were identified at the outset. This patch is 
> posted to debate the merits of the approach and/or as a stage to come up with 
> an even better approach.
>     
>     "but that is actually required by JMS: "This call will block until a 
> receive call or message listener in progress has completed." -- This can be 
> simply done by removing the timeout.
>     Perhaps we should make this configurable, the default being what JMS 
> specifies. The timeout is handy in a lot of cases, where you want to recover 
> from wedged onMessage() methods.

"I'm trying to see if I can add a test case to our test suite to verify this. 
But as you said, these are hard to automate given their nature. The testing 
we've done so far manual. We've got reproducers for all the deadlock cases."
It might be hard to automate a test which consistently fails but it shouldn't 
be too hard to automate something that fails sporadically, which would still be 
infinitely better than tests which never show the issue at all.

"Other than the consumer close issue, are there any other concerns you had? "
I still think that the 'delivering messages after session close has begun' 
thing is an issue, and I doubt the clients ability to handle it well when it 
occurs. If that were to be left a possability then I'd at least like to see 
some reasoned investigation into what happens when it occurs and details as to 
why it shouldn't cause concern. Things that spring to mind: what happens when 
onMessage completes if it is auto-ack? what happens if the user tries to use 
the session during onMessage and its now closed? Etc. There was also a deadlock 
previously that I think got resolved by use of the messageDeliveryLock (session 
rollback and consumer close or something like that), so I'd be interested to 
know if that becomes an issue again with its removal or not.

"As mentioned these issues were identified at the outset. This patch is posted 
to debate the merits of the approach and/or as a stage to come up with an even 
better approach."
I originally looked at the patch long before it was mentioned actually, I just 
hadn't found time to post the comments until after the review was updated.

"This can be simply done by removing the timeout. Perhaps we should make this 
configurable, the default being what JMS specifies. The timeout is handy in a 
lot of cases, where you want to recover from wedged onMessage() methods."
I think anyone needing to use that has application level issues they proably 
want to look at handling themselves, and I doubt how well the client would 
behave after the situation came into effect, but so long as the default was the 
required/proper behaviour I guess I could live with it being configurable.

- Robbie

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/10738/#review19626
-----------------------------------------------------------

On April 24, 2013, 12:19 p.m., rajith attapattu wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/10738/
> -----------------------------------------------------------
> 
> (Updated April 24, 2013, 12:19 p.m.)
> 
> 
> Review request for qpid, Robbie Gemmell, Weston Price, and Rob Godfrey.
> 
> 
> Description
> -------
> 
> There are at least 3 cases where the deadlock btw _messageDeliveryLock and 
> _failoverMutex surfaces. Among them sending a message inside onMessage() and 
> the session being closed due to an error (causing the deadlock) seems to come 
> up a lot in production environments. There is also a deadlock btw 
> _messageDeliveryLock and _lock (AMQSession.java) which happens less 
> frequently.
> The messageDeliveryLock is used to ensure that we don't close the session in 
> the middle of a message delivery. In order to do this we hold the lock across 
> onMessage().
> This causes several issues in addition to the potential to deadlock. If an 
> onMessage call takes longer/wedged then you cannot close the session or 
> failover will not happen until it returns as the same thread is holding the 
> failoverMutex. 
> 
> Based on an idea from Rafi, I have come up with a solution to get rid of 
> _messageDeliveryLock and instead use an alternative strategy to achieve 
> similar functionality.
> In order to ensure that close() doesn't proceed until the message deliveries 
> currently in progress completes, an atomic counter is used to keep track of 
> message deliveries in progress.
> The close() will wait until the count falls to zero before proceeding. No new 
> deliveries will be initiated bcos the close method will mark the session as 
> closed.
> The wait has a timeout to ensure that a longer running or wedged onMessage() 
> will not hold up session close.
> There is a slim chance that before a session being marked as closed a message 
> delivery could be initiated, but not yet gotten to the point of updating the 
> counter, hence waitForMsgDeliveryToFinish() will see it as zero and proceed 
> with close. But in comparison to the issues with _messageDeliveryLock, I 
> believe it's acceptable.
> 
> There is an issue if MessageConsumer close is called outside of Session 
> close. This can be solved in a similar manner. I will wait until the current 
> review is complete and then post the solution for the MessageConsumer close.
> I will commit them both together.
> 
> 
> This addresses bug QPID-4574.
>     https://issues.apache.org/jira/browse/QPID-4574
> 
> 
> Diffs
> -----
> 
>   
> http://svn.apache.org/repos/asf/qpid/trunk/qpid/java/client/src/main/java/org/apache/qpid/client/AMQConnection.java
>  1471133 
>   
> http://svn.apache.org/repos/asf/qpid/trunk/qpid/java/client/src/main/java/org/apache/qpid/client/AMQSession.java
>  1471133 
>   
> http://svn.apache.org/repos/asf/qpid/trunk/qpid/java/client/src/main/java/org/apache/qpid/client/BasicMessageConsumer.java
>  1471133 
>   
> http://svn.apache.org/repos/asf/qpid/trunk/qpid/java/common/src/main/java/org/apache/qpid/configuration/ClientProperties.java
>  1471133 
> 
> Diff: https://reviews.apache.org/r/10738/diff/
> 
> 
> Testing
> -------
> 
> Java test suite, tests from customers and QE around the deadlock situation.
> 
> 
> Thanks,
> 
> rajith attapattu
> 
>

Re: Review Request: Address deadlock btw _messageDeliveryLock and _failoverMutex

Reply via email to