[ 
https://issues.apache.org/jira/browse/QPIDJMS-534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17329143#comment-17329143
 ] 

Robbie Gemmell commented on QPIDJMS-534:
----------------------------------------

I didn't spot your initial report.

It looks like it's waiting for a session creation to complete. You could perhaps 
look at the jms.requestTimeout option to break it out of waiting.
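
As a rough sketch of what that might look like, the snippet below appends
jms.requestTimeout to the failover URI quoted in the report. The 10000 ms
value and the helper names (FailoverUriExample, withOptions, exampleUri) are
assumptions made up for illustration, and whether the option actually breaks
the wait on 0.42.0 is untested here.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of the Qpid JMS client.
final class FailoverUriExample {

    // Appends key=value options to a connection URI as a query string.
    static String withOptions(String baseUri, Map<String, String> options) {
        if (options.isEmpty()) {
            return baseUri;
        }
        StringJoiner query = new StringJoiner("&");
        options.forEach((key, value) -> query.add(key + "=" + value));
        String separator = baseUri.contains("?") ? "&" : "?";
        return baseUri + separator + query;
    }

    // Builds the URI from the report plus an assumed request timeout.
    static String exampleUri() {
        Map<String, String> options = new LinkedHashMap<>();
        options.put("jms.sendTimeout", "5000");     // value taken from the report
        options.put("jms.requestTimeout", "10000"); // assumed value, in milliseconds
        return withOptions(
            "failover:(amqp://localhost:5672,amqp://localhost:5682)", options);
    }

    public static void main(String[] args) {
        System.out.println(exampleUri());
    }
}
```

The resulting string would then be handed to the client in the usual way, for
example through the JmsConnectionFactory constructor that takes a remote URI.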

One question is what, if anything, changed in your env around your 
'recently' timeframe to provoke hitting this behaviour now (I'm assuming you 
didn't start out with a ~2 year old client). 

There have been a bunch of releases in the subsequent couple of years. I'd also 
suggest trying them, as that's where any actual changes made around this would 
be going.

> BalancedProviderFuture.sync stuck forever during connection recovery
> --------------------------------------------------------------------
>
>                 Key: QPIDJMS-534
>                 URL: https://issues.apache.org/jira/browse/QPIDJMS-534
>             Project: Qpid JMS
>          Issue Type: Bug
>          Components: qpid-jms-client
>    Affects Versions: 0.42.0
>            Reporter: Ravi Nirmal
>            Priority: Major
>         Attachments: logs.txt, thread-dump.txt
>
>
> Recently, we observed an issue in our production environment where the 
> BalancedProviderFuture.sync method gets stuck forever during connection 
> recovery and never returns. We have observed this on 2 hosts in the last week; 
> the only solution is to restart the server.
> I am attaching the thread dump which indicates the issue and how it blocks 
> other threads, [^thread-dump.txt] will have details of all the threads.
> h3. Details of Investigation
>  * This issue is happening on connection recovery during failover from one 
> server to another.
>  * By debugging I can see that BalancedProviderFuture.sync method is waiting 
> for its state to be updated, and its state is updated by AmqpProvider thread. 
> In thread dump I don't see any AmqpProvider thread which is in stuck state 
> which indicates that AmqpProvider has done its job but still the state for 
> given BalancedProviderFuture object is not updated.
>  * In the successful event, I can see that the state of 
> BalancedProviderFuture object is updated in below sequence:
>  ** JmsSession.onConnectionRecovery method calls provider.create after 
> creating BalancedProviderFuture object.
>  ** provider.create (aka AmqpProvider.create) starts a thread using the 
> serializer. This create method has proper handling and either calls 
> pumpToProtonTransport OR request.onFailure (which will update the state of 
> BalancedProviderFuture in case of an exception).
>  ** Once the above thread has finished (basically after 
> pumpToProtonTransport), the serializer will call the AmqpProvider.onData 
> method, which will update the state of the BalancedProviderFuture object.
>  * I have observed that if an exception is thrown in the AmqpProvider.onData 
> method, then the state of BalancedProviderFuture is not updated and the 
> BalancedProviderFuture.sync method gets stuck forever. The exception can occur 
> when the protonTransport tail is already closed (probably because of an idle 
> timeout OR any other transport related issue).
>  * I have also observed that in some cases(of idle timeout OR transport 
> errors) after completion of a thread which was started by provider.create 
> (aka AmqpProvider.create), the serializer is not calling AmqpProvider.onData 
> but instead it calls AmqpProvider.onTransportError OR 
> AmqpProvider.onTransportClosed and I can not see any handling of updating the 
> state of BalancedProviderFuture object in onTransportError OR 
> onTransportClosed method.
>  * I am attaching some [^logs.txt] which show some errors. These errors occurred 
> when the state of BalancedProviderFuture was not updated and the sync method got 
> stuck forever.
>  * Please note we are using the URL 
> failover:(amqp://localhost:5672,amqp://localhost:5682)?jms.sendTimeout=5000 
> and qpid version 0.42.0.
> I have found two old tickets, QPIDJMS-458 & QPIDJMS-464, which show a 
> similar issue, but I believe this issue is different and might need to be 
> fixed separately.
> Can someone please take a look at this, as it has become a critical issue in 
> our production environment and we don't have any option except restarting our 
> services?
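
To illustrate the failure mode described in the report, here is a minimal,
self-contained sketch. It is not Qpid code: SketchFuture and StuckSyncSketch
are hypothetical stand-ins for a request future and the I/O callback. It shows
a waiter whose completion is skipped on an error path, and how a bounded wait
(the role a request timeout would play) lets the caller give up instead of
blocking forever.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a request future; not the client's implementation.
final class SketchFuture {
    private final CountDownLatch done = new CountDownLatch(1);
    private volatile Throwable failure;

    void onSuccess() { done.countDown(); }

    void onFailure(Throwable error) {
        failure = error;
        done.countDown();
    }

    boolean failed() { return failure != null; }

    // Bounded wait: returns false if nothing completed the future in time.
    boolean sync(long amount, TimeUnit unit) throws InterruptedException {
        return done.await(amount, unit);
    }
}

final class StuckSyncSketch {

    // Simulates a callback that hits an error. When notifyOnError is false the
    // future is never completed, which mirrors the behaviour in the report.
    static boolean runScenario(boolean notifyOnError) {
        SketchFuture request = new SketchFuture();
        Thread io = new Thread(() -> {
            try {
                throw new IllegalStateException("transport tail already closed");
            } catch (IllegalStateException e) {
                if (notifyOnError) {
                    request.onFailure(e); // error path that signals the waiter
                }
                // otherwise the error is swallowed and the waiter is never told
            }
        });
        io.start();
        try {
            io.join();
            return request.sync(200, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("completed without notify: " + runScenario(false));
        System.out.println("completed with notify:    " + runScenario(true));
    }
}
```

With an unbounded wait in place of the 200 ms bound, the first scenario would
never return, which matches the stuck sync seen in the thread dump.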



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
