[ 
https://issues.apache.org/jira/browse/AMQCPP-646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Pohl updated AMQCPP-646:
-------------------------------
    Attachment: iotransport.patch
                transportFilter.patch

> Segmentation fault in TransportFilter & FailoverTransportListener
> -----------------------------------------------------------------
>
>                 Key: AMQCPP-646
>                 URL: https://issues.apache.org/jira/browse/AMQCPP-646
>             Project: ActiveMQ C++ Client
>          Issue Type: Bug
>          Components: Transports
>    Affects Versions: 3.9.3, 3.9.4
>         Environment: Red Hat Enterprise 7.6
>            Reporter: Hassan Raza
>            Assignee: Timothy A. Bish
>            Priority: Critical
>         Attachments: code.26.11.2020.patch, code.patch, iotransport.patch, 
> transportFilter.patch
>
>
> 26/11/2020 UPDATE:
> We ran into segmentation faults even with our patched library, which looked 
> like this:
> -|/.../libactivemq-cpp.so.19
>  -|[18] : decaf::util::concurrent::Lock::lock()+0x1f
>  -|[19] : 
> decaf::util::concurrent::Lock::Lock(decaf::util::concurrent::Synchronizable*, 
> bool)+0x55
>  -|[20] : 
> activemq::core::ActiveMQConnection::onCommand(decaf::lang::Pointer<activemq::commands::Command,
>  decaf::util::concurrent::atomic::AtomicRefCounter>)+0x104
>  -|[21] : 
> activemq::transport::TransportFilter::onCommand(decaf::lang::Pointer<activemq::commands::Command,
>  decaf::util::concurrent::atomic::AtomicRefCounter>)+0xb5
>  -|[22] : 
> activemq::transport::correlator::ResponseCorrelator::onCommand(decaf::lang::Pointer<activemq::commands::Command,
>  decaf::util::concurrent::atomic::AtomicRefCounter>)+0xac
>  -|[23] : 
> activemq::transport::failover::FailoverTransportListener::onCommand(decaf::lang::Pointer<activemq::commands::Command,
>  decaf::util::concurrent::atomic::AtomicRefCounter>)+0x305
>  -|[24] : 
> activemq::transport::TransportFilter::onCommand(decaf::lang::Pointer<activemq::commands::Command,
>  decaf::util::concurrent::atomic::AtomicRefCounter>)+0xb5
>  -|[25] : 
> activemq::wireformat::openwire::OpenWireFormatNegotiator::onCommand(decaf::lang::Pointer<activemq::commands::Command,
>  decaf::util::concurrent::atomic::AtomicRefCounter>)+0xe6
>  -|[26] : 
> activemq::transport::TransportFilter::onCommand(decaf::lang::Pointer<activemq::commands::Command,
>  decaf::util::concurrent::atomic::AtomicRefCounter>)+0xb5
>  -|[27] : 
> activemq::transport::inactivity::InactivityMonitor::onCommand(decaf::lang::Pointer<activemq::commands::Command,
>  decaf::util::concurrent::atomic::AtomicRefCounter>)+0x300
>  -|[28] : 
> activemq::transport::TransportFilter::onCommand(decaf::lang::Pointer<activemq::commands::Command,
>  decaf::util::concurrent::atomic::AtomicRefCounter>)+0xb5
>  -|[29] : 
> activemq::transport::IOTransport::fire(decaf::lang::Pointer<activemq::commands::Command,
>  decaf::util::concurrent::atomic::AtomicRefCounter>)+0x9a
>  -|[30] : activemq::transport::IOTransport::run()+0xd0
>  -|[31] : +0xb7fc12
>  -|[32] : +0xb8042a
>  
> After comparing with the java equivalent of FailoverTransport, we noticed 
> that there was a discrepancy in how the disposed listener works. It seems 
> that the disposed listener is not initialized at all, which leads to the 
> transport listener not putting this fallback in place, leaving dangling 
> pointers to itself in other elements in the chain. We now ensure that there 
> is always a valid disposed listener which will serve as the sink for any 
> leftover calls. This new patch is attached.
>  
> 26/11/2020 END OF UPDATE
> ___________________________________________
> We're getting the following segmentation faults on a productive system with 
> 160 cores on a fairly regular basis:
> {quote}-|/export/..../lib/libactivemq-cpp.so.19
> {{-|[7] : 
> activemq::transport::failover::FailoverTransportListener::onCommand(decaf::lang::Pointer<activemq::commands::Command,
>  decaf::util::concurrent::atomic::AtomicRefCounter>)+0x175}}
>  {{ -|[8] : 
> activemq::transport::TransportFilter::onCommand(decaf::lang::Pointer<activemq::commands::Command,
>  decaf::util::concurrent::atomic::AtomicRefCounter>)+0xc2}}
>  {{ -|[9] : 
> activemq::wireformat::openwire::OpenWireFormatNegotiator::onCommand(decaf::lang::Pointer<activemq::commands::Command,
>  decaf::util::concurrent::atomic::AtomicRefCounter>)+0xe6}}
>  {{ -|[10] : 
> activemq::transport::TransportFilter::onCommand(decaf::lang::Pointer<activemq::commands::Command,
>  decaf::util::concurrent::atomic::AtomicRefCounter>)+0xc2}}
>  {{ -|[11] : 
> activemq::transport::inactivity::InactivityMonitor::onCommand(decaf::lang::Pointer<activemq::commands::Command,
>  decaf::util::concurrent::atomic::AtomicRefCounter>)+0x292}}
>  {{ -|[12] : 
> activemq::transport::TransportFilter::onCommand(decaf::lang::Pointer<activemq::commands::Command,
>  decaf::util::concurrent::atomic::AtomicRefCounter>)+0xc2}}
>  {{ -|[13] : 
> activemq::transport::IOTransport::(fire(decaf::lang::Pointer<activemq::commands::Command,
>  decaf::util::concurrent::atomic::AtomicRefCounter>)+0x9c}}
>  {{ -|[14] : activemq::transport::IOTransport::run()+0xb1}}
> {quote}
> We've narrowed the problem down to the call to 
> Transport::setTransportListener(NULL) in TransportFilter::close() - it seems 
> to be the same race condition reported with:
> [https://jira.apache.org/jira/browse/AMQCPP-534]
> and
> [https://jira.apache.org/jira/browse/AMQCPP-583]
>  I'm guessing the fix will also be similar - a basic patch suggestion is 
> attached. You may want to put the static DefaultTransportListener in a common 
> place.
> However, what is more worrying is that there appear to be several race 
> conditions related to FailoverTransport::getTransportListener(). It provides 
> mutex based access to the current listener, but all subsequent access to the 
> raw pointer is unprotected. We had a concrete case in 
> FailoverTransportListener::onCommand(). There are two calls to 
> parent->getTransportListener() - one for the null ptr check, the second to 
> invoke the command. In our seg fault, we observed that the first call 
> returned a valid ptr, but the second call didn't (or returned null). Our 
> explanation is that since getting and setting the TransportListener is 
> protected by a mutex, another thread was able to get a hold of the mutex 
> between the two calls and set it to 
> FailoverTransport::setTransportListener(NULL).
> Perhaps a much better way of handling this situation would be to use 
> shared_ptr instead of a raw pointer for:
> {{FailoverTransportImpl::TransportListener* transportListener;}}
> {{std::shared_ptr<TransportListener> 
> FailoverTransport::getTransportListener()}}
> {{FailoverTransport::setTransportListener(std::shared_ptr<TransportListener> 
> listener)}}
> This way, a thread could set a new listener, and any threads still 
> referencing the old listener would gracefully fail. We looked into providing 
> a patch for that, but realized that would be a much bigger change involving 
> ActiveMQConnection deriving from std::enable_shared_from_this, which would 
> also mean replacing decaf::lang::Pointer by std::shared_ptr... Without 
> knowing why decaf::lang::Pointer exists in the first place, it would be 
> difficult to proceed without additional clarifications.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to