[
https://issues.apache.org/jira/browse/AMQCPP-646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Thomas Pohl updated AMQCPP-646:
-------------------------------
Attachment: iotransport.patch
transportFilter.patch
> Segmentation fault in TransportFilter & FailoverTransportListener
> -----------------------------------------------------------------
>
> Key: AMQCPP-646
> URL: https://issues.apache.org/jira/browse/AMQCPP-646
> Project: ActiveMQ C++ Client
> Issue Type: Bug
> Components: Transports
> Affects Versions: 3.9.3, 3.9.4
> Environment: Red Hat Enterprise 7.6
> Reporter: Hassan Raza
> Assignee: Timothy A. Bish
> Priority: Critical
> Attachments: code.26.11.2020.patch, code.patch, iotransport.patch,
> transportFilter.patch
>
>
> 26/11/2020 UPDATE:
> We ran into segmentation faults even with our patched library, which looked
> like this:
> -|/.../libactivemq-cpp.so.19
> -|[18] : decaf::util::concurrent::Lock::lock()+0x1f
> -|[19] :
> decaf::util::concurrent::Lock::Lock(decaf::util::concurrent::Synchronizable*,
> bool)+0x55
> -|[20] :
> activemq::core::ActiveMQConnection::onCommand(decaf::lang::Pointer<activemq::commands::Command,
> decaf::util::concurrent::atomic::AtomicRefCounter>)+0x104
> -|[21] :
> activemq::transport::TransportFilter::onCommand(decaf::lang::Pointer<activemq::commands::Command,
> decaf::util::concurrent::atomic::AtomicRefCounter>)+0xb5
> -|[22] :
> activemq::transport::correlator::ResponseCorrelator::onCommand(decaf::lang::Pointer<activemq::commands::Command,
> decaf::util::concurrent::atomic::AtomicRefCounter>)+0xac
> -|[23] :
> activemq::transport::failover::FailoverTransportListener::onCommand(decaf::lang::Pointer<activemq::commands::Command,
> decaf::util::concurrent::atomic::AtomicRefCounter>)+0x305
> -|[24] :
> activemq::transport::TransportFilter::onCommand(decaf::lang::Pointer<activemq::commands::Command,
> decaf::util::concurrent::atomic::AtomicRefCounter>)+0xb5
> -|[25] :
> activemq::wireformat::openwire::OpenWireFormatNegotiator::onCommand(decaf::lang::Pointer<activemq::commands::Command,
> decaf::util::concurrent::atomic::AtomicRefCounter>)+0xe6
> -|[26] :
> activemq::transport::TransportFilter::onCommand(decaf::lang::Pointer<activemq::commands::Command,
> decaf::util::concurrent::atomic::AtomicRefCounter>)+0xb5
> -|[27] :
> activemq::transport::inactivity::InactivityMonitor::onCommand(decaf::lang::Pointer<activemq::commands::Command,
> decaf::util::concurrent::atomic::AtomicRefCounter>)+0x300
> -|[28] :
> activemq::transport::TransportFilter::onCommand(decaf::lang::Pointer<activemq::commands::Command,
> decaf::util::concurrent::atomic::AtomicRefCounter>)+0xb5
> -|[29] :
> activemq::transport::IOTransport::fire(decaf::lang::Pointer<activemq::commands::Command,
> decaf::util::concurrent::atomic::AtomicRefCounter>)+0x9a
> -|[30] : activemq::transport::IOTransport::run()+0xd0
> -|[31] : +0xb7fc12
> -|[32] : +0xb8042a
>
> After comparing with the java equivalent of FailoverTransport, we noticed
> that there was a discrepancy in how the disposed listener works. It seems
> that the disposed listener is not initialized at all, which leads to the
> transport listener not putting this fallback in place, leaving dangling
> pointers to itself in other elements in the chain. We now ensure that there
> is always a valid disposed listener which will serve as the sink for any
> leftover calls. This new patch is attached.
>
> 26/11/2020 END OF UPDATE
> ___________________________________________
> We're getting the following segmentation faults on a productive system with
> 160 cores on a fairly regular basis:
> {quote}-|/export/..../lib/libactivemq-cpp.so.19
> {{-|[7] :
> activemq::transport::failover::FailoverTransportListener::onCommand(decaf::lang::Pointer<activemq::commands::Command,
> decaf::util::concurrent::atomic::AtomicRefCounter>)+0x175}}
> {{ -|[8] :
> activemq::transport::TransportFilter::onCommand(decaf::lang::Pointer<activemq::commands::Command,
> decaf::util::concurrent::atomic::AtomicRefCounter>)+0xc2}}
> {{ -|[9] :
> activemq::wireformat::openwire::OpenWireFormatNegotiator::onCommand(decaf::lang::Pointer<activemq::commands::Command,
> decaf::util::concurrent::atomic::AtomicRefCounter>)+0xe6}}
> {{ -|[10] :
> activemq::transport::TransportFilter::onCommand(decaf::lang::Pointer<activemq::commands::Command,
> decaf::util::concurrent::atomic::AtomicRefCounter>)+0xc2}}
> {{ -|[11] :
> activemq::transport::inactivity::InactivityMonitor::onCommand(decaf::lang::Pointer<activemq::commands::Command,
> decaf::util::concurrent::atomic::AtomicRefCounter>)+0x292}}
> {{ -|[12] :
> activemq::transport::TransportFilter::onCommand(decaf::lang::Pointer<activemq::commands::Command,
> decaf::util::concurrent::atomic::AtomicRefCounter>)+0xc2}}
> {{ -|[13] :
> activemq::transport::IOTransport::(fire(decaf::lang::Pointer<activemq::commands::Command,
> decaf::util::concurrent::atomic::AtomicRefCounter>)+0x9c}}
> {{ -|[14] : activemq::transport::IOTransport::run()+0xb1}}
> {quote}
> We've narrowed the problem down to the call to
> Transport::setTransportListener(NULL) in TransportFilter::close() - it seems
> to be the same race condition reported with:
> [https://jira.apache.org/jira/browse/AMQCPP-534]
> and
> [https://jira.apache.org/jira/browse/AMQCPP-583]
> I'm guessing the fix will also be similar - a basic patch suggestion is
> attached. You may want to put the static DefaultTransportListener in a common
> place.
> However, what is more worrying is that there appear to be several race
> conditions related to FailoverTransport::getTransportListener(). It provides
> mutex based access to the current listener, but all subsequent access to the
> raw pointer is unprotected. We had a concrete case in
> FailoverTransportListener::onCommand(). There are two calls to
> parent->getTransportListener() - one for the null ptr check, the second to
> invoke the command. In our seg fault, we observed that the first call
> returned a valid ptr, but the second call didn't (or returned null). Our
> explanation is that since getting and setting the TransportListener is
> protected by a mutex, another thread was able to get a hold of the mutex
> between the two calls and set it to
> FailoverTransport::setTransportListener(NULL).
> Perhaps a much better way of handling this situation would be to use
> shared_ptr instead of a raw pointer for:
> {{FailoverTransportImpl::TransportListener* transportListener;}}
> {{std::shared_ptr<TransportListener>
> FailoverTransport::getTransportListener()}}
> {{FailoverTransport::setTransportListener(std::shared_ptr<TransportListener>
> listener)}}
> This way, a thread could set a new listener, and any threads still
> referencing the old listener would gracefully fail. We looked into providing
> a patch for that, but realized that would be a much bigger change involving
> ActiveMQConnection deriving from std::enable_shared_from_this, which would
> also mean replacing decaf::lang::Pointer by std::shared_ptr... Without
> knowing why decaf::lang::Pointer exists in the first place, it would be
> difficult to proceed without additional clarifications.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)