Hassan Raza created AMQCPP-646:
----------------------------------
Summary: Segmentation fault in TransportFilter &
FailoverTransportListener
Key: AMQCPP-646
URL: https://issues.apache.org/jira/browse/AMQCPP-646
Project: ActiveMQ C++ Client
Issue Type: Bug
Components: Transports
Affects Versions: 3.9.4, 3.9.3
Environment: We're using Red Hat Enterprise 7.6
Reporter: Hassan Raza
Assignee: Timothy Bish
Attachments: code.patch
We're getting the following segmentation faults on a productive system with 160
cores on a fairly regular basis:
{quote}-|/export/..../lib/libactivemq-cpp.so.19
{{-|[7] :
activemq::transport::failover::FailoverTransportListener::onCommand(decaf::lang::Pointer<activemq::commands::Command,
decaf::util::concurrent::atomic::AtomicRefCounter>)+0x175}}
{{ -|[8] :
activemq::transport::TransportFilter::onCommand(decaf::lang::Pointer<activemq::commands::Command,
decaf::util::concurrent::atomic::AtomicRefCounter>)+0xc2}}
{{ -|[9] :
activemq::wireformat::openwire::OpenWireFormatNegotiator::onCommand(decaf::lang::Pointer<activemq::commands::Command,
decaf::util::concurrent::atomic::AtomicRefCounter>)+0xe6}}
{{ -|[10] :
activemq::transport::TransportFilter::onCommand(decaf::lang::Pointer<activemq::commands::Command,
decaf::util::concurrent::atomic::AtomicRefCounter>)+0xc2}}
{{ -|[11] :
activemq::transport::inactivity::InactivityMonitor::onCommand(decaf::lang::Pointer<activemq::commands::Command,
decaf::util::concurrent::atomic::AtomicRefCounter>)+0x292}}
{{ -|[12] :
activemq::transport::TransportFilter::onCommand(decaf::lang::Pointer<activemq::commands::Command,
decaf::util::concurrent::atomic::AtomicRefCounter>)+0xc2}}
{{ -|[13] :
activemq::transport::IOTransport::(fire(decaf::lang::Pointer<activemq::commands::Command,
decaf::util::concurrent::atomic::AtomicRefCounter>)+0x9c}}
{{ -|[14] : activemq::transport::IOTransport::run()+0xb1}}
{quote}
We've narrowed the problem down to the call to
Transport::setTransportListener(NULL) in TransportFilter::close() - it seems to
be the same race condition reported with:
https://jira.apache.org/jira/browse/AMQCPP-534
and
https://jira.apache.org/jira/browse/AMQCPP-583
I'm guessing the fix will also be similar - a basic patch suggestion is
attached. You may want to put the static DefaultTransportListener in a common
place.
However, what is more worrying is that there appear to be several race
conditions related to FailoverTransport::getTransportListener(). It provides
mutex based access to the current listener, but all subsequent access to the
raw pointer is unprotected. We had a concrete case in
FailoverTransportListener::onCommand(). There are two calls to
parent->getTransportListener() - one for the null ptr check, the second to
invoke the command. In our seg fault, we observed that the first call returned
a valid ptr, but the second call didn't (or returned null). Our explanation is
that since getting and setting the TransportListener is protected by a mutex,
another thread was able to get a hold of the mutex between the two calls and
set it to FailoverTransport::setTransportListener(NULL).
Perhaps a much better way of handling this situation would be to use shared_ptr
instead of a raw pointer for:
{{FailoverTransportImpl::TransportListener*
transportListener;}}{{std::shared_ptr<TransportListener>
FailoverTransport::getTransportListener()}}{{FailoverTransport::setTransportListener(std::shared_ptr<TransportListener>
listener)}}
This way, a thread could set a new listener, and any threads still referencing
the old listener would gracefully fail. We looked into providing a patch for
that, but realized that would be a much bigger change involving
ActiveMQConnection deriving from std::enable_shared_from_this, which would also
mean replacing decaf::lang::Pointer by std::shared_ptr... Without knowing why
decaf::lang::Pointer exists in the first place, it would be difficult to
proceed without additional clarifications.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)