[jira] [Updated] (AMQCPP-646) Segmentation fault in TransportFilter & FailoverTransportListener

Hassan Raza (Jira) Thu, 26 Nov 2020 04:49:34 -0800


     [ 
https://issues.apache.org/jira/browse/AMQCPP-646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Hassan Raza updated AMQCPP-646:
-------------------------------
    Description: 
26/11/2020 UPDATE:

We ran into segmentation faults even with our patched library, which looked 
like this:

-|/apps/xentis/packages/XENTIS/lib/libactivemq-cpp.so.19
-|[18] : decaf::util::concurrent::Lock::lock()+0x1f
-|[19] : 
decaf::util::concurrent::Lock::Lock(decaf::util::concurrent::Synchronizable*, 
bool)+0x55
-|[20] : 
activemq::core::ActiveMQConnection::onCommand(decaf::lang::Pointer<activemq::commands::Command,
 decaf::util::concurrent::atomic::AtomicRefCounter>)+0x104
-|[21] : 
activemq::transport::TransportFilter::onCommand(decaf::lang::Pointer<activemq::commands::Command,
 decaf::util::concurrent::atomic::AtomicRefCounter>)+0xb5
-|[22] : 
activemq::transport::correlator::ResponseCorrelator::onCommand(decaf::lang::Pointer<activemq::commands::Command,
 decaf::util::concurrent::atomic::AtomicRefCounter>)+0xac
-|[23] : 
activemq::transport::failover::FailoverTransportListener::onCommand(decaf::lang::Pointer<activemq::commands::Command,
 decaf::util::concurrent::atomic::AtomicRefCounter>)+0x305
-|[24] : 
activemq::transport::TransportFilter::onCommand(decaf::lang::Pointer<activemq::commands::Command,
 decaf::util::concurrent::atomic::AtomicRefCounter>)+0xb5
-|[25] : 
activemq::wireformat::openwire::OpenWireFormatNegotiator::onCommand(decaf::lang::Pointer<activemq::commands::Command,
 decaf::util::concurrent::atomic::AtomicRefCounter>)+0xe6
-|[26] : 
activemq::transport::TransportFilter::onCommand(decaf::lang::Pointer<activemq::commands::Command,
 decaf::util::concurrent::atomic::AtomicRefCounter>)+0xb5
-|[27] : 
activemq::transport::inactivity::InactivityMonitor::onCommand(decaf::lang::Pointer<activemq::commands::Command,
 decaf::util::concurrent::atomic::AtomicRefCounter>)+0x300
-|[28] : 
activemq::transport::TransportFilter::onCommand(decaf::lang::Pointer<activemq::commands::Command,
 decaf::util::concurrent::atomic::AtomicRefCounter>)+0xb5
-|[29] : 
activemq::transport::IOTransport::fire(decaf::lang::Pointer<activemq::commands::Command,
 decaf::util::concurrent::atomic::AtomicRefCounter>)+0x9a
-|[30] : activemq::transport::IOTransport::run()+0xd0
-|[31] : +0xb7fc12
-|[32] : +0xb8042a

 

After comparing with the java equivalent of FailoverTransport, we noticed that 
there was a discrepancy in how the disposed listener works. It seems that the 
disposed listener is not initialized at all, which leads to the transport 
listener not putting this fallback in place, leaving dangling pointers to 
itself in other elements in the chain. We now ensure that there is always a 
valid disposed listener which will serve as the sink for any leftover calls. 
This new patch is attached.

 

26/11/2020 END OF UPDATE

___________________________________________

We're getting the following segmentation faults on a productive system with 160 
cores on a fairly regular basis:
{quote}-|/export/..../lib/libactivemq-cpp.so.19

{{-|[7] : 
activemq::transport::failover::FailoverTransportListener::onCommand(decaf::lang::Pointer<activemq::commands::Command,
 decaf::util::concurrent::atomic::AtomicRefCounter>)+0x175}}
 {{ -|[8] : 
activemq::transport::TransportFilter::onCommand(decaf::lang::Pointer<activemq::commands::Command,
 decaf::util::concurrent::atomic::AtomicRefCounter>)+0xc2}}
 {{ -|[9] : 
activemq::wireformat::openwire::OpenWireFormatNegotiator::onCommand(decaf::lang::Pointer<activemq::commands::Command,
 decaf::util::concurrent::atomic::AtomicRefCounter>)+0xe6}}
 {{ -|[10] : 
activemq::transport::TransportFilter::onCommand(decaf::lang::Pointer<activemq::commands::Command,
 decaf::util::concurrent::atomic::AtomicRefCounter>)+0xc2}}
 {{ -|[11] : 
activemq::transport::inactivity::InactivityMonitor::onCommand(decaf::lang::Pointer<activemq::commands::Command,
 decaf::util::concurrent::atomic::AtomicRefCounter>)+0x292}}
 {{ -|[12] : 
activemq::transport::TransportFilter::onCommand(decaf::lang::Pointer<activemq::commands::Command,
 decaf::util::concurrent::atomic::AtomicRefCounter>)+0xc2}}
 {{ -|[13] : 
activemq::transport::IOTransport::(fire(decaf::lang::Pointer<activemq::commands::Command,
 decaf::util::concurrent::atomic::AtomicRefCounter>)+0x9c}}
 {{ -|[14] : activemq::transport::IOTransport::run()+0xb1}}
{quote}
We've narrowed the problem down to the call to 
Transport::setTransportListener(NULL) in TransportFilter::close() - it seems to 
be the same race condition reported with:

[https://jira.apache.org/jira/browse/AMQCPP-534]

and

[https://jira.apache.org/jira/browse/AMQCPP-583]

 I'm guessing the fix will also be similar - a basic patch suggestion is 
attached. You may want to put the static DefaultTransportListener in a common 
place.

However, what is more worrying is that there appear to be several race 
conditions related to FailoverTransport::getTransportListener(). It provides 
mutex based access to the current listener, but all subsequent access to the 
raw pointer is unprotected. We had a concrete case in 
FailoverTransportListener::onCommand(). There are two calls to 
parent->getTransportListener() - one for the null ptr check, the second to 
invoke the command. In our seg fault, we observed that the first call returned 
a valid ptr, but the second call didn't (or returned null). Our explanation is 
that since getting and setting the TransportListener is protected by a mutex, 
another thread was able to get a hold of the mutex between the two calls and 
set it to FailoverTransport::setTransportListener(NULL).

Perhaps a much better way of handling this situation would be to use shared_ptr 
instead of a raw pointer for:

{{FailoverTransportImpl::TransportListener* transportListener;}}

{{std::shared_ptr<TransportListener> FailoverTransport::getTransportListener()}}

{{FailoverTransport::setTransportListener(std::shared_ptr<TransportListener> 
listener)}}

This way, a thread could set a new listener, and any threads still referencing 
the old listener would gracefully fail. We looked into providing a patch for 
that, but realized that would be a much bigger change involving 
ActiveMQConnection deriving from std::enable_shared_from_this, which would also 
mean replacing decaf::lang::Pointer by std::shared_ptr... Without knowing why 
decaf::lang::Pointer exists in the first place, it would be difficult to 
proceed without additional clarifications.

  was:
We're getting the following segmentation faults on a productive system with 160 
cores on a fairly regular basis:
{quote}-|/export/..../lib/libactivemq-cpp.so.19

{{-|[7] : 
activemq::transport::failover::FailoverTransportListener::onCommand(decaf::lang::Pointer<activemq::commands::Command,
 decaf::util::concurrent::atomic::AtomicRefCounter>)+0x175}}
 {{ -|[8] : 
activemq::transport::TransportFilter::onCommand(decaf::lang::Pointer<activemq::commands::Command,
 decaf::util::concurrent::atomic::AtomicRefCounter>)+0xc2}}
 {{ -|[9] : 
activemq::wireformat::openwire::OpenWireFormatNegotiator::onCommand(decaf::lang::Pointer<activemq::commands::Command,
 decaf::util::concurrent::atomic::AtomicRefCounter>)+0xe6}}
 {{ -|[10] : 
activemq::transport::TransportFilter::onCommand(decaf::lang::Pointer<activemq::commands::Command,
 decaf::util::concurrent::atomic::AtomicRefCounter>)+0xc2}}
 {{ -|[11] : 
activemq::transport::inactivity::InactivityMonitor::onCommand(decaf::lang::Pointer<activemq::commands::Command,
 decaf::util::concurrent::atomic::AtomicRefCounter>)+0x292}}
 {{ -|[12] : 
activemq::transport::TransportFilter::onCommand(decaf::lang::Pointer<activemq::commands::Command,
 decaf::util::concurrent::atomic::AtomicRefCounter>)+0xc2}}
 {{ -|[13] : 
activemq::transport::IOTransport::(fire(decaf::lang::Pointer<activemq::commands::Command,
 decaf::util::concurrent::atomic::AtomicRefCounter>)+0x9c}}
 {{ -|[14] : activemq::transport::IOTransport::run()+0xb1}}
{quote}
We've narrowed the problem down to the call to 
Transport::setTransportListener(NULL) in TransportFilter::close() - it seems to 
be the same race condition reported with:

https://jira.apache.org/jira/browse/AMQCPP-534

and

https://jira.apache.org/jira/browse/AMQCPP-583

 I'm guessing the fix will also be similar - a basic patch suggestion is 
attached. You may want to put the static DefaultTransportListener in a common 
place.

However, what is more worrying is that there appear to be several race 
conditions related to FailoverTransport::getTransportListener(). It provides 
mutex based access to the current listener, but all subsequent access to the 
raw pointer is unprotected. We had a concrete case in 
FailoverTransportListener::onCommand(). There are two calls to 
parent->getTransportListener() - one for the null ptr check, the second to 
invoke the command. In our seg fault, we observed that the first call returned 
a valid ptr, but the second call didn't (or returned null). Our explanation is 
that since getting and setting the TransportListener is protected by a mutex, 
another thread was able to get a hold of the mutex between the two calls and 
set it to FailoverTransport::setTransportListener(NULL).

Perhaps a much better way of handling this situation would be to use shared_ptr 
instead of a raw pointer for:

{{FailoverTransportImpl::TransportListener* transportListener;}}

{{std::shared_ptr<TransportListener> FailoverTransport::getTransportListener()}}

{{FailoverTransport::setTransportListener(std::shared_ptr<TransportListener> 
listener)}}

This way, a thread could set a new listener, and any threads still referencing 
the old listener would gracefully fail. We looked into providing a patch for 
that, but realized that would be a much bigger change involving 
ActiveMQConnection deriving from std::enable_shared_from_this, which would also 
mean replacing decaf::lang::Pointer by std::shared_ptr... Without knowing why 
decaf::lang::Pointer exists in the first place, it would be difficult to 
proceed without additional clarifications.


> Segmentation fault in TransportFilter & FailoverTransportListener
> -----------------------------------------------------------------
>
>                 Key: AMQCPP-646
>                 URL: https://issues.apache.org/jira/browse/AMQCPP-646
>             Project: ActiveMQ C++ Client
>          Issue Type: Bug
>          Components: Transports
>    Affects Versions: 3.9.3, 3.9.4
>         Environment: Red Hat Enterprise 7.6
>            Reporter: Hassan Raza
>            Assignee: Timothy A. Bish
>            Priority: Critical
>         Attachments: code.patch
>
>
> 26/11/2020 UPDATE:
> We ran into segmentation faults even with our patched library, which looked 
> like this:
> -|/apps/xentis/packages/XENTIS/lib/libactivemq-cpp.so.19
> -|[18] : decaf::util::concurrent::Lock::lock()+0x1f
> -|[19] : 
> decaf::util::concurrent::Lock::Lock(decaf::util::concurrent::Synchronizable*, 
> bool)+0x55
> -|[20] : 
> activemq::core::ActiveMQConnection::onCommand(decaf::lang::Pointer<activemq::commands::Command,
>  decaf::util::concurrent::atomic::AtomicRefCounter>)+0x104
> -|[21] : 
> activemq::transport::TransportFilter::onCommand(decaf::lang::Pointer<activemq::commands::Command,
>  decaf::util::concurrent::atomic::AtomicRefCounter>)+0xb5
> -|[22] : 
> activemq::transport::correlator::ResponseCorrelator::onCommand(decaf::lang::Pointer<activemq::commands::Command,
>  decaf::util::concurrent::atomic::AtomicRefCounter>)+0xac
> -|[23] : 
> activemq::transport::failover::FailoverTransportListener::onCommand(decaf::lang::Pointer<activemq::commands::Command,
>  decaf::util::concurrent::atomic::AtomicRefCounter>)+0x305
> -|[24] : 
> activemq::transport::TransportFilter::onCommand(decaf::lang::Pointer<activemq::commands::Command,
>  decaf::util::concurrent::atomic::AtomicRefCounter>)+0xb5
> -|[25] : 
> activemq::wireformat::openwire::OpenWireFormatNegotiator::onCommand(decaf::lang::Pointer<activemq::commands::Command,
>  decaf::util::concurrent::atomic::AtomicRefCounter>)+0xe6
> -|[26] : 
> activemq::transport::TransportFilter::onCommand(decaf::lang::Pointer<activemq::commands::Command,
>  decaf::util::concurrent::atomic::AtomicRefCounter>)+0xb5
> -|[27] : 
> activemq::transport::inactivity::InactivityMonitor::onCommand(decaf::lang::Pointer<activemq::commands::Command,
>  decaf::util::concurrent::atomic::AtomicRefCounter>)+0x300
> -|[28] : 
> activemq::transport::TransportFilter::onCommand(decaf::lang::Pointer<activemq::commands::Command,
>  decaf::util::concurrent::atomic::AtomicRefCounter>)+0xb5
> -|[29] : 
> activemq::transport::IOTransport::fire(decaf::lang::Pointer<activemq::commands::Command,
>  decaf::util::concurrent::atomic::AtomicRefCounter>)+0x9a
> -|[30] : activemq::transport::IOTransport::run()+0xd0
> -|[31] : +0xb7fc12
> -|[32] : +0xb8042a
>  
> After comparing with the java equivalent of FailoverTransport, we noticed 
> that there was a discrepancy in how the disposed listener works. It seems 
> that the disposed listener is not initialized at all, which leads to the 
> transport listener not putting this fallback in place, leaving dangling 
> pointers to itself in other elements in the chain. We now ensure that there 
> is always a valid disposed listener which will serve as the sink for any 
> leftover calls. This new patch is attached.
>  
> 26/11/2020 END OF UPDATE
> ___________________________________________
> We're getting the following segmentation faults on a productive system with 
> 160 cores on a fairly regular basis:
> {quote}-|/export/..../lib/libactivemq-cpp.so.19
> {{-|[7] : 
> activemq::transport::failover::FailoverTransportListener::onCommand(decaf::lang::Pointer<activemq::commands::Command,
>  decaf::util::concurrent::atomic::AtomicRefCounter>)+0x175}}
>  {{ -|[8] : 
> activemq::transport::TransportFilter::onCommand(decaf::lang::Pointer<activemq::commands::Command,
>  decaf::util::concurrent::atomic::AtomicRefCounter>)+0xc2}}
>  {{ -|[9] : 
> activemq::wireformat::openwire::OpenWireFormatNegotiator::onCommand(decaf::lang::Pointer<activemq::commands::Command,
>  decaf::util::concurrent::atomic::AtomicRefCounter>)+0xe6}}
>  {{ -|[10] : 
> activemq::transport::TransportFilter::onCommand(decaf::lang::Pointer<activemq::commands::Command,
>  decaf::util::concurrent::atomic::AtomicRefCounter>)+0xc2}}
>  {{ -|[11] : 
> activemq::transport::inactivity::InactivityMonitor::onCommand(decaf::lang::Pointer<activemq::commands::Command,
>  decaf::util::concurrent::atomic::AtomicRefCounter>)+0x292}}
>  {{ -|[12] : 
> activemq::transport::TransportFilter::onCommand(decaf::lang::Pointer<activemq::commands::Command,
>  decaf::util::concurrent::atomic::AtomicRefCounter>)+0xc2}}
>  {{ -|[13] : 
> activemq::transport::IOTransport::(fire(decaf::lang::Pointer<activemq::commands::Command,
>  decaf::util::concurrent::atomic::AtomicRefCounter>)+0x9c}}
>  {{ -|[14] : activemq::transport::IOTransport::run()+0xb1}}
> {quote}
> We've narrowed the problem down to the call to 
> Transport::setTransportListener(NULL) in TransportFilter::close() - it seems 
> to be the same race condition reported with:
> [https://jira.apache.org/jira/browse/AMQCPP-534]
> and
> [https://jira.apache.org/jira/browse/AMQCPP-583]
>  I'm guessing the fix will also be similar - a basic patch suggestion is 
> attached. You may want to put the static DefaultTransportListener in a common 
> place.
> However, what is more worrying is that there appear to be several race 
> conditions related to FailoverTransport::getTransportListener(). It provides 
> mutex based access to the current listener, but all subsequent access to the 
> raw pointer is unprotected. We had a concrete case in 
> FailoverTransportListener::onCommand(). There are two calls to 
> parent->getTransportListener() - one for the null ptr check, the second to 
> invoke the command. In our seg fault, we observed that the first call 
> returned a valid ptr, but the second call didn't (or returned null). Our 
> explanation is that since getting and setting the TransportListener is 
> protected by a mutex, another thread was able to get a hold of the mutex 
> between the two calls and set it to 
> FailoverTransport::setTransportListener(NULL).
> Perhaps a much better way of handling this situation would be to use 
> shared_ptr instead of a raw pointer for:
> {{FailoverTransportImpl::TransportListener* transportListener;}}
> {{std::shared_ptr<TransportListener> 
> FailoverTransport::getTransportListener()}}
> {{FailoverTransport::setTransportListener(std::shared_ptr<TransportListener> 
> listener)}}
> This way, a thread could set a new listener, and any threads still 
> referencing the old listener would gracefully fail. We looked into providing 
> a patch for that, but realized that would be a much bigger change involving 
> ActiveMQConnection deriving from std::enable_shared_from_this, which would 
> also mean replacing decaf::lang::Pointer by std::shared_ptr... Without 
> knowing why decaf::lang::Pointer exists in the first place, it would be 
> difficult to proceed without additional clarifications.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Updated] (AMQCPP-646) Segmentation fault in TransportFilter & FailoverTransportListener

Reply via email to