[ https://issues.apache.org/jira/browse/PROTON-1999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16758319#comment-16758319 ]

Jeremy commented on PROTON-1999:
--------------------------------

Hello [~ODelbeke] and [~cliffjansen],

[~ODelbeke]: Before going into my analysis, could you please attach the gdb 
stacks for the other threads as well? In particular, I would like to see what 
is happening in the main thread.

In fact, we are facing the same randomness problem, even though we are using a 
pointer to a work queue. I have been debugging it for a couple of days now, and 
I suspect the problem comes from proton's memory management. When there are no 
exceptions, everything runs smoothly; as soon as exceptions start occurring, we 
start getting segfaults. On proton container errors, we stop the container and 
join the thread, while in the meantime the main thread propagates the proton 
error by throwing it as an exception (interrupting the normal flow and rolling 
back). We took care of ensuring the following order of construction/destruction 
of proton objects through a RAII object we created:

Construction:
 * Create the handler
 * Create the container
 * Run the container in a new thread (we only call run in the new thread)
 * Use the handler, which can store proton objects (sender, receiver, trackers, 
and pointers to deliveries)

Destruction:
 * Release the stored proton objects from the handler (sender.close(), 
receiver.close(), empty the queue of trackers and deliveries)
 * Join the thread, i.e. wait for the run method to exit
 * Destroy the container
 * Destroy the handler

Even then, the segfaults persisted.
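
For reference, here is a minimal sketch of that RAII object. The class and 
member names are hypothetical (not our real code); it only illustrates the 
construction/destruction ordering listed above, and assumes SenderHandler 
exposes a close() that releases its stored proton objects and that 
container::run() returns once everything has been closed:
{code}
// Hypothetical sketch of the RAII object described above; names are not from
// our real code base. Only the ordering is shown, not error handling.
#include <thread>
#include <proton/container.hpp>

class ContainerRunner
{
public:
   ContainerRunner()
      : m_handler(),                                // 1. create the handler
        m_container(m_handler),                     // 2. create the container
        m_thread([this] { m_container.run(); })     // 3. call run() in a new thread
   {}

   ~ContainerRunner()
   {
      m_handler.close();   // release sender/receiver, trackers and deliveries
      m_thread.join();     // wait for run() to exit
      // Members are then destroyed in reverse declaration order:
      // m_thread, then m_container, then m_handler.
   }

private:
   SenderHandler m_handler;        // our messaging_handler subclass
   proton::container m_container;
   std::thread m_thread;
};
{code}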

Scenario:

We have three threads: the main thread, the proton container thread for the 
sender, and a proton container thread for a broker.

In the proton handler, we have a send method that looks just like the examples 
above, with the additional twist that it can throw an exception in the main 
thread. We want to keep the tracker for further processing later. The code 
looks like this:
{code}
void SenderHandler::send(proton::message m)
{
...
   std::promise<proton::tracker> messageWillBeSent;
   m_senderWorkQueue->add([&]{
      // Runs on the container thread; hands the tracker back to the caller.
      messageWillBeSent.set_value(m_sender.send(m));
   });
   auto tracker = messageWillBeSent.get_future().get();

   // Blocks until the tracker settles; checks for errors reported by proton
   // and throws an exception in the main thread if an error did occur.
   waitForTrackerSettle(timeout);
}
{code}
 In our case, we are simulating a problem with the broker. The send therefore 
fails, and an exception is thrown from the waitForTrackerSettle method.
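
We did not show waitForTrackerSettle above; roughly, it looks like the 
following simplified sketch, assuming the handler records settlement and errors 
from proton's callbacks (the members m_mutex, m_cv, m_settled and m_error are 
hypothetical, not our exact code):
{code}
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <stdexcept>
#include <string>
#include <proton/messaging_handler.hpp>
#include <proton/tracker.hpp>

// Assumed SenderHandler members:
//   std::mutex m_mutex;
//   std::condition_variable m_cv;
//   bool m_settled = false;
//   std::string m_error;

// Runs on the container thread when the tracker settles.
void SenderHandler::on_tracker_settle(proton::tracker &)
{
   std::lock_guard<std::mutex> lock(m_mutex);
   m_settled = true;
   m_cv.notify_all();
}

// Runs on the container thread when proton reports an error.
void SenderHandler::on_error(const proton::error_condition &e)
{
   std::lock_guard<std::mutex> lock(m_mutex);
   m_error = e.what();
   m_cv.notify_all();
}

// Called from the main thread; throws if proton reported an error or the
// tracker did not settle within the timeout.
void SenderHandler::waitForTrackerSettle(std::chrono::milliseconds timeout)
{
   std::unique_lock<std::mutex> lock(m_mutex);
   bool done = m_cv.wait_for(lock, timeout,
                             [this] { return m_settled || !m_error.empty(); });
   if (!m_error.empty())
      throw std::runtime_error(m_error);
   if (!done)
      throw std::runtime_error("timed out waiting for the tracker to settle");
}
{code}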

The main thread then starts to unwind, beginning with the destruction of the 
tracker. In the meantime, the proton container thread, which hit an error and 
propagated it to the main thread, was finishing the run method and exiting. 
Both threads are manipulating the reference counts of the same objects, and I 
suspect a race condition. Taking a look at the reference counting mechanism in 
proton 
([object.c|https://github.com/apache/qpid-proton/blob/0.26.0/c/src/core/object/object.c]),
 I see that the operations on the reference counters are not atomic. In C++, 
shared_ptr reference counter operations are atomic (see 
[shared_ptr_base.h|https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/include/bits/shared_ptr_base.h]).
 I strongly suspect that proton's non-atomic reference counting is not safe 
here.
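
To make the point concrete, here is a small, self-contained demo (independent 
of proton) showing that concurrent increments on a plain int lose updates, 
while std::atomic<int> does not:
{code}
#include <atomic>
#include <iostream>
#include <thread>

int main()
{
   int plainCount = 0;                  // like a plain int refcount
   std::atomic<int> atomicCount{0};     // what I propose to test instead

   auto bump = [&] {
      for (int i = 0; i < 1000000; ++i) {
         ++plainCount;                  // data race: increments can be lost
         ++atomicCount;                 // atomic read-modify-write: never lost
      }
   };

   std::thread t1(bump), t2(bump);
   t1.join();
   t2.join();

   // "plainCount" usually ends up below 2000000; "atomicCount" is always exactly 2000000.
   std::cout << "plain=" << plainCount
             << " atomic=" << atomicCount.load() << '\n';
   return 0;
}
{code}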

We get these cores randomly, with stacks that look exactly like the one you 
attached (the main thread waiting in thread.join()). Replying to [~ODelbeke]'s 
remark, "However, I still don't really understand why it solves the problem": 
we noticed that the smallest change in the code results in a different stack 
(sometimes the destructor of the connection, other times the destructors of 
trackers, senders, ...), so I'm not sure the result you're getting now isn't 
simply random.

[~cliffjansen] You probably understand the inner workings of proton's memory 
management model better than I do. Were races on the reference counters taken 
into account in its design?

I will be testing a proton patch locally that substitutes std::atomic<int> for 
the plain int reference counter.
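
To be clear about what I mean by the substitution, here is a generic sketch of 
an atomically reference-counted object in C++. This is not proton's actual 
object.c, just the shape of the change I want to try (the real code paths are 
pn_incref/pn_decref and the refcount they manipulate):
{code}
#include <atomic>

// Generic sketch, not proton's object.c: a header carrying an atomic
// reference counter, with incref/decref analogous to pn_incref/pn_decref.
struct refcounted_head {
   std::atomic<int> refcount{1};   // std::atomic<int> instead of a plain int
};

inline void incref(refcounted_head *h)
{
   // Atomic read-modify-write; safe even if two threads incref concurrently.
   h->refcount.fetch_add(1, std::memory_order_relaxed);
}

inline void decref(refcounted_head *h)
{
   // The thread that drops the count to zero frees the object; acquire/release
   // ordering makes prior writes visible to that thread before deletion.
   if (h->refcount.fetch_sub(1, std::memory_order_acq_rel) == 1) {
      delete h;
   }
}
{code}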

> [c] Crash in pn_connection_finalize
> -----------------------------------
>
>                 Key: PROTON-1999
>                 URL: https://issues.apache.org/jira/browse/PROTON-1999
>             Project: Qpid Proton
>          Issue Type: Bug
>          Components: cpp-binding, proton-c
>    Affects Versions: proton-c-0.26.0
>         Environment: Linux 64-bits (Ubuntu 16.04 and Oracle Linux 7.4)
>            Reporter: Olivier Delbeke
>            Assignee: Cliff Jansen
>            Priority: Major
>         Attachments: call_stack.txt, example2.cpp, log.txt, main.cpp, 
> run_qpid-broker.sh
>
>
> Here is my situation : I have several proton::containers (~20). 
> Each one has its own proton::messaging_handler, and handles one 
> proton::connection to a local qpid-broker (everything runs on the same Linux 
> machine).
> 20 x ( one container with one handler with one connection with one link)
> Some containers/connections/handlers work in send mode ; they have one link 
> that is a proton::sender.
> Some containers/connections/handlers work in receive mode ; they have one 
> link that is a proton::receiver. Each time they receive an input message, 
> they do some processing on it, and finally add a "sender->send()" task to the 
> work queue of some sender handlers ( by calling work_queue()->add( [=] \{ 
> sender->send(msg); } as shown in the multi-threading examples).
> This works fine for some time (tens of thousands of messages, several minutes 
> or hours), but eventually crashes, either with a SEGFAULT (when the 
> qpid-proton lib is compiled in release mode) or with an assert (in debug 
> mode), in qpid-proton/c/src/core/engine.c line 483, 
> assert(!conn->transport->referenced) in function pn_connection_finalize().
> The proton logs (activated with export PN_TRACE_FRM=1) do not show anything 
> abnormal (no loss of connection, no rejection of messages, no timeouts, ...).
> As the connection is not closed, I wonder why pn_connection_finalize() would 
> be called in the first place.
> I joined the logs and the call trace.
> Happens on 0.26.0 but also reproduced with the latest master (Jan 28, 2019).
>  
>  
