[ 
https://issues.apache.org/jira/browse/PROTON-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17416970#comment-17416970
 ] 

Jesse Hulsizer commented on PROTON-2432:
----------------------------------------

Some background on our application: it acts as a server for a number of remote 
clients speaking AMQP 1.0, built on the qpid-proton library. The clients send 
and receive messages via AMQP 1.0, and these messages are consolidated and 
relayed to and from a message sink/source (either an AMQP qpid-cpp broker or an 
AMQ Interconnect router). For this test, the number of remote clients is around 
300.

Our application uses a single proton listener and spawns a new 
proton::messaging_handler for each accepted connection. These all run inside a 
single proton container, started with (typically) 8 threads (i.e., 
container.run(8); ).
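
For context, here is a minimal sketch of that topology (class names, the listen 
address, and the cleanup details are illustrative, not our actual code):
{noformat}
#include <proton/connection_options.hpp>
#include <proton/container.hpp>
#include <proton/listen_handler.hpp>
#include <proton/listener.hpp>
#include <proton/messaging_handler.hpp>

// One messaging_handler instance per accepted client connection.
class per_connection_handler : public proton::messaging_handler {
    // on_message() etc. consolidate traffic to and from the sink/source;
    // a real handler also deletes itself when its connection goes away.
};

// The single listener hands each accepted connection its own handler.
class accept_handler : public proton::listen_handler {
    proton::connection_options on_accept(proton::listener&) override {
        return proton::connection_options().handler(*(new per_connection_handler()));
    }
};

int main() {
    proton::container container;
    accept_handler accepts;
    container.listen("0.0.0.0:5672", accepts);  // illustrative address
    container.run(8);                           // 8 container threads
}
{noformat}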

The issue occurs when the message sink is unavailable. When that happens and a 
remote client sends a message to the application, the application does two 
things (sketched below):
 1. The incoming message is rejected via the proton settling mechanism, which 
causes the remote client to issue a connection close on the remote end.
 2. The server application also issues a connection close on the local end, via 
a `container.schedule(IMMEDIATE,...);` call.
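
A hedged sketch of those two steps, assuming the per-connection handler above 
and a sink-availability flag (both illustrative); note that the scheduled 
callback can run on any of the container's threads:
{noformat}
#include <proton/connection.hpp>
#include <proton/container.hpp>
#include <proton/delivery.hpp>
#include <proton/duration.hpp>
#include <proton/message.hpp>
#include <proton/messaging_handler.hpp>
#include <proton/work_queue.hpp>

class per_connection_handler : public proton::messaging_handler {
    bool sink_available_ = false;  // illustrative; real state tracking differs

    void on_message(proton::delivery& d, proton::message&) override {
        if (!sink_available_) {
            // 1. Reject the delivery; the remote client responds by closing
            //    the connection from its end.
            d.reject();

            // 2. Also close the connection locally, via a scheduled call on
            //    the container.  The callback may execute on any container
            //    thread, which is what allows the two closes to collide.
            proton::connection conn = d.connection();
            conn.container().schedule(proton::duration::IMMEDIATE,
                                      [conn]() mutable { conn.close(); });
        } else {
            d.accept();
            // ... consolidate and forward to the sink ...
        }
    }
};
{noformat}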

What we find is that on rare occasions the timing lands just so: the remote 
connection close and the local connection close are processed simultaneously on 
separate threads inside the proton container, triggering the simultaneous pop 
on the collector->pool list described above, with the resulting incorrect size 
of -1 and the subsequent segfault.

We estimate that this occurs on the order of once every 5,000 to 10,000 client 
disconnect actions.

To work around this issue on the application side, we've added a delay to the 
local close operation, giving the remote client time to issue its connection 
close; we only close on the local side if the client has not closed the 
connection within a second.
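
Roughly, the workaround looks like this (again a sketch: the flag, names, and 
handler lifetime handling are simplified):
{noformat}
#include <atomic>
#include <memory>
#include <proton/connection.hpp>
#include <proton/container.hpp>
#include <proton/duration.hpp>
#include <proton/messaging_handler.hpp>
#include <proton/work_queue.hpp>

class per_connection_handler : public proton::messaging_handler {
    std::shared_ptr<std::atomic<bool>> remote_closed_ =
        std::make_shared<std::atomic<bool>>(false);

    void on_connection_close(proton::connection&) override {
        *remote_closed_ = true;               // the client closed first
    }

    // Used instead of the immediate local close after rejecting a message.
    void close_later(proton::connection conn) {
        auto remote_closed = remote_closed_;  // keep the flag alive in the lambda
        conn.container().schedule(proton::duration::SECOND,
            [conn, remote_closed]() mutable {
                if (!*remote_closed)          // only close if the client has not
                    conn.close();             // already closed within the second
            });
    }
};
{noformat}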

We are still working on a simple reproducer, but I've attached a patch file for 
a (hacky?) fix in proton that gives pn_list_t a lock attribute and uses it to 
protect access in the collector->pool case. Hopefully this helps.
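
The patch itself is against proton-c, but the idea is simply mutex-protected 
access to the shared pool list; in C++ terms (purely illustrative, not the 
patch):
{noformat}
#include <mutex>
#include <vector>

// Without the lock, two threads can both observe size == 1, both pop, and
// drive the size to -1, the corruption seen in the gdb output quoted below.
template <class T>
class locked_pool {
  public:
    void add(T v) {
        std::lock_guard<std::mutex> g(lock_);
        items_.push_back(std::move(v));
    }
    bool pop(T& out) {
        std::lock_guard<std::mutex> g(lock_);
        if (items_.empty()) return false;
        out = std::move(items_.back());
        items_.pop_back();
        return true;
    }
  private:
    std::mutex lock_;
    std::vector<T> items_;
};
{noformat}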

> Proton crashes because of a concurrency failure in collector->pool
> ------------------------------------------------------------------
>
>                 Key: PROTON-2432
>                 URL: https://issues.apache.org/jira/browse/PROTON-2432
>             Project: Qpid Proton
>          Issue Type: Bug
>          Components: proton-c
>    Affects Versions: proton-c-0.32.0
>         Environment: RHEL 7 
>            Reporter: Jesse Hulsizer
>            Priority: Major
>
> While running our application tests, our application crashes with many 
> different backtraces that look similar to this...
> {noformat}
> #0  0x0000000000000000 in ?? ()
> #1  0x00007fc777579198 in pn_class_incref () from /usr/lib64/libqpid-proton.so.11
> #2  0x00007fc777587d8a in pn_collector_put () from /usr/lib64/libqpid-proton.so.11
> #3  0x00007fc7775887ea in ?? () from /usr/lib64/libqpid-proton.so.11
> #4  0x00007fc777588c7b in pn_transport_pending () from /usr/lib64/libqpid-proton.so.11
> #5  0x00007fc777588d9e in pn_transport_pop () from /usr/lib64/libqpid-proton.so.11
> #6  0x00007fc777599298 in ?? () from /usr/lib64/libqpid-proton.so.11
> #7  0x00007fc77759a784 in ?? () from /usr/lib64/libqpid-proton.so.11
> #8  0x00007fc7773236f0 in proton::container::impl::thread() () from /usr/lib64/libqpid-proton-cpp.so.12
> #9  0x00007fc7760b2470 in ?? () from /usr/lib64/libstdc++.so.6
> #10 0x00007fc776309aa1 in start_thread () from /lib64/libpthread.so.0
> #11 0x00007fc7758b6bdd in clone () from /lib64/libc.so.6
> {noformat}
> Using gdb to probe one of the backtraces shows that the collector->pool size 
> is -1... (seen here as 18446744073709551615)
> {noformat}
> (gdb) p *collector
> $1 = {pool = 0x7fa7182de180, head = 0x7fa7182de250, tail = 0x7fa7182b8b90,
>   prev = 0x7fa7182ea010, freed = false}
> (gdb) p collector->pool
> $2 = (pn_list_t *) 0x7fa7182de180
> (gdb) p *collector->pool
> $3 = {clazz = 0x7fa74eb7c000, capacity = 16, size = 18446744073709551615,
>   elements = 0x7fa7182de1b0}
> {noformat}
> The proton code was marked up with print statements which show that two 
> threads were accessing the collector->pool data structure at the same time...
> {noformat}
>  7b070700: pn_list_pop index 0 list->0x7fec401e0b70 value->0x7fec3c728a10
>  4ffff700: pn_list_add index 1 size 2 list->0x7fec401e0b70 value->0x7fec402095b0
>  7b070700: pn_list_pop size 1 list->0x7fec401e0b70
>  4ffff700: pn_list_pop size 1 list->0x7fec401e0b70
>  7b070700: pn_list_pop index 0 list->0x7fec401e0b70 value->0x7fec3c728a10
>  4ffff700: pn_list_pop index 0 list->0x7fec401e0b70 value->0x7fec3c728a10
> {noformat}
> The hex number on the far left is the thread id. As can be seen in the last 
> two lines, two threads are popping from the collector->pool simultaneously, 
> which produces the -1 size shown above.


