[ https://issues.apache.org/jira/browse/PROTON-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17418073#comment-17418073 ]
Robbie Gemmell edited comment on PROTON-2432 at 9/21/21, 12:00 PM:
-------------------------------------------------------------------
In general, killing the connection immediately at the server for a rejected
message seems like the wrong approach, unless you absolutely know the client is
about to do that itself (which it doesn't necessarily need to, and often
wouldn't), so I think your 'workaround' is actually just what it should always
do: if you expect your client to close, give it the chance to do so first, and
only then nuke it.
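Something along these lines, as a rough sketch (the handler shape and the
acceptable() policy check are just illustrative names of mine, assuming a
proton::messaging_handler based server):
{noformat}
#include <proton/connection.hpp>
#include <proton/delivery.hpp>
#include <proton/message.hpp>
#include <proton/messaging_handler.hpp>

class server_handler : public proton::messaging_handler {
    // Hypothetical application policy; stands in for whatever check
    // leads you to reject the message.
    bool acceptable(const proton::message&) { return false; }

    void on_message(proton::delivery& d, proton::message& m) override {
        if (!acceptable(m)) {
            d.reject();  // Tell the client; do NOT close the connection here.
            // Give the client the chance to close first. Any forced
            // fallback close is better injected via the connection's
            // work_queue (see the work_queue sketch below).
        } else {
            d.accept();
        }
    }

    // Fires when the client closes its end; closing our side here is
    // safe, since we are on this connection's own handler thread.
    void on_connection_close(proton::connection& c) override {
        c.close();
    }
};
{noformat}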
I don't think your proposed patch is the way to go. Again, most parts of proton
as a whole aren't thread safe, so although the patch may stop that particular
bit blowing up, there is still unexpected concurrency occurring and something
else might blow up instead.
I'm guessing that container.schedule(IMMEDIATE,...); doesn't require use of a
particular thread if your container has more than one, as yours does. In the
case of what you are doing, that means you would need to ensure your connection
isn't doing something else on another thread at the same time, as connections
are to be used by one thread at a time.
I see that the connection exposes its own work_queue, which has its own
schedule method, for injecting work for that connection. I think perhaps that's
something you should try:
https://qpid.apache.org/releases/qpid-proton-0.35.0/proton/cpp/api/classproton_1_1connection.html#a7c755d6ac6385e007adb61966598ba63
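E.g. something like this sketch (not tested; the function name and the 5
second grace period are just illustrative choices of mine), queuing a fallback
forced-close on the connection's own work_queue so it runs serialized with
that connection's other activity rather than racing it from another thread:
{noformat}
#include <proton/connection.hpp>
#include <proton/delivery.hpp>
#include <proton/duration.hpp>
#include <proton/receiver.hpp>
#include <proton/work_queue.hpp>

// Call from on_message(), i.e. on the connection's own thread, right
// after d.reject(). It queues a forced close to run 5 seconds later
// (an arbitrary grace period) on that same connection's work queue,
// so it can never race the connection's other callbacks. If the client
// already closed in the meantime, the assumption is that closing again
// is benign.
void schedule_fallback_close(proton::delivery& d) {
    proton::connection conn = d.receiver().connection();
    conn.work_queue().schedule(proton::duration(5000),  // milliseconds
                               [conn]() mutable { conn.close(); });
}
{noformat}
The lambda copies the connection handle, keeping the underlying object valid
until the scheduled work runs.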
> Proton crashes because of a concurrency failure in collector->pool
> ------------------------------------------------------------------
>
> Key: PROTON-2432
> URL: https://issues.apache.org/jira/browse/PROTON-2432
> Project: Qpid Proton
> Issue Type: Bug
> Components: proton-c
> Affects Versions: proton-c-0.32.0
> Environment: RHEL 7
> Reporter: Jesse Hulsizer
> Priority: Major
> Attachments: proton-2432.patch
>
>
> While running our application tests, our application crashes with many
> different backtraces that look similar to this...
> {noformat}
> #0  0x0000000000000000 in ?? ()
> #1  0x00007fc777579198 in pn_class_incref () from /usr/lib64/libqpid-proton.so.11
> #2  0x00007fc777587d8a in pn_collector_put () from /usr/lib64/libqpid-proton.so.11
> #3  0x00007fc7775887ea in ?? () from /usr/lib64/libqpid-proton.so.11
> #4  0x00007fc777588c7b in pn_transport_pending () from /usr/lib64/libqpid-proton.so.11
> #5  0x00007fc777588d9e in pn_transport_pop () from /usr/lib64/libqpid-proton.so.11
> #6  0x00007fc777599298 in ?? () from /usr/lib64/libqpid-proton.so.11
> #7  0x00007fc77759a784 in ?? () from /usr/lib64/libqpid-proton.so.11
> #8  0x00007fc7773236f0 in proton::container::impl::thread() () from /usr/lib64/libqpid-proton-cpp.so.12
> #9  0x00007fc7760b2470 in ?? () from /usr/lib64/libstdc++.so.6
> #10 0x00007fc776309aa1 in start_thread () from /lib64/libpthread.so.0
> #11 0x00007fc7758b6bdd in clone () from /lib64/libc.so.6
> {noformat}
> Using gdb to probe one of the backtraces shows that the collector->pool size
> is -1 (seen here as its unsigned representation, 18446744073709551615)...
> {noformat}
> (gdb) p *collector
> $1 = {pool = 0x7fa7182de180, head = 0x7fa7182de250, tail = 0x7fa7182b8b90, prev = 0x7fa7182ea010, freed = false}
> (gdb) p collector->pool
> $2 = (pn_list_t *) 0x7fa7182de180
> (gdb) p *collector->pool
> $3 = {clazz = 0x7fa74eb7c000, capacity = 16, size = 18446744073709551615, elements = 0x7fa7182de1b0}
> {noformat}
> The proton code was marked up with print statements which show that two
> threads were accessing the collector->pool data structure at the same time...
> {noformat}
> 7b070700: pn_list_pop index 0 list->0x7fec401e0b70 value->0x7fec3c728a10
> 4ffff700: pn_list_add index 1 size 2 list->0x7fec401e0b70 value->0x7fec402095b0
> 7b070700: pn_list_pop size 1 list->0x7fec401e0b70
> 4ffff700: pn_list_pop size 1 list->0x7fec401e0b70
> 7b070700: pn_list_pop index 0 list->0x7fec401e0b70 value->0x7fec3c728a10
> 4ffff700: pn_list_pop index 0 list->0x7fec401e0b70 value->0x7fec3c728a10
> {noformat}
> The hex number on the far left is the thread id. As can be seen in the last
> two lines, two threads are popping from the collector->pool simultaneously.
> This produces the -1 size seen above.