[ 
https://issues.apache.org/jira/browse/PROTON-829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14338377#comment-14338377
 ] 

Rafael H. Schloming commented on PROTON-829:
--------------------------------------------

I added that code to get the python tests to pass without valgrind errors, so 
I'm surprised that removing it doesn't cause issues.

I'll try to explain a little bit about what's going on. (I've commented this 
elsewhere in the code, but I missed this spot. I think andrew pointed out that 
this might be clearer with a helper function instead of/in addition to a 
comment.)

There are definitely non-debug cases where it is useful to examine the 
refcount. In particular, a finalizer gets fired when the refcount hits zero, 
but the finalizer can sometimes run code that creates a new reference to the 
object that is being finalized. There are a few places where this pattern is 
very useful:

  - The finalizer can run some logic that decides whether or not the object 
should be pooled and if so sticking it in the pool adds a reference.
  - The finalizer might run some code that posts an event, thereby creating a 
reference from the collector to the object.
  - The connection/session/link/delivery/transport objects all have pointers to 
each other, and to just naively refcount all those pointers would result in 
circular references and memory leaks. In order to avoid circular references, we 
don't refcount internal pointers, e.g. the parent->child pointers, instead the 
finalizers of all the children have logic that runs and figures out if it is 
safe for the child to go away (basically does a tiny amount of incremental 
garbage collection). If the finalizer's "garbage collection" finds an internal 
reference, the parent of that object takes over the last external reference. In 
the latter case an external pointer to a child may be created again, and to 
account for this, the incref of all the child objects is overriden to remove 
the reference that the parent took over.

In all these cases, the logic that may/may not create a new reference is run at 
the beginning of the finalizer. The finalizer then needs to check if a new 
reference has actually been created in order to figure out whether or not to 
actually free the resources used by the object or to return. This pattern could 
perhaps be formalized by distinguishing between finalizers (the part that runs 
application logic which can create new references) and destructors (the part 
that actually frees anything malloced in the initializer) in the object code.

This should also explain why incref/decref is not necessarily a noop, it is 
basically retriggering the "gc" portion of the code in the finalizer that 
decides whether or not an object should really go away or not, and in the rare 
case that you toggle some state that might influence that gc logic, you need to 
get it to run again. Clearing the tp work queue is one such case. The reason it 
is guarded by a check on the refcount is that we also call that function from 
inside the delivery finalizer, and in that case if the refcount is zero we've 
already decided that the object should go away.

Again, this gc part of the pattern could probably be formalized a bit, e.g. 
possibly having some sort of mark/sweep style thing that is in fact triggered 
by external references reaching zero, but probably not a good idea to attempt 
anytime soon since what complicates this picture is not so much the fact that 
it isn't formalized, but rather that the engine uses ad-hoc data structures 
that predate the development of the collections that are part of the object 
library and therefore has to manually account and search for references in a 
type specific way rather than using the automatic ref counting and/or search 
routines available in the collections. This manual/type specific accounting is 
what makes the "gc" logic a bit brittle, and could possibly be the source of 
the bug you are seeing. At some point relatively soon (probably after 0.9 but 
before the next release) I expect to have to update a significant chunk of 
those data structures to properly use the collections, but until then if you 
manage to pare this down to a minimal reproducer in terms of proton API calls, 
that would definitely suggest the bug is in proton and I can probably help 
debug more from there. As it is, it looks like it could possibly just be a 
plain old double free style bug in messaging, and removing the above code just 
masks it by introducing a memory leak.

> Possible reference counting bug in pn_clear_tpwork
> --------------------------------------------------
>
>                 Key: PROTON-829
>                 URL: https://issues.apache.org/jira/browse/PROTON-829
>             Project: Qpid Proton
>          Issue Type: Bug
>          Components: proton-c
>    Affects Versions: 0.8
>            Reporter: Alan Conway
>            Assignee: Alan Conway
>             Fix For: 0.9
>
>
> See QPID-6415 which describes a core dump in the qpid tests that appears when 
> using the current 0.9 proton master. The qpid tests pass OK with proton 0.8.
> The valgrind output in QPID-6415 shows that a connection is deleted while it 
> is being finalized by  a call from pn_connection_unbound to pn_clear_tpwork.
> I do not yet understand the details, but removing the following strange code 
> fixes the problem and passes the proton test suite without valgrind errors:
> {noformat}
> --- a/proton-c/src/engine/engine.c
> +++ b/proton-c/src/engine/engine.c
> @@ -690,10 +690,10 @@ void pn_clear_tpwork(pn_delivery_t *delivery)
>    {
>      LL_REMOVE(connection, tpwork, delivery);
>      delivery->tpwork = false;
> -    if (pn_refcount(delivery) > 0) {
> -      pn_incref(delivery);
> -      pn_decref(delivery);
> -    }
>    }
>  }
> {noformat}
> The code is strange because
> a) you should never examine a refcount except for debugging purposes
> b) under normal refcounting semantics incref+decref is a no-op.
> Is removing this code OK?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to