Pete Fawcett created PROTON-2903:
------------------------------------
Summary: Python Fatal Error in void2py ffi.from_handle
Key: PROTON-2903
URL: https://issues.apache.org/jira/browse/PROTON-2903
Project: Qpid Proton
Issue Type: Bug
Components: python-binding
Affects Versions: proton-c-0.39.0, proton-c-future
Environment: qpid-proton 0.41.0-dev (current main branch)
Rocky Linux 9.5
Python 3.12.5 (CPython)
cffi (pip package) 1.17.1
Reporter: Pete Fawcett
I am experiencing a sporadic, but not infrequent, "Python Fatal Error" in a
proton client. It seems to happen more often when the client is under heavier
load. I am investigating using the current main branch, but the relevant code
appears to be unchanged since v0.39.0.
The error is happening in the call to {{ffi.from_handle}} within
{{cproton.void2py}}.
{{from_handle()}} is calling {{Py_FatalError()}} because the handle passed in,
whilst being a {{<cdata void *>}} object, is not pointing to an object of the
expected type ({{CDataOwningGC_Type}}).
I made a custom version of {{cffi}} that includes, in the error message, the
type actually being pointed to. The results varied, but they all seemed to be
valid types, i.e. not garbage.
This suggests to me that the original object has been freed and its memory
reused by a new object.
I am only beginning to learn about the use of CFFI handles, but from what I
understand, it is the responsibility of the Python code, when creating a handle
to a Python object using {{ffi.new_handle()}}, to keep the handle and the
original object 'alive' by maintaining a reference to the handle.
This appears to be the intended function of {{cproton.retained_objects}} but I
think the current implementation is too simple.
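From what I understand of that contract, it can be illustrated without cffi at
all. The sketch below (the {{Payload}} and {{FakeHandle}} classes are
hypothetical stand-ins, not Proton or cffi code) simulates the keep-alive rule
with {{weakref}}: the wrapped object survives exactly as long as something
references its handle.

```python
# Illustration only: simulates the CFFI keep-alive contract in plain
# Python.  Payload and FakeHandle are hypothetical stand-ins; with real
# cffi, ffi.new_handle(obj) returns the handle, and ffi.from_handle()
# on a handle that was allowed to die is what triggers the fatal error.
import gc
import weakref

class Payload:
    """Stands in for the Python object passed to ffi.new_handle()."""

class FakeHandle:
    """Stands in for the <cdata 'void *'> handle; it holds the only
    strong reference to the payload."""
    def __init__(self, target):
        self._target = target

obj = Payload()
probe = weakref.ref(obj)        # lets us observe whether obj is alive

handle = FakeHandle(obj)
del obj                         # drop the direct reference ...
gc.collect()
assert probe() is not None      # ... payload alive: the handle retains it

del handle                      # drop the last reference to the handle
gc.collect()
assert probe() is None          # payload collected along with it; a
                                # from_handle() on the old pointer would
                                # now hit freed/reused memory
```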
Some things that I think are contributing to the problem are:
* {{retained_objects}} is a set, so it effectively provides a reference count
of at most 1
* {{retained_objects}} is added to on the 'Python side' following calls to
{{new_handle}} but also, indirectly, from the 'C side' via {{pn_pyref_incref}}
* Different {{<cdata>}} instances can hash to the same value and so are
treated as being the same within {{retained_objects}}. (I think the hash
is based on the content, i.e. on where a {{<cdata void *>}} is pointing)
One potentially problematic call sequence I have seen is in
{{cproton.pn_collector_put_pyref}}
# A new handle ({{d}}) is created and added to {{retained_objects}}
# {{d}} is passed to {{lib.pn_collector_put_py}}
# within that call a callback is made to {{pn_pyref_incref}} with parameter
{{obj}}
# {{obj}} is a different instance to {{d}} (the {{id()}} differs) but it has
the same hash, so the call to {{retained_objects.add(obj)}} doesn't add a
new item.
So, if at some point {{pn_pyref_decref(obj)}} is called, it will release the
original {{d}} object.
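The sequence above can be reproduced in miniature with plain Python (the
{{Handle}} class below is hypothetical, not Proton code): it stands in for a
{{<cdata void *>}} whose hash and equality are based on the address it points
to.

```python
# Illustration only: why a set gives each handle value a "reference
# count" of at most 1.  Handle is a hypothetical stand-in for a
# <cdata 'void *'> whose hash/equality follow the pointed-to address.
class Handle:
    def __init__(self, address):
        self.address = address
    def __hash__(self):
        return hash(self.address)       # hash based on where it points
    def __eq__(self, other):
        return self.address == other.address

retained_objects = set()                # plays cproton.retained_objects

d = Handle(0x1000)                      # step 1: new handle is retained
retained_objects.add(d)

obj = Handle(0x1000)                    # steps 3-4: a distinct instance
assert obj is not d                     # (the id() differs) ...
retained_objects.add(obj)               # ... but add() is a no-op
assert len(retained_objects) == 1

retained_objects.discard(obj)           # a single decref on obj ...
assert len(retained_objects) == 0       # ... releases d as well
```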
Whilst I think I can see why the problem might be happening, I am not sure how
to fix it. In particular, I haven't yet worked out what would be the correct
point to release the reference in {{retained_objects}}. Not releasing them
at all would stop the fatal error but would effectively be a memory leak, with
the resultant problems.
It might be that better reference counting is required, rather than just the
single level afforded by {{retained_objects}} being a {{set}} - but then there
might not be "enough" {{decref}} calls to free up the object at all.
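As a sketch of what such reference counting might look like (illustrative
only; {{retained_counts}}, {{retain}} and {{release}} are hypothetical names,
and this does not answer where the final {{decref}} should come from):

```python
# Hypothetical sketch: count retentions per handle instead of storing
# them in a set, so every incref must be matched by a decref before the
# handle is actually released.  Not a tested fix for Proton.
from collections import Counter

retained_counts = Counter()

def retain(handle):
    retained_counts[handle] += 1

def release(handle):
    retained_counts[handle] -= 1
    if retained_counts[handle] <= 0:
        del retained_counts[handle]     # last reference gone: drop it

retain("h1")                # e.g. new_handle on the Python side
retain("h1")                # e.g. pn_pyref_incref from the C side
release("h1")               # one decref ...
assert "h1" in retained_counts          # ... does not free the handle
release("h1")               # matching second decref
assert "h1" not in retained_counts      # now it is released
```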
Perhaps releasing the handle within {{void2py}} would work, but that feels too
risky without better understanding of the intended lifetime.
I would appreciate some guidance on the original design thoughts behind the
current implementation, particularly around knowing when the handle is no
longer required.
My current client program is non-trivial. It both receives and sends messages
and also makes use of {{ApplicationEvent}}. I am looking at how to make a
simpler client that can still produce the error.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]