zbentley opened a new pull request, #16535: URL: https://github.com/apache/pulsar/pull/16535
Fixes https://github.com/apache/pulsar/issues/16527 ### Motivation Copied from issue discussion: The problem is the [GIL](https://wiki.python.org/moin/GlobalInterpreterLock). When Python object reference counts are manipulated from compiled code, those manipulations are not atomic or protected by the GIL in any way. Incrementing a refcount is often coincidentally safe to do without the GIL, since the data structures in the Python interpreter that are altered by a refcount-bump are few and not terribly shared. However, decrementing a refcount without the GIL is extremely dangerous; the act of decrementing a refcount can trigger object destruction, which can then trigger more object destruction, and so on: decrementing a refcount triggers an arbitrary number of user functions (destructors) to run in Python, and can trigger wide-ranging changes (including system calls, memory allocation/deallocation--basically anything) across the interpreter's internal state. Running such operations in true multi-threaded parallel in Python is basically guaranteed to break things. In most cases (I'm guessing here, as I don't know Boost/C++ well), I think the attempt to clean up the reference either blocks or fails in such a way that the C++ runtime won't properly clean up an object, preventing thread reaping from running internally. In some cases, the racy python GC operations overlap with shared interpreter data structures and cause segfaults. In rare cases, "impossible" (i.e. random objects changing type) errors are raised in Python itself, though you may have to run the above snippet for a long time to see one of those. Such GIL-unprotected refcount manipulation occurs in the Pulsar client [here](https://github.com/apache/pulsar/blob/master/pulsar-client-cpp/python/src/config.cc#L106), [here](https://github.com/apache/pulsar/blob/master/pulsar-client-cpp/python/src/config.cc#L114), [here](https://github.com/apache/pulsar/blob/master/pulsar-client-cpp/python/src/config.cc#L172), and [here](https://github.com/apache/pulsar/blob/master/pulsar-client-cpp/python/src/config.cc#L176), though the first and third may be safe from this condition by dint of the fact that they're only invoked directly from calling Python code which already has the GIL. ### Modifications - **Done** protected logger handle refcount mutations with the GIL, resolving a segfault risk. - **Done** resolve thread leak. - **Done** add tests. - **Done** don't acquire GIL for logging messages that won't be emitted due to log level. ### Verifying this change This change added tests and can be verified via the Python client test suite: `python pulsar_test.py`. - [x] `doc-not-needed` Bugfix; external contracts do not change. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
