justintime4tea opened a new issue, #519: URL: https://github.com/apache/pulsar-client-cpp/issues/519
### Search before reporting - [x] I searched in the [issues](https://github.com/apache/pulsar/issues) and found nothing similar. ### What issue do you find in Pulsar docs? Ran into a rather nasty to debug bug when writing integration tests for a node-js application that uses "pulsar-client" NPM package. Essentially, the issue is though it seems like the "pulsar-client" supports multiple instances of a Pulsar client, the C++ side of things doesn't fully isolate these instances and a single `static LogCallback *logCallback` is created for ALL clients with each client overwriting this static callback with whatever handler the last executed `Pulsar.Client.setLogHandler` or `Pulsar.Client` constructor is given. The fact that you can only have a single log handler isn't documented other than some comments I see in the C++ code nor is the potential issue when setting multiple with multiple log handlers clients. ## Why this is an issue This `static LogCallback *logCallback` being shared results in a potential race-condition whereby some final message from one of the multiple C++ pulsar client is attempted to be logged and a crash occurs. Since neither the `Close` or `~Client()` destructor clear/set `static Client::logCallback` to null on the C++ side, it's possible that `static Client::logCallback` becomes a dangling pointer pointing to a `ThreadSafeFunction` that was part of a previous client, which has since been destroyed. ## Repro **Looking at the Client::Close method and the Client::~Client destructor:** - `Client::~Client() {}` (The destructor is empty) - `Napi::Value Client::Close(const Napi::CallbackInfo &info)` (This function only calls pulsar_client_close_async. It does not touch Client::logCallback.) **Conceptual repro steps:** 1. You create Client A, which sets the `static Client::logCallback` to point to its logger. 2. You create Client B, which replaces the `static Client::logCallback` to point to its logger. (Client A's original logger function is now abandoned). 3. You call await `ClientB.close()`. The Close method is called, and eventually `~Client()` is called. 4. Neither Close nor the destructor clears the logCallback. 5. logCallback is now a dangling pointer, pointing to a `ThreadSafeFunction` that was part of Client B, which is now destroyed. 6. A background thread from Client A tries to log a message. It uses the `static logCallback`, accesses the dangling pointer, and the process crashes. ### What is your suggestion? I suggest that, at bare minimum and baring the C++ code being "corrected", documentation is updated so that consumers of pulsar client libraries are made aware of this. I'm not sure exactly how, because it doesn't seem like you'd be safe even just setting the handler a single time due to the nature of the "close" and "destructor" code in C++. ### Any reference? **Diagnostics** Though I don't have a code repro I do have a breakdown of how I stumbled upon this - In integration tests where there are many pulsar clients being created and closed, as well as producers/consumers created/closed, along with the underlying Node JS app being tested creating/closing it's own pulsar client and consumers/producers, the race condition is triggered and a SIGABRT is displayed near the end of the test output with the following message: ``` libc++abi: terminating due to uncaught exception of type Napi::Error: ✖ SIGABRT ``` Running the tests using lldb to capture the coredump results in the following which lead me to the issue: ``` frame apache/pulsar#8: 0x00000001922fbbd8 libc++abi.dylib`__cxa_throw + 92 frame apache/pulsar#9: 0x0000000151a485d4 pulsar.node`Napi::Function::Call(std::initializer_list<napi_value__*> const&) const + 196 frame apache/pulsar#10: 0x0000000151a4fc10 pulsar.node`LogMessageProxy(Napi::Env, Napi::Function, LogMessage*) + 272 frame apache/pulsar#11: 0x0000000151a4c1ec pulsar.node`Napi::ThreadSafeFunction::CallJS(napi_env__*, napi_value__*, void*, void*) + 68 frame apache/pulsar#12: 0x00000001000c6b94 node`v8impl::(anonymous namespace)::ThreadSafeFunction::AsyncCb(uv_async_s*) + 472 ``` ## Workarounds - Explicitly calling `Pulsar.Client.setLogHandler(null)` in "cleanup" stage of tests and "graceful shutdown" stage of the NodeJS application bypasses the issue - For whatever reason, this only occurs on node versions below v24. - I tested and repro'ed in many v22.x versions and v23.x versions - I can't seem to reproduce this issue with node v24. ## Full Coredump: ``` * thread apache/pulsar#1 * frame #0: 0x000000019230a388 libsystem_kernel.dylib`__pthread_kill + 8 frame apache/pulsar#1: 0x000000019234388c libsystem_pthread.dylib`pthread_kill + 296 frame apache/pulsar#2: 0x000000019224ca3c libsystem_c.dylib`abort + 124 frame apache/pulsar#3: 0x00000001922f9384 libc++abi.dylib`abort_message + 132 frame apache/pulsar#4: 0x00000001922e7cd8 libc++abi.dylib`demangling_terminate_handler() + 316 frame apache/pulsar#5: 0x0000000191f6cde4 libobjc.A.dylib`_objc_terminate() + 172 frame apache/pulsar#6: 0x00000001922f8698 libc++abi.dylib`std::__terminate(void (*)()) + 16 frame apache/pulsar#7: 0x00000001922fbc30 libc++abi.dylib`__cxxabiv1::failed_throw(__cxxabiv1::__cxa_exception*) + 88 frame apache/pulsar#8: 0x00000001922fbbd8 libc++abi.dylib`__cxa_throw + 92 frame apache/pulsar#9: 0x0000000151a485d4 pulsar.node`Napi::Function::Call(std::initializer_list<napi_value__*> const&) const + 196 frame apache/pulsar#10: 0x0000000151a4fc10 pulsar.node`LogMessageProxy(Napi::Env, Napi::Function, LogMessage*) + 272 frame apache/pulsar#11: 0x0000000151a4c1ec pulsar.node`Napi::ThreadSafeFunction::CallJS(napi_env__*, napi_value__*, void*, void*) + 68 frame apache/pulsar#12: 0x00000001000c6b94 node`v8impl::(anonymous namespace)::ThreadSafeFunction::AsyncCb(uv_async_s*) + 472 frame apache/pulsar#13: 0x0000000100f01478 node`uv__async_io + 304 frame apache/pulsar#14: 0x0000000100f1524c node`uv__io_poll + 1512 frame apache/pulsar#15: 0x0000000100f019e0 node`uv_run + 408 frame apache/pulsar#16: 0x0000000100093004 node`node::Environment::CleanupHandles() + 188 frame apache/pulsar#17: 0x000000010009377c node`node::Environment::RunCleanup() + 292 frame apache/pulsar#18: 0x0000000100009ff0 node`node::FreeEnvironment(node::Environment*) + 128 frame apache/pulsar#19: 0x000000010014d26c node`node::NodeMainInstance::Run() + 304 frame apache/pulsar#20: 0x00000001000bf5f4 node`node::Start(int, char**) + 600 frame apache/pulsar#21: 0x0000000191fa2b98 dyld`start + 6076 thread apache/pulsar#2 frame #0: 0x0000000192307d04 libsystem_kernel.dylib`kevent + 8 frame apache/pulsar#1: 0x0000000100f14f74 node`uv__io_poll + 784 frame apache/pulsar#2: 0x0000000100f019e0 node`uv_run + 408 frame apache/pulsar#3: 0x00000001001825f0 node`node::WorkerThreadsTaskRunner::DelayedTaskScheduler::Run() + 352 frame apache/pulsar#4: 0x0000000192343c0c libsystem_pthread.dylib`_pthread_start + 136 thread apache/pulsar#3 frame #0: 0x00000001923053cc libsystem_kernel.dylib`__psynch_cvwait + 8 frame apache/pulsar#1: 0x00000001923440e0 libsystem_pthread.dylib`_pthread_cond_wait + 984 frame apache/pulsar#2: 0x0000000100f10054 node`uv_cond_wait + 40 frame apache/pulsar#3: 0x00000001001827d8 node`node::TaskQueue<v8::Task>::BlockingPop() + 60 frame apache/pulsar#4: 0x000000010018028c node`node::(anonymous namespace)::PlatformWorkerThread(void*) + 384 frame apache/pulsar#5: 0x0000000192343c0c libsystem_pthread.dylib`_pthread_start + 136 thread apache/pulsar#4 frame #0: 0x00000001923053cc libsystem_kernel.dylib`__psynch_cvwait + 8 frame apache/pulsar#1: 0x00000001923440e0 libsystem_pthread.dylib`_pthread_cond_wait + 984 frame apache/pulsar#2: 0x0000000100f10054 node`uv_cond_wait + 40 frame apache/pulsar#3: 0x00000001001827d8 node`node::TaskQueue<v8::Task>::BlockingPop() + 60 frame apache/pulsar#4: 0x000000010018028c node`node::(anonymous namespace)::PlatformWorkerThread(void*) + 384 frame apache/pulsar#5: 0x0000000192343c0c libsystem_pthread.dylib`_pthread_start + 136 thread apache/pulsar#5 frame #0: 0x00000001923053cc libsystem_kernel.dylib`__psynch_cvwait + 8 frame apache/pulsar#1: 0x00000001923440e0 libsystem_pthread.dylib`_pthread_cond_wait + 984 frame apache/pulsar#2: 0x0000000100f10054 node`uv_cond_wait + 40 frame apache/pulsar#3: 0x00000001001827d8 node`node::TaskQueue<v8::Task>::BlockingPop() + 60 frame apache/pulsar#4: 0x000000010018028c node`node::(anonymous namespace)::PlatformWorkerThread(void*) + 384 frame apache/pulsar#5: 0x0000000192343c0c libsystem_pthread.dylib`_pthread_start + 136 thread apache/pulsar#6 frame #0: 0x00000001923053cc libsystem_kernel.dylib`__psynch_cvwait + 8 frame apache/pulsar#1: 0x00000001923440e0 libsystem_pthread.dylib`_pthread_cond_wait + 984 frame apache/pulsar#2: 0x0000000100f10054 node`uv_cond_wait + 40 frame apache/pulsar#3: 0x00000001001827d8 node`node::TaskQueue<v8::Task>::BlockingPop() + 60 frame apache/pulsar#4: 0x000000010018028c node`node::(anonymous namespace)::PlatformWorkerThread(void*) + 384 frame apache/pulsar#5: 0x0000000192343c0c libsystem_pthread.dylib`_pthread_start + 136 thread apache/pulsar#7 frame #0: 0x0000000192301bb0 libsystem_kernel.dylib`semaphore_wait_trap + 8 frame apache/pulsar#1: 0x0000000100f0fed0 node`uv_sem_wait + 24 frame apache/pulsar#2: 0x00000001002508bc node`node::inspector::(anonymous namespace)::StartIoThreadMain(void*) + 56 frame apache/pulsar#3: 0x0000000192343c0c libsystem_pthread.dylib`_pthread_start + 136 thread apache/pulsar#8 frame #0: 0x00000001923053cc libsystem_kernel.dylib`__psynch_cvwait + 8 frame apache/pulsar#1: 0x00000001923440e0 libsystem_pthread.dylib`_pthread_cond_wait + 984 frame apache/pulsar#2: 0x0000000100f10054 node`uv_cond_wait + 40 frame apache/pulsar#3: 0x0000000100efdbe4 node`worker + 392 frame apache/pulsar#4: 0x0000000192343c0c libsystem_pthread.dylib`_pthread_start + 136 thread apache/pulsar#9 frame #0: 0x00000001923053cc libsystem_kernel.dylib`__psynch_cvwait + 8 frame apache/pulsar#1: 0x00000001923440e0 libsystem_pthread.dylib`_pthread_cond_wait + 984 frame apache/pulsar#2: 0x0000000100f10054 node`uv_cond_wait + 40 frame apache/pulsar#3: 0x0000000100efdbe4 node`worker + 392 frame apache/pulsar#4: 0x0000000192343c0c libsystem_pthread.dylib`_pthread_start + 136 thread apache/pulsar#10 frame #0: 0x00000001923053cc libsystem_kernel.dylib`__psynch_cvwait + 8 frame apache/pulsar#1: 0x00000001923440e0 libsystem_pthread.dylib`_pthread_cond_wait + 984 frame apache/pulsar#2: 0x0000000100f10054 node`uv_cond_wait + 40 frame apache/pulsar#3: 0x0000000100efdbe4 node`worker + 392 frame apache/pulsar#4: 0x0000000192343c0c libsystem_pthread.dylib`_pthread_start + 136 thread apache/pulsar#11 frame #0: 0x00000001923053cc libsystem_kernel.dylib`__psynch_cvwait + 8 frame apache/pulsar#1: 0x00000001923440e0 libsystem_pthread.dylib`_pthread_cond_wait + 984 frame apache/pulsar#2: 0x0000000100f10054 node`uv_cond_wait + 40 frame apache/pulsar#3: 0x0000000100efdbe4 node`worker + 392 frame apache/pulsar#4: 0x0000000192343c0c libsystem_pthread.dylib`_pthread_start + 136 thread apache/pulsar#12 frame #0: 0x00000001923038b0 libsystem_kernel.dylib`__workq_kernreturn + 8 ``` ### Are you willing to submit a PR? - [ ] I'm willing to submit a PR! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
