justintime4tea opened a new issue, #519:
URL: https://github.com/apache/pulsar-client-cpp/issues/519

   ### Search before reporting
   
   - [x] I searched in the [issues](https://github.com/apache/pulsar/issues) 
and found nothing similar.
   
   
   ### What issue do you find in Pulsar docs?
   
   Ran into a rather nasty to debug bug when writing integration tests for a 
node-js application that uses "pulsar-client" NPM package. Essentially, the 
issue is though it seems like the "pulsar-client" supports multiple instances 
of a Pulsar client, the C++ side of things doesn't fully isolate these 
instances and a single `static LogCallback *logCallback` is created for ALL 
clients with each client overwriting this static callback with whatever handler 
the last executed `Pulsar.Client.setLogHandler` or `Pulsar.Client` constructor 
is given. The fact that you can only have a single log handler isn't documented 
other than some comments I see in the C++ code nor is the potential issue when 
setting multiple with multiple log handlers clients.
   
   ## Why this is an issue
   This `static LogCallback *logCallback` being shared results in a potential 
race-condition whereby some final message from one of the multiple C++ pulsar 
client is attempted to be logged and a crash occurs. Since neither the `Close` 
or `~Client()` destructor clear/set `static Client::logCallback` to null on the 
C++ side, it's possible that `static Client::logCallback` becomes a dangling 
pointer pointing to a `ThreadSafeFunction` that was part of a previous client, 
which has since been destroyed.
   
   ## Repro
   **Looking at the Client::Close method and the Client::~Client destructor:**
   - `Client::~Client() {}` (The destructor is empty)
   - `Napi::Value Client::Close(const Napi::CallbackInfo &info)` (This function 
only calls pulsar_client_close_async. It does not touch Client::logCallback.)
   
   **Conceptual repro steps:**
   
   1. You create Client A, which sets the `static Client::logCallback` to point 
to its logger.
   2. You create Client B, which replaces the `static Client::logCallback` to 
point to its logger. (Client A's original logger function is now abandoned).
   3. You call await `ClientB.close()`. The Close method is called, and 
eventually `~Client()` is called.
   4. Neither Close nor the destructor clears the logCallback.
   5. logCallback is now a dangling pointer, pointing to a `ThreadSafeFunction` 
that was part of Client B, which is now destroyed.
   6. A background thread from Client A tries to log a message. It uses the 
`static logCallback`, accesses the dangling pointer, and the process crashes.
   
   ### What is your suggestion?
   
   I suggest that, at bare minimum and baring the C++ code being "corrected", 
documentation is updated so that consumers of pulsar client libraries are made 
aware of this. I'm not sure exactly how, because it doesn't seem like you'd be 
safe even just setting the handler a single time due to the nature of the 
"close" and "destructor" code in C++.
   
   ### Any reference?
   
   **Diagnostics**
   Though I don't have a code repro I do have a breakdown of how I stumbled 
upon this - In integration tests where there are many pulsar clients being 
created and closed, as well as producers/consumers created/closed, along with 
the underlying Node JS app being tested creating/closing it's own pulsar client 
and consumers/producers, the race condition is triggered and a SIGABRT is 
displayed near the end of the test output with the following message:
   
   ```
   libc++abi: terminating due to uncaught exception of type Napi::Error:
    ✖  SIGABRT
   ```
   
   Running the tests using lldb to capture the coredump results in the 
following which lead me to the issue:
   
   ```
       frame apache/pulsar#8: 0x00000001922fbbd8 libc++abi.dylib`__cxa_throw + 
92
       frame apache/pulsar#9: 0x0000000151a485d4 
pulsar.node`Napi::Function::Call(std::initializer_list<napi_value__*> const&) 
const + 196
       frame apache/pulsar#10: 0x0000000151a4fc10 
pulsar.node`LogMessageProxy(Napi::Env, Napi::Function, LogMessage*) + 272
       frame apache/pulsar#11: 0x0000000151a4c1ec 
pulsar.node`Napi::ThreadSafeFunction::CallJS(napi_env__*, napi_value__*, void*, 
void*) + 68
       frame apache/pulsar#12: 0x00000001000c6b94 node`v8impl::(anonymous 
namespace)::ThreadSafeFunction::AsyncCb(uv_async_s*) + 472
   ```
   
   ## Workarounds
   - Explicitly calling `Pulsar.Client.setLogHandler(null)` in "cleanup" stage 
of tests and "graceful shutdown" stage of the NodeJS application bypasses the 
issue
   - For whatever reason, this only occurs on node versions below v24. 
     - I tested and repro'ed in many v22.x versions and v23.x versions
     - I can't seem to reproduce this issue with node v24.
   
   ## Full Coredump:
   
   ```
   * thread apache/pulsar#1
     * frame #0: 0x000000019230a388 libsystem_kernel.dylib`__pthread_kill + 8
       frame apache/pulsar#1: 0x000000019234388c 
libsystem_pthread.dylib`pthread_kill + 296
       frame apache/pulsar#2: 0x000000019224ca3c libsystem_c.dylib`abort + 124
       frame apache/pulsar#3: 0x00000001922f9384 libc++abi.dylib`abort_message 
+ 132
       frame apache/pulsar#4: 0x00000001922e7cd8 
libc++abi.dylib`demangling_terminate_handler() + 316
       frame apache/pulsar#5: 0x0000000191f6cde4 
libobjc.A.dylib`_objc_terminate() + 172
       frame apache/pulsar#6: 0x00000001922f8698 
libc++abi.dylib`std::__terminate(void (*)()) + 16
       frame apache/pulsar#7: 0x00000001922fbc30 
libc++abi.dylib`__cxxabiv1::failed_throw(__cxxabiv1::__cxa_exception*) + 88
       frame apache/pulsar#8: 0x00000001922fbbd8 libc++abi.dylib`__cxa_throw + 
92
       frame apache/pulsar#9: 0x0000000151a485d4 
pulsar.node`Napi::Function::Call(std::initializer_list<napi_value__*> const&) 
const + 196
       frame apache/pulsar#10: 0x0000000151a4fc10 
pulsar.node`LogMessageProxy(Napi::Env, Napi::Function, LogMessage*) + 272
       frame apache/pulsar#11: 0x0000000151a4c1ec 
pulsar.node`Napi::ThreadSafeFunction::CallJS(napi_env__*, napi_value__*, void*, 
void*) + 68
       frame apache/pulsar#12: 0x00000001000c6b94 node`v8impl::(anonymous 
namespace)::ThreadSafeFunction::AsyncCb(uv_async_s*) + 472
       frame apache/pulsar#13: 0x0000000100f01478 node`uv__async_io + 304
       frame apache/pulsar#14: 0x0000000100f1524c node`uv__io_poll + 1512
       frame apache/pulsar#15: 0x0000000100f019e0 node`uv_run + 408
       frame apache/pulsar#16: 0x0000000100093004 
node`node::Environment::CleanupHandles() + 188
       frame apache/pulsar#17: 0x000000010009377c 
node`node::Environment::RunCleanup() + 292
       frame apache/pulsar#18: 0x0000000100009ff0 
node`node::FreeEnvironment(node::Environment*) + 128
       frame apache/pulsar#19: 0x000000010014d26c 
node`node::NodeMainInstance::Run() + 304
       frame apache/pulsar#20: 0x00000001000bf5f4 node`node::Start(int, char**) 
+ 600
       frame apache/pulsar#21: 0x0000000191fa2b98 dyld`start + 6076
     thread apache/pulsar#2
       frame #0: 0x0000000192307d04 libsystem_kernel.dylib`kevent + 8
       frame apache/pulsar#1: 0x0000000100f14f74 node`uv__io_poll + 784
       frame apache/pulsar#2: 0x0000000100f019e0 node`uv_run + 408
       frame apache/pulsar#3: 0x00000001001825f0 
node`node::WorkerThreadsTaskRunner::DelayedTaskScheduler::Run() + 352
       frame apache/pulsar#4: 0x0000000192343c0c 
libsystem_pthread.dylib`_pthread_start + 136
     thread apache/pulsar#3
       frame #0: 0x00000001923053cc libsystem_kernel.dylib`__psynch_cvwait + 8
       frame apache/pulsar#1: 0x00000001923440e0 
libsystem_pthread.dylib`_pthread_cond_wait + 984
       frame apache/pulsar#2: 0x0000000100f10054 node`uv_cond_wait + 40
       frame apache/pulsar#3: 0x00000001001827d8 
node`node::TaskQueue<v8::Task>::BlockingPop() + 60
       frame apache/pulsar#4: 0x000000010018028c node`node::(anonymous 
namespace)::PlatformWorkerThread(void*) + 384
       frame apache/pulsar#5: 0x0000000192343c0c 
libsystem_pthread.dylib`_pthread_start + 136
     thread apache/pulsar#4
       frame #0: 0x00000001923053cc libsystem_kernel.dylib`__psynch_cvwait + 8
       frame apache/pulsar#1: 0x00000001923440e0 
libsystem_pthread.dylib`_pthread_cond_wait + 984
       frame apache/pulsar#2: 0x0000000100f10054 node`uv_cond_wait + 40
       frame apache/pulsar#3: 0x00000001001827d8 
node`node::TaskQueue<v8::Task>::BlockingPop() + 60
       frame apache/pulsar#4: 0x000000010018028c node`node::(anonymous 
namespace)::PlatformWorkerThread(void*) + 384
       frame apache/pulsar#5: 0x0000000192343c0c 
libsystem_pthread.dylib`_pthread_start + 136
     thread apache/pulsar#5
       frame #0: 0x00000001923053cc libsystem_kernel.dylib`__psynch_cvwait + 8
       frame apache/pulsar#1: 0x00000001923440e0 
libsystem_pthread.dylib`_pthread_cond_wait + 984
       frame apache/pulsar#2: 0x0000000100f10054 node`uv_cond_wait + 40
       frame apache/pulsar#3: 0x00000001001827d8 
node`node::TaskQueue<v8::Task>::BlockingPop() + 60
       frame apache/pulsar#4: 0x000000010018028c node`node::(anonymous 
namespace)::PlatformWorkerThread(void*) + 384
       frame apache/pulsar#5: 0x0000000192343c0c 
libsystem_pthread.dylib`_pthread_start + 136
     thread apache/pulsar#6
       frame #0: 0x00000001923053cc libsystem_kernel.dylib`__psynch_cvwait + 8
       frame apache/pulsar#1: 0x00000001923440e0 
libsystem_pthread.dylib`_pthread_cond_wait + 984
       frame apache/pulsar#2: 0x0000000100f10054 node`uv_cond_wait + 40
       frame apache/pulsar#3: 0x00000001001827d8 
node`node::TaskQueue<v8::Task>::BlockingPop() + 60
       frame apache/pulsar#4: 0x000000010018028c node`node::(anonymous 
namespace)::PlatformWorkerThread(void*) + 384
       frame apache/pulsar#5: 0x0000000192343c0c 
libsystem_pthread.dylib`_pthread_start + 136
     thread apache/pulsar#7
       frame #0: 0x0000000192301bb0 libsystem_kernel.dylib`semaphore_wait_trap 
+ 8
       frame apache/pulsar#1: 0x0000000100f0fed0 node`uv_sem_wait + 24
       frame apache/pulsar#2: 0x00000001002508bc 
node`node::inspector::(anonymous namespace)::StartIoThreadMain(void*) + 56
       frame apache/pulsar#3: 0x0000000192343c0c 
libsystem_pthread.dylib`_pthread_start + 136
     thread apache/pulsar#8
       frame #0: 0x00000001923053cc libsystem_kernel.dylib`__psynch_cvwait + 8
       frame apache/pulsar#1: 0x00000001923440e0 
libsystem_pthread.dylib`_pthread_cond_wait + 984
       frame apache/pulsar#2: 0x0000000100f10054 node`uv_cond_wait + 40
       frame apache/pulsar#3: 0x0000000100efdbe4 node`worker + 392
       frame apache/pulsar#4: 0x0000000192343c0c 
libsystem_pthread.dylib`_pthread_start + 136
     thread apache/pulsar#9
       frame #0: 0x00000001923053cc libsystem_kernel.dylib`__psynch_cvwait + 8
       frame apache/pulsar#1: 0x00000001923440e0 
libsystem_pthread.dylib`_pthread_cond_wait + 984
       frame apache/pulsar#2: 0x0000000100f10054 node`uv_cond_wait + 40
       frame apache/pulsar#3: 0x0000000100efdbe4 node`worker + 392
       frame apache/pulsar#4: 0x0000000192343c0c 
libsystem_pthread.dylib`_pthread_start + 136
     thread apache/pulsar#10
       frame #0: 0x00000001923053cc libsystem_kernel.dylib`__psynch_cvwait + 8
       frame apache/pulsar#1: 0x00000001923440e0 
libsystem_pthread.dylib`_pthread_cond_wait + 984
       frame apache/pulsar#2: 0x0000000100f10054 node`uv_cond_wait + 40
       frame apache/pulsar#3: 0x0000000100efdbe4 node`worker + 392
       frame apache/pulsar#4: 0x0000000192343c0c 
libsystem_pthread.dylib`_pthread_start + 136
     thread apache/pulsar#11
       frame #0: 0x00000001923053cc libsystem_kernel.dylib`__psynch_cvwait + 8
       frame apache/pulsar#1: 0x00000001923440e0 
libsystem_pthread.dylib`_pthread_cond_wait + 984
       frame apache/pulsar#2: 0x0000000100f10054 node`uv_cond_wait + 40
       frame apache/pulsar#3: 0x0000000100efdbe4 node`worker + 392
       frame apache/pulsar#4: 0x0000000192343c0c 
libsystem_pthread.dylib`_pthread_start + 136
     thread apache/pulsar#12
       frame #0: 0x00000001923038b0 libsystem_kernel.dylib`__workq_kernreturn + 
8
   ```
   
   
   ### Are you willing to submit a PR?
   
   - [ ] I'm willing to submit a PR!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to