RaphDal opened a new issue, #6869: URL: https://github.com/apache/opendal/issues/6869
### Describe the bug Tokio worker threads used by OpenDAL rely on thread-local storage (TLS) in `jni-rs` to detach themselves from the JVM on shutdown. Detaching from the JVM requires calling back into the JVM, which may block if a GC is in progress. When this happens from a TLS destructor, the code runs while the Windows [loader lock](https://learn.microsoft.com/en-us/windows/win32/dlls/dynamic-link-library-best-practices) is held. At the same time, when the JVM spawns a new Java thread, it tries to acquire the loader lock to perform thread initialization before reaching a safepoint. If a GC occurs while an OpenDAL worker thread is shutting down and a new Java thread is being created, a deadlock can occur: the Java thread is blocked waiting for the loader lock, while the OpenDAL worker thread is holding the loader lock and waiting for the GC to complete. ### Steps to Reproduce Here's a simple reproducer, after a few seconds (when opendal executors start increasing) it blocks. ```java package org.example; import org.apache.opendal.AsyncExecutor; import java.util.concurrent.atomic.AtomicLong; public class Main { static final AtomicLong THREAD_COUNTER = new AtomicLong(); static final AtomicLong OPENDAL_COUNTER = new AtomicLong(); public static void main(String[] args) { startOperatorSpawners(); startThreadSpawners(); while (true) { System.err.printf("Still not blocked - threads: %d - opendal executors: %d\n", THREAD_COUNTER.get(), OPENDAL_COUNTER.get()); // Manually trigger a gc to force all threads to stop at a safepoint (unless they stay in native code) System.gc(); } } private static void startThreadSpawners() { for (int i = 0; i < 4; i++) { new Thread(() -> { while (true) { new Thread(() -> { THREAD_COUNTER.addAndGet(1); }).start(); } }).start(); } } private static void startOperatorSpawners() { for (int i = 0; i < 4; i++) { new Thread(() -> { while (true) { // We're using an AsyncExecutor because it will create a new tokio runtime everytime try (final var executor = AsyncExecutor.createTokioExecutor(4)) { OPENDAL_COUNTER.addAndGet(1); } } }).start(); } } } ``` ### Expected Behavior There shouldn't be any deadlocks, the opendal workers should gracefully shutdown and new thread should be able to initialize and work as expected. ### Additional Context Under the hood, opendal uses [jni-rs](https://github.com/jni-rs/jni-rs) to bind the java library to its rust library. When starting a new tokio runtime in [make_tokio_executor](https://github.com/apache/opendal/blob/6ee74a52512ad0fd6e3f2ed3a79932fd23fda774/bindings/java/src/executor.rs#L118), they are automatically attached to the JVM: ```java .on_thread_start(move || { ENV.with(|cell| { let mut env = vm .attach_current_thread_as_daemon() .expect("attach thread must succeed"); set_current_thread_name(&mut env).expect("current thread name has been set above"); *cell.borrow_mut() = Some(env.get_raw()); }) }) ``` Note that there are no `on_thread_stop` to detach the thread from the JVM. OpenDAL rely on `jni-rs` to automatically do it when its `THREAD_ATTACH_GUARD` (which is a `thread_local!`) is dropped (when the TLS destructors are called). There is an issue about this in https://github.com/jni-rs/jni-rs/issues/701. A simple workaround would be to manually detach the current thread from the JVM in `on_thread_stop` as it doesn't hold the Windows loader lock: ```java .on_thread_stop(move || { ENV.take(); unsafe { // SAFETY: the thread is stopping, no more JNI calls are expected stop_vm.detach_current_thread(); } }) ``` ### Are you willing to submit a PR to fix this bug? - [x] Yes, I would like to submit a PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
