RaphDal opened a new issue, #6869:
URL: https://github.com/apache/opendal/issues/6869

   ### Describe the bug
   
   Tokio worker threads used by OpenDAL rely on thread-local storage (TLS) in 
`jni-rs` to detach themselves from the JVM on shutdown.
   Detaching from the JVM requires calling back into the JVM, which may block 
if a GC is in progress. When this happens from a TLS destructor, the code runs 
while the Windows [loader 
lock](https://learn.microsoft.com/en-us/windows/win32/dlls/dynamic-link-library-best-practices)
 is held. At the same time, when the JVM spawns a new Java thread, it tries to 
acquire the loader lock to perform thread initialization before reaching a 
safepoint.
   
   If a GC occurs while an OpenDAL worker thread is shutting down and a new 
Java thread is being created, a deadlock can occur: the Java thread is blocked 
waiting for the loader lock, while the OpenDAL worker thread is holding the 
loader lock and waiting for the GC to complete.
   
   ### Steps to Reproduce
   
   Here's a simple reproducer, after a few seconds (when opendal executors 
start increasing) it blocks.
   ```java
   package org.example;
   
   import org.apache.opendal.AsyncExecutor;
   
   import java.util.concurrent.atomic.AtomicLong;
   
   public class Main {
       static final AtomicLong THREAD_COUNTER = new AtomicLong();
       static final AtomicLong OPENDAL_COUNTER = new AtomicLong();
   
       public static void main(String[] args) {
           startOperatorSpawners();
           startThreadSpawners();
           while (true) {
               System.err.printf("Still not blocked - threads: %d - opendal 
executors: %d\n", THREAD_COUNTER.get(), OPENDAL_COUNTER.get());
               // Manually trigger a gc to force all threads to stop at a 
safepoint (unless they stay in native code)
               System.gc();
           }
       }
   
       private static void startThreadSpawners() {
           for (int i = 0; i < 4; i++) {
               new Thread(() -> {
                   while (true) {
                       new Thread(() -> {
                           THREAD_COUNTER.addAndGet(1);
                       }).start();
                   }
               }).start();
           }
       }
   
       private static void startOperatorSpawners() {
           for (int i = 0; i < 4; i++) {
               new Thread(() -> {
                   while (true) {
                       // We're using an AsyncExecutor because it will create a 
new tokio runtime everytime
                       try (final var executor = 
AsyncExecutor.createTokioExecutor(4)) {
                           OPENDAL_COUNTER.addAndGet(1);
                       }
                   }
               }).start();
           }
       }
   }
   ```
   
   ### Expected Behavior
   
   There shouldn't be any deadlocks, the opendal workers should gracefully 
shutdown and new thread should be able to initialize and work as expected.
   
   ### Additional Context
   
   Under the hood, opendal uses [jni-rs](https://github.com/jni-rs/jni-rs) to 
bind the java library to its rust library.
   When starting a new tokio runtime in 
[make_tokio_executor](https://github.com/apache/opendal/blob/6ee74a52512ad0fd6e3f2ed3a79932fd23fda774/bindings/java/src/executor.rs#L118),
 they are automatically attached to the JVM:
   ```java
           .on_thread_start(move || {
               ENV.with(|cell| {
                   let mut env = vm
                       .attach_current_thread_as_daemon()
                       .expect("attach thread must succeed");
   
                   set_current_thread_name(&mut env).expect("current thread 
name has been set above");
   
                   *cell.borrow_mut() = Some(env.get_raw());
               })
           })
   ```
   
   Note that there are no `on_thread_stop` to detach the thread from the JVM.
   OpenDAL rely on `jni-rs` to automatically do it when its 
`THREAD_ATTACH_GUARD` (which is a `thread_local!`) is dropped (when the TLS 
destructors are called).
   
   There is an issue about this in https://github.com/jni-rs/jni-rs/issues/701.
   
   A simple workaround would be to manually detach the current thread from the 
JVM in `on_thread_stop` as it doesn't hold the Windows loader lock:
   ```java
   .on_thread_stop(move || {
       ENV.take();
       unsafe {
           // SAFETY: the thread is stopping, no more JNI calls are expected
           stop_vm.detach_current_thread();
       }
   })
   ```
   
   ### Are you willing to submit a PR to fix this bug?
   
   - [x] Yes, I would like to submit a PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to