Kontinuation commented on PR #1992:
URL: 
https://github.com/apache/datafusion-comet/pull/1992#issuecomment-3064666622

   There are some problems with the approach of using fs-hdfs (libhdfs).
   
   ### Problems
   
   #### 1. Linking against libjvm.so
   
   As we discovered before, using fs-hdfs makes `libcomet` dynamically link 
against `libjvm.so`. `libjvm.so` is only necessary when we want to call the 
[Java invocation 
API](https://docs.oracle.com/javase/8/docs/technotes/guides/jni/spec/invocation.html),
 these APIs are for creating new Java VMs or finding existing VMs, etc. These 
APIs are not needed by JNI modules, because the JNI native functions [accept a 
JNIEnv 
parameter](https://github.com/apache/datafusion-comet/blob/0.9.0/native/core/src/execution/jni_api.rs#L146-L147),
 there's no need for Java VM discovery or creation.
   
   Here are the problems with linking against libjvm.so
   
   1. **This is not a standard practice of building JNI modules**. I have 
inspected some commonly used JNI libraries including snappy-java, lz4-java, 
zstd-jni, arrow_cdata_jni, blas, etc. None of them link against libjvm.so.
   2. **This introduces additional friction for development**. When hdfs 
feature is enabled, all the test binaries will link against libjvm.so, and 
there's a high chance that the tests won't run out-of-box. These are the 
failures I got when running `cargo test --features hdfs`:
   
   Linux:
   
   ```
        **Running** unittests src/lib.rs 
(target/debug/deps/comet-b5e7808e7551d6a8)
   
   
/home/kontinuation/documents/github/datafusion-comet/native/target/debug/deps/comet-b5e7808e7551d6a8:
 error while loading shared libraries: libjvm.so: cannot open shared object 
file: No such file or directory
   
   **error****:** test failed, to rerun pass `-p datafusion-comet --lib`
   ```
   
   macOS:
   
   ```
   Running unittests src/lib.rs (target/debug/deps/comet-2a4d21387bc9c825)
   dyld[61398]: Library not loaded: @rpath/libjvm.dylib
     Referenced from: <D3FE1520-33E0-3D0F-BDDB-AF7ADB755CBE> 
/Users/bopeng/workspace/github/datafusion-comet/native/target/debug/deps/comet-2a4d21387bc9c825
     Reason: tried: 
'/Users/bopeng/workspace/github/datafusion-comet/native/target/debug/build/aws-lc-sys-93a6aa5d6483d3f7/out/libjvm.dylib'
 (no such file), 
'/Users/bopeng/workspace/github/datafusion-comet/native/target/debug/build/blake3-0740d36ff36fef54/out/libjvm.dylib'
 (no such file), 
'/Users/bopeng/workspace/github/datafusion-comet/native/target/debug/build/fs-hdfs3-ef462506593c877b/out/libjvm.dylib'
 (no such file), 
'/Users/bopeng/workspace/github/datafusion-comet/native/target/debug/build/psm-0620469b47746bcc/out/libjvm.dylib'
 (no such file), 
'/Users/bopeng/workspace/github/datafusion-comet/native/target/debug/build/ring-c54438584ee3a8bc/out/libjvm.dylib'
 (no such file), 
'/Users/bopeng/workspace/github/datafusion-comet/native/target/debug/build/zstd-sys-948618cdd9c42d49/out/libjvm.dylib'
 (no such file), 
'/Users/bopeng/workspace/github/datafusion-comet/native/target/debug/deps/libjvm.dylib'
 (no such file), 
'/Users/bopeng/workspace/github/datafusion-comet/native/target/debug/libj
 vm.dylib' (no such file), 
'/Users/bopeng/.rustup/toolchains/stable-aarch64-apple-darwin/lib/rustlib/aarch64-apple-darwin/lib/libjvm.dylib'
 (no such file), 
'/Users/bopeng/.rustup/toolchains/stable-aarch64-apple-darwin/lib/libjvm.dylib' 
(no such file), '/Users/bopeng/lib/libjvm.dylib' (no such file), 
'/usr/local/lib/libjvm.dylib' (no such file), '/usr/lib/libjvm.dylib' (no such 
file, not in dyld cache)
   error: test failed, to rerun pass `-p datafusion-comet --lib`
   ```
   
   We have to point `LD_LIBRARY_PATH` or `DYLD_LIBRARY_PATH` to the directory 
containing `libjvm` to run those tests. This is quite inconvenient.
   
   #### 2. Problem with `AttachCurrentThread` and `DetachCurrentThread`
   
   The JNIEnv retrieval and thread attachment management policy of Comet and 
libhdfs does not play well with each other.
   
   * Comet [uses 
`JavaVM::attach_current_thread`](https://github.com/apache/datafusion-comet/blob/0.9.0/native/core/src/jvm_bridge/mod.rs#L270-L280)
 to obtain a `JNIEnv` for calling Java methods in multiple places, it 
automatically calls `AttachCurrentThread` if necessary, and returns an 
AttachGuard to automatically call `DetachCurrentThread` when the JNIEnv is 
dropped.
   * libhdfs [always call 
`AttachCurrentThread`](https://github.com/datafusion-contrib/fs-hdfs/blob/hadoop-3.1.4/c_src/libhdfs/jni_helper.c#L743-L749)
 when obtaining `JNIEnv`,  and cache the `JNIEnv` handle to thread local 
variables. It will assume that `DetachCurrentThread` will never be called and 
the cached `JNIEnv` will always be valid.
   
   There might be cases where Comet calls `DetachCurrentThread` at some point 
then `libhdfs` tries to reuse the stale cached thread local JNIEnv, This may 
happen in theory, I'm not sure if this problem has been observed in practice.
   
   ### Solutions
   
   I am trying to solve these problems by adding a new `no_jvm_invocation` 
feature to `fs-hdfs` (https://github.com/datafusion-contrib/fs-hdfs/pull/26). 
When this feature is enabled, `libjvm.so` won't be linked. It adds a new API 
`set_java_vm` and requires the user to provide a handle to a `JavaVM` before 
calling other functions. This eliminates the need for discovering VMs by 
calling `JNI_GetCreatedJavaVMs`, which is an invocation API requires libjvm.
   
   The policy for managing JNIEnv is also revised:
   
   * It tries to obtain JNIEnv from JavaVM by calling `(*vm)->GetEnv`. If it 
succeeded, the thread was already attached by the caller and may be detached by 
the caller in the future, libhdfs won't cache the JNIEnv in thread local 
storage in this case.
   * If `(*vm)->GetEnv` fails, it calls `AttachCurrentThread` to obtain an 
JNIEnv, and cache the JNIEnv in thread local storage. The cached JNIEnv will be 
reused within this thread. When this thread terminates, `DetachCurrentThread` 
will be called to free the resources.
   
   This policy ensures that only the caller of `AttachCurrentThread` would call 
`DetachCurrentThread`, this is a more easy to follow principle than the mess of 
current status.
   
   ### Other Considerations
   
   * The complexity of this approach may be higher than what we expected before
   * The code quality of `libhdfs` is ... questionable. I'm not sure if it is a 
good idea to build general Hadoop FileSystem support upon it anymore.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to