comphead commented on PR #1992:
URL: https://github.com/apache/datafusion-comet/pull/1992#issuecomment-3053067000

   > I took a look at the `datafusion-comet-objectstore-hdfs` module in Comet and found that it largely overlaps with the Hadoop FileSystem bridge we are building here. A better approach is to reuse `datafusion-comet-objectstore-hdfs` but find a way to pass additional Hadoop configurations to it. Users may configure credentials for accessing the storage in the Spark configuration, so passing them through correctly when constructing the `ObjectStore` is necessary.
   > 
   > `datafusion-comet-objectstore-hdfs` works as follows:
   > 
   > ```
   >                                                FFI            JNI
   > datafusion-comet-objectstore-hdfs --> fs-hdfs -----> libhdfs -----> Hadoop File System (JVM)
   > ```
   > 
   > `libhdfs` and `fs-hdfs` should be able to support all Hadoop File System implementations, not just HDFS. The current problem is that `fs-hdfs` does not provide a way to instantiate an `HdfsFs` instance using custom Hadoop configurations. `libhdfs` does provide `hdfsBuilderConfSetStr`, so we need to open up new APIs in `fs-hdfs` to make use of it.
   > 
   > BTW, is there any concern about enabling HDFS support by default and switching the default fs-hdfs dependency to `fs-hdfs3`?
   > 
   > https://github.com/apache/datafusion-comet/blob/d885f4a5fdd4a9f249523777e8e590f3eee0e2f7/native/hdfs/Cargo.toml#L34-L37
   
   Thanks @Kontinuation, I was about to create a PR to enable HDFS support by default. cc @kazuyukitanimura
   One blocker for me was that HDFS is less popular now and probably doesn't have that many consumers to justify enabling it by default. But we probably should do it; hopefully the binary wouldn't get much bigger in this case.
   
   For `fs-hdfs3`, let me check on that.
   
   Regarding configuring the Hadoop client from the Rust side, I used the following command line:
   ```
   LIBHDFS3_CONF=/Users/ovoievodin/tmp/hadoop-3.2.4/etc/hadoop/ \
     JAVA_HOME="/opt/homebrew/opt/openjdk@11" \
     RUST_BACKTRACE=1 \
     RUSTFLAGS="-L /opt/homebrew/opt/openjdk@11/libexec/openjdk.jdk/Contents/Home/lib/server" \
     cargo test --lib tests::test_read_hdfs -- --nocapture
   ```
   
   So the Rust side was able to pick up the Hadoop client configuration.
   If running from Comet, the Hadoop configuration can be passed through Spark by putting `hdfs-site.xml` on the classpath or by using `spark.hadoop.*` params, as in https://datafusion.apache.org/comet/user-guide/datasources.html#using-experimental-native-datafusion-reader
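   As a rough illustration (this is not Comet's actual code, and `extract_hadoop_conf` is a hypothetical helper), the native side could collect those `spark.hadoop.*` entries and strip the prefix to recover plain Hadoop keys before constructing the `ObjectStore`:
   
   ```
   use std::collections::HashMap;
   
   /// Hypothetical helper: keep only the `spark.hadoop.*` entries of the
   /// Spark configuration and strip the prefix, yielding plain Hadoop keys
   /// such as `fs.defaultFS`.
   fn extract_hadoop_conf(spark_conf: &HashMap<String, String>) -> HashMap<String, String> {
       spark_conf
           .iter()
           .filter_map(|(k, v)| {
               k.strip_prefix("spark.hadoop.")
                   .map(|hadoop_key| (hadoop_key.to_string(), v.clone()))
           })
           .collect()
   }
   
   fn main() {
       let spark_conf = HashMap::from([
           ("spark.hadoop.fs.defaultFS".to_string(), "hdfs://namenode:9000".to_string()),
           ("spark.app.name".to_string(), "comet-app".to_string()),
       ]);
       let hadoop_conf = extract_hadoop_conf(&spark_conf);
       // Only the `spark.hadoop.*` entry survives, with the prefix removed.
       assert_eq!(hadoop_conf.get("fs.defaultFS").map(String::as_str), Some("hdfs://namenode:9000"));
       assert!(!hadoop_conf.contains_key("spark.app.name"));
   }
   ```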
   
   But if it could be improved, feel free to extend the `hdfs` crate so that users can instantiate `HdfsFs` with custom configurations correctly; a rough sketch of what such an API could look like follows.
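   This is a minimal sketch only: the Rust function `connect_with_conf` is hypothetical and does not exist in `fs-hdfs` today, while the `extern "C"` declarations mirror the real libhdfs builder functions (`hdfsNewBuilder`, `hdfsBuilderSetNameNode`, `hdfsBuilderConfSetStr`, `hdfsBuilderConnect`):
   
   ```
   use std::collections::HashMap;
   use std::ffi::CString;
   use std::os::raw::{c_char, c_int, c_void};
   
   // Opaque handles mirroring libhdfs; linking against libhdfs is required.
   #[allow(non_camel_case_types)]
   #[repr(C)]
   struct hdfsBuilder {
       _private: [u8; 0],
   }
   #[allow(non_camel_case_types)]
   type hdfsFS = *mut c_void;
   
   extern "C" {
       fn hdfsNewBuilder() -> *mut hdfsBuilder;
       fn hdfsBuilderSetNameNode(bld: *mut hdfsBuilder, nn: *const c_char);
       fn hdfsBuilderConfSetStr(bld: *mut hdfsBuilder, key: *const c_char, val: *const c_char) -> c_int;
       fn hdfsBuilderConnect(bld: *mut hdfsBuilder) -> hdfsFS;
   }
   
   /// Hypothetical new API: connect to a Hadoop-compatible file system,
   /// applying extra Hadoop config pairs (e.g. credentials coming from
   /// `spark.hadoop.*`) before the connection is established.
   pub fn connect_with_conf(namenode: &str, conf: &HashMap<String, String>) -> Result<hdfsFS, String> {
       let nn = CString::new(namenode).map_err(|e| e.to_string())?;
       // libhdfs stores the raw key/value pointers in the builder and reads
       // them at connect time, so these CStrings must stay alive until
       // hdfsBuilderConnect has returned.
       let mut keepalive: Vec<CString> = Vec::with_capacity(conf.len() * 2);
       unsafe {
           let bld = hdfsNewBuilder();
           hdfsBuilderSetNameNode(bld, nn.as_ptr());
           for (k, v) in conf {
               let key = CString::new(k.as_str()).map_err(|e| e.to_string())?;
               let val = CString::new(v.as_str()).map_err(|e| e.to_string())?;
               if hdfsBuilderConfSetStr(bld, key.as_ptr(), val.as_ptr()) != 0 {
                   return Err(format!("failed to set Hadoop config key {k}"));
               }
               keepalive.push(key);
               keepalive.push(val);
           }
           // hdfsBuilderConnect also frees the builder.
           let fs = hdfsBuilderConnect(bld);
           if fs.is_null() {
               Err("hdfsBuilderConnect failed".to_string())
           } else {
               Ok(fs)
           }
       }
   }
   ```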
   
   

