comphead commented on PR #1992: URL: https://github.com/apache/datafusion-comet/pull/1992#issuecomment-3053067000

> I took a look at the `datafusion-comet-objectstore-hdfs` module in Comet and found that it largely overlaps with the Hadoop FileSystem bridge we are building here. A better approach is to reuse `datafusion-comet-objectstore-hdfs` but find a way to pass additional Hadoop configurations to it. Users may configure credentials for accessing the storage in the Spark configuration, so passing them correctly when constructing the `ObjectStore` is necessary.
>
> `datafusion-comet-objectstore-hdfs` works as follows:
>
> ```
>                                           FFI             JNI
> datafusion-comet-objectstore-hdfs --> fs-hdfs -----> libhdfs -----> Hadoop File System (JVM)
> ```
>
> `libhdfs` and `fs-hdfs` should be able to support all Hadoop File System implementations, not just HDFS. The current problem is that `fs-hdfs` does not provide a way to instantiate an `HdfsFs` instance using custom Hadoop configurations. `libhdfs` does provide `hdfsBuilderConfSetStr`, so we need to open up new APIs in `fs-hdfs` to make use of it.
>
> BTW, is there any concern enabling hdfs support by default and switching the default fs-hdfs dependency to `fs-hdfs3`?
>
> https://github.com/apache/datafusion-comet/blob/d885f4a5fdd4a9f249523777e8e590f3eee0e2f7/native/hdfs/Cargo.toml#L34-L37

Thanks @Kontinuation, I was about to create a PR to enable HDFS support by default. cc @kazuyukitanimura

One stopper for me was that HDFS is less popular now and probably doesn't have that many consumers to justify enabling it by default. But we probably should do it; hopefully the binary won't be much bigger in this case. For `fs-hdfs3`, let me check that.

Regarding configuring the Hadoop client from the Rust side, I used the command line

```
// LIBHDFS3_CONF=/Users/ovoievodin/tmp/hadoop-3.2.4/etc/hadoop/
JAVA_HOME="/opt/homebrew/opt/openjdk@11" RUST_BACKTRACE=1 RUSTFLAGS="-L /opt/homebrew/opt/openjdk@11/libexec/openjdk.jdk///Contents/Home/lib/server" cargo test --lib tests::test_read_hdfs -- --nocapture
```

so Rust was able to pick up the Hadoop client configuration. If running from Comet, the Hadoop configuration can be passed through Spark by putting `hdfs-site.xml` on the classpath or by using `spark.hadoop.*` params, as in https://datafusion.apache.org/comet/user-guide/datasources.html#using-experimental-native-datafusion-reader

But if it could be improved, feel free to extend the `hdfs` crate so the user can instantiate `HdfsFs` correctly.
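To illustrate what such an extension might look like, here is a minimal sketch, not an existing `fs-hdfs` API: a hypothetical `new_with_config` helper that forwards caller-supplied key/value pairs (for example, values collected from `spark.hadoop.*` properties) to libhdfs via `hdfsBuilderConfSetStr` before connecting. The `extern "C"` declarations follow the libhdfs C header; the wrapper function name and signature are made up for illustration, and the snippet only links if libhdfs is available.

```rust
#![allow(non_camel_case_types)]
// Hedged sketch only: shows how fs-hdfs could expose custom Hadoop
// configuration through libhdfs's builder API. `new_with_config` is a
// hypothetical name, not part of the current fs-hdfs crate.

use std::ffi::CString;
use std::os::raw::{c_char, c_int, c_void};

// Opaque libhdfs handles as declared in hdfs.h.
#[repr(C)]
pub struct hdfsBuilder {
    _private: [u8; 0],
}
pub type hdfsFS = *mut c_void;

extern "C" {
    fn hdfsNewBuilder() -> *mut hdfsBuilder;
    fn hdfsBuilderSetNameNode(bld: *mut hdfsBuilder, nn: *const c_char);
    fn hdfsBuilderConfSetStr(bld: *mut hdfsBuilder, key: *const c_char, val: *const c_char) -> c_int;
    fn hdfsBuilderConnect(bld: *mut hdfsBuilder) -> hdfsFS;
}

/// Hypothetical wrapper: connect to a Hadoop-compatible file system,
/// applying caller-supplied configuration entries before connecting.
pub fn new_with_config(namenode: &str, conf: &[(&str, &str)]) -> Result<hdfsFS, String> {
    unsafe {
        let builder = hdfsNewBuilder();
        if builder.is_null() {
            return Err("hdfsNewBuilder returned null".to_string());
        }

        let nn = CString::new(namenode).map_err(|e| e.to_string())?;
        hdfsBuilderSetNameNode(builder, nn.as_ptr());

        // Forward each Hadoop configuration entry to libhdfs.
        for (key, value) in conf {
            let k = CString::new(*key).map_err(|e| e.to_string())?;
            let v = CString::new(*value).map_err(|e| e.to_string())?;
            if hdfsBuilderConfSetStr(builder, k.as_ptr(), v.as_ptr()) != 0 {
                return Err(format!("failed to set config key {key}"));
            }
        }

        let fs = hdfsBuilderConnect(builder);
        if fs.is_null() {
            Err("hdfsBuilderConnect failed".to_string())
        } else {
            Ok(fs)
        }
    }
}
```

With something along these lines in `fs-hdfs`, Comet could collect the relevant `spark.hadoop.*` properties on the JVM side and pass them down when constructing the `ObjectStore`, instead of relying solely on `hdfs-site.xml` being visible to the native client.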
> I took a look at the `datafusion-comet-objectstore-hdfs` module Comet and found that it largely overlaps with the Hadoop FileSystem bridge we are building here. A better approach is to reuse `datafusion-comet-objectstore-hdfs` but find a way to pass additional Hadoop configurations to it. Users may configuring credentials for accessing the storage in Spark configuration so passing them correctly when constructing the `ObjectStore` is necessary. > > `datafusion-comet-objectstore-hdfs` works as follows: > > ``` > FFI JNI > datafusion-comet-objectstore-hdfs --> fs-hdfs -----> libhdfs -----> Hadoop File System (JVM) > ``` > > `libhdfs` and `fs-hdfs` should be able to support all Hadoop File System implementations, not just HDFS. The current problem is that `fs-hdfs` does not provide a way to instantiate an `HdfsFs` instance using custom Hadoop configurations. `libhdfs` does provide `hdfsBuilderConfSetStr`, so we need to open up new APIs in `fs-hdfs` to make use of it. > > BTW, is there any concern enabling hdfs support by default and switching the default fs-hdfs dependency to `fs-hdfs3`? > > https://github.com/apache/datafusion-comet/blob/d885f4a5fdd4a9f249523777e8e590f3eee0e2f7/native/hdfs/Cargo.toml#L34-L37 Thanks @Kontinuation I was about to create a PR to enable hdfs support by default @kazuyukitanimura cc One stopper for me was that HDFS is less popular now and probably has not that many consumers to be enabled by default. But we probably should do it, hopefully the binary would be much bigger in this case. for `fs-hdfs3` let me check that. Reg to configuring the Hadoop Client from Rust side I used the command line ``` // LIBHDFS3_CONF=/Users/ovoievodin/tmp/hadoop-3.2.4/etc/hadoop/ JAVA_HOME="/opt/homebrew/opt/openjdk@11" RUST_BACKTRACE=1 RUSTFLAGS="-L /opt/homebrew/opt/openjdk@11/libexec/openjdk.jdk///Contents/Home/lib/server" cargo test --lib tests::test_read_hdfs -- --nocapture ``` So the Rust was able to get the Hadoop client configurations. If running from Comet, hadoop configuration can passed through Spark passing `hdfs-site.xml` to classpath or using `spark.hadoop.*` params like in https://datafusion.apache.org/comet/user-guide/datasources.html#using-experimental-native-datafusion-reader But if it could be improved feel free to extend the `hdfs` crate so the user can instantiate `HdfsFs` correctly. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org