jordepic opened a new pull request, #2441: URL: https://github.com/apache/iceberg-rust/pull/2441
## Which issue does this PR close? - Closes #2440. ## What changes are included in this PR? Add an opendal-hdfs-native cargo feature plus OpenDalStorageFactory::Hdfs and OpenDalStorage::Hdfs variants in iceberg-storage-opendal, with a corresponding HdfsConfig in iceberg core. The variant uses OpenDAL's services-hdfs-native, which talks the HDFS RPC protocol directly in pure Rust (no JNI / libhdfs). The new HdfsConfig follows the existing storage configuration pattern in crates/iceberg/src/io/storage/config/ to the letter: typed config with TryFrom<&StorageConfig>, a single HDFS_DEFAULT_NAME_NODE property constant, and a TypedBuilder, mirroring gcs.rs, s3.rs, oss.rs, and azdls.rs. Wire-up in crates/storage/opendal/src/lib.rs and resolving.rs follows the same factory + variant + scheme-routing pattern used by every other OpenDAL-backed storage in that crate. HDFS topology and authentication are deliberately not modelled here. hdfs-native resolves HA name services and namenode RPC addresses from core-site.xml / hdfs-site.xml under HADOOP_CONF_DIR (or HADOOP_HOME/etc/hadoop), and Kerberos authentication is delegated to libgssapi_krb5 via the standard KRB5CCNAME / KRB5_CONFIG env. These deployment requirements are spelled out in the module docs on HdfsConfig (crates/iceberg/src/io/storage/config/hdfs.rs) and the opendal backend module (crates/storage/opendal/src/hdfs.rs). The opendal-hdfs-native feature is added to opendal-all but not to the default feature set; it requires libgssapi_krb5 installed at the OS level for the runtime dlopen (brew install krb5 on macOS, apt install libgssapi-krb5-2 on Debian). OpenDalStorage::Hdfs holds a per-name-node operator cache (Arc<RwLock<HashMap<String, Operator>>> keyed by "hdfs://<authority>"), so multi-nameservice deployments share a single storage instance with per-NS operator caching. OpenDAL's RetryLayer is applied uniformly to every operator at the call site; there is no bespoke retry logic. The integration-test docker-compose fixture mirrors apache/opendal's own HDFS fixture (fixtures/hdfs/docker-compose-hdfs-cluster.yml): the same bde2020 hadoop-namenode and hadoop-datanode images, both running in host network mode. Host networking is required because hdfs-native 0.13.5 connects to the DataNode by IP from DatanodeIdProto.ip_addr; on a docker bridge the DN would register with an unroutable bridge IP. Host networking works on Linux CI runners but has known issues on macOS / Windows Docker Desktop, so the tests are gated behind ICEBERG_TEST_HDFS_INTEGRATION=1, which is set in the existing Linux-only tests job in .github/workflows/ci.yml. ## Are these changes tested? Yes, we have unit tests to ensure appropriate caching per name service as well as integration tests at parity with other storage backends. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
