jordepic opened a new pull request, #2441:
URL: https://github.com/apache/iceberg-rust/pull/2441

   ## Which issue does this PR close?
   
   - Closes #2440.
   
   ## What changes are included in this PR?
   
   Add an opendal-hdfs-native cargo feature plus
   OpenDalStorageFactory::Hdfs and OpenDalStorage::Hdfs variants in 
iceberg-storage-opendal, with a corresponding HdfsConfig in iceberg core. The 
variant uses OpenDAL's services-hdfs-native, which talks the HDFS RPC protocol 
directly in pure Rust (no JNI / libhdfs).
   
   The new HdfsConfig follows the existing storage configuration pattern in 
crates/iceberg/src/io/storage/config/ to the letter: typed config with 
TryFrom<&StorageConfig>, a single HDFS_DEFAULT_NAME_NODE property constant, and 
a TypedBuilder, mirroring gcs.rs, s3.rs, oss.rs, and azdls.rs. Wire-up in 
crates/storage/opendal/src/lib.rs and resolving.rs follows the same factory + 
variant + scheme-routing pattern used by every other OpenDAL-backed storage in 
that crate.
   
   HDFS topology and authentication are deliberately not modelled here. 
hdfs-native resolves HA name services and namenode RPC addresses from 
core-site.xml / hdfs-site.xml under HADOOP_CONF_DIR (or 
HADOOP_HOME/etc/hadoop), and Kerberos authentication is delegated to 
libgssapi_krb5 via the standard KRB5CCNAME / KRB5_CONFIG env. These deployment 
requirements are spelled out in the module docs on HdfsConfig 
(crates/iceberg/src/io/storage/config/hdfs.rs) and the opendal backend module 
(crates/storage/opendal/src/hdfs.rs). The opendal-hdfs-native feature is added 
to opendal-all but not to the default feature set; it requires libgssapi_krb5 
installed at the OS level for the runtime dlopen (brew install krb5 on macOS, 
apt install libgssapi-krb5-2 on Debian).
   
   OpenDalStorage::Hdfs holds a per-name-node operator cache 
(Arc<RwLock<HashMap<String, Operator>>> keyed by
   "hdfs://<authority>"), so multi-nameservice deployments share a single 
storage instance with per-NS operator caching. OpenDAL's RetryLayer is applied 
uniformly to every operator at the call site; there is no bespoke retry logic.
   
   The integration-test docker-compose fixture mirrors apache/opendal's own 
HDFS fixture (fixtures/hdfs/docker-compose-hdfs-cluster.yml): the same bde2020 
hadoop-namenode and hadoop-datanode images, both running in host network mode. 
Host networking is required because hdfs-native 0.13.5 connects to the DataNode 
by IP from DatanodeIdProto.ip_addr; on a docker bridge the DN would register 
with an unroutable bridge IP. Host networking works on Linux CI runners but has 
known issues on macOS / Windows Docker Desktop, so the tests are gated behind 
ICEBERG_TEST_HDFS_INTEGRATION=1, which is set in the existing Linux-only tests 
job in
   .github/workflows/ci.yml.
   
   ## Are these changes tested?
   
   Yes, we have unit tests to ensure appropriate caching per name service as 
well as integration tests at parity with other storage backends.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to