xanderbailey commented on code in PR #2441:
URL: https://github.com/apache/iceberg-rust/pull/2441#discussion_r3276460714
##########
crates/storage/opendal/src/lib.rs:
##########
@@ -218,6 +234,20 @@ pub enum OpenDalStorage {
/// Azure DLS configuration.
config: Arc<AzdlsConfig>,
},
+ /// HDFS storage variant (via OpenDAL `services-hdfs-native`).
+ ///
+ /// Accepts paths of the form `hdfs://<authority>/<rel>` (or
+ /// `hdfs:///<rel>` for authority-less paths, which defer to
+ /// `fs.defaultFS` from the `hdfs-native`-loaded Hadoop config). The
+ /// authority - or `None` when absent - keys a per-name-node `Operator`
+ /// cache, so distinct name nodes coexist in a single storage instance.
+ #[cfg(feature = "opendal-hdfs-native")]
+ Hdfs {
+ /// Operator cache. `Some("hdfs://<authority>")` for paths with an
+ /// authority; `None` for authority-less paths (`fs.defaultFS`).
+ #[serde(skip, default)]
+ operators: Arc<RwLock<HashMap<Option<String>, Operator>>>,
Review Comment:
[Java](https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L3876-L3880)
caches on (scheme, authority, UGI). Here the cache is keyed on the authority
only, so I think all principals in the same process would share one operator,
which seems incorrect to me. WDYT?
Would it also be defensive to use a moka cache here? I appreciate that this
cache should stay small, though.
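To make the suggestion concrete, here is a minimal sketch of a cache keyed on
both authority and principal. Everything in it is an assumption for
illustration: `OperatorKey`, `OperatorCache`, and `FakeOperator` are
hypothetical names, `FakeOperator` stands in for `opendal::Operator`, and how
the principal would actually be obtained is left open.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical cache key, loosely mirroring Hadoop's (scheme, authority, UGI).
/// The scheme is always `hdfs` here, so only authority and principal vary.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct OperatorKey {
    /// `Some("hdfs://<authority>")`, or `None` for `fs.defaultFS`.
    pub authority: Option<String>,
    /// Assumed caller identity (e.g. a Kerberos principal), if any.
    pub principal: Option<String>,
}

/// Stand-in for `opendal::Operator`, so the sketch compiles without OpenDAL.
#[derive(Debug)]
pub struct FakeOperator {
    pub key: OperatorKey,
}

/// Hypothetical per-(authority, principal) operator cache.
#[derive(Default)]
pub struct OperatorCache {
    operators: RwLock<HashMap<OperatorKey, Arc<FakeOperator>>>,
}

impl OperatorCache {
    /// Returns the cached operator for `key`, creating it on first use.
    pub fn get_or_create(&self, key: OperatorKey) -> Arc<FakeOperator> {
        // Fast path: a shared read lock is enough when the entry exists.
        if let Some(op) = self.operators.read().unwrap().get(&key) {
            return Arc::clone(op);
        }
        // Slow path: the entry API re-checks under the write lock, so two
        // racing callers still end up sharing a single operator.
        let mut map = self.operators.write().unwrap();
        let op: &Arc<FakeOperator> = map
            .entry(key.clone())
            .or_insert_with(|| Arc::new(FakeOperator { key }));
        Arc::clone(op)
    }

    /// Number of cached operators.
    pub fn len(&self) -> usize {
        self.operators.read().unwrap().len()
    }
}
```

With a key like this, two principals talking to the same name node get
separate operators, while repeated lookups for the same pair reuse one.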
##########
dev/docker-compose.yaml:
##########
@@ -147,6 +147,50 @@ services:
timeout: 5s
retries: 5
+ # =============================================================================
+ # HDFS - single-node NameNode + DataNode for HDFS tests
+ # =============================================================================
+ # Mirrors apache/opendal's fixtures/hdfs/docker-compose-hdfs-cluster.yml:
+ # same bde2020 images, host networking on both services. Host networking
+ # is required because hdfs-native 0.13.5 connects to the DataNode by IP
+ # from `DatanodeIdProto.ip_addr` (not by hostname). On a docker bridge
+ # the DN would register with an unroutable bridge IP; host networking
+ # lets it bind directly on the host network namespace so the registered
+ # address is host-reachable.
+ #
+ # This works on Linux CI runners. On macOS / Windows Docker Desktop
+ # host networking has known issues (e.g. unresolvable VM hostname), so
+ # the HDFS integration tests are `#[ignore]`d; CI explicitly opts them
+ # in via `cargo nextest --run-ignored=only` (see .github/workflows/ci.yml).
+ hdfs-namenode:
+ image: bde2020/hadoop-namenode:2.0.0-hadoop3.2.1-java8
+ network_mode: "host"
+ environment:
+ CLUSTER_NAME: iceberg-rust-test
+ CORE_CONF_fs_defaultFS: hdfs://localhost:8020
+ CORE_CONF_hadoop_http_staticuser_user: root
+ HDFS_CONF_dfs_permissions_enabled: false
+ HDFS_CONF_dfs_replication: 1
+ healthcheck:
+ test: ["CMD-SHELL", "hdfs dfsadmin -safemode get | grep -q OFF"]
+ interval: 5s
+ timeout: 5s
+ retries: 30
+ start_period: 30s
+
+ hdfs-datanode:
+ image: bde2020/hadoop-datanode:2.0.0-hadoop3.2.1-java8
Review Comment:
This looks like it might be an old, unmaintained
[image](https://hub.docker.com/layers/bde2020/hadoop-namenode/2.0.0-hadoop3.2.1-java8/images/sha256-51ad9293ec52083c5003ef0aaab00c3dd7d6335ddf495cc1257f97a272cab4c0).
`apache/hadoop:3.4.3` and `apache/hadoop:3.5.0` look more up to date?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]