mbutrovich commented on code in PR #3802:
URL: https://github.com/apache/datafusion-comet/pull/3802#discussion_r3001172217
##########
native/core/src/parquet/parquet_support.rs:
##########
@@ -444,6 +447,32 @@ fn create_hdfs_object_store(
})
}
+type ObjectStoreCache = RwLock<HashMap<(String, u64), Arc<dyn ObjectStore>>>;
+
+/// Global cache of object stores keyed by (url_key, config_hash).
+///
+/// This avoids creating a new S3 client (and thus a new HTTP connection pool
and DNS
+/// resolution) for every Parquet file read. Without this cache, each call to
+/// `initRecordBatchReader` from the JVM creates a fresh `reqwest::Client`,
leading to
+/// excessive DNS queries that can overwhelm DNS resolvers (e.g., Route53
Resolver limits
+/// in EKS environments).
+fn object_store_cache() -> &'static ObjectStoreCache {
+ static CACHE: OnceLock<ObjectStoreCache> = OnceLock::new();
Review Comment:
Anything `static` (global singleton) should be documented why `static` is
the reasonable life cycle. In particular, I wonder about the unbounded size of
a `static` cache, invalidation scenarios (what if a job runs long enough and
needs new credentials passed into the object_store?), and why there was no
other location with a reasonable life cycle to own this cache.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]