andygrove opened a new pull request, #3802: URL: https://github.com/apache/datafusion-comet/pull/3802
## Which issue does this PR close? <!--Closes #.--> ## Rationale for this change When reading Parquet files from S3, each call to `initRecordBatchReader` creates a new `SessionContext`, `RuntimeEnv`, and S3 `ObjectStore` client. Each `ObjectStore` instance creates a new `reqwest` HTTP client with its own connection pool, requiring fresh DNS resolution for every file opened. In TPC-DS-scale workloads with thousands of Parquet files, this generates excessive DNS queries that can overwhelm DNS resolvers (e.g., Route53 Resolver limits in EKS environments), causing `UnknownHostException` errors, intermittent S3 connectivity failures, and job failures under high concurrency. Native Spark does not have this problem because Hadoop's `S3AFileSystem` maintains singleton instances per filesystem URI (via `FileSystem.get()` cache) and benefits from JVM-level DNS caching. Comet's native Rust layer bypasses all of this. Additionally, when no region is explicitly configured, each file read triggers a `HeadBucket` API call with a throwaway `reqwest::Client` to auto-detect the bucket region, doubling the DNS query volume. ## What changes are included in this PR? Two caches are introduced in the native Rust layer: 1. **ObjectStore cache** (`parquet_support.rs`): A global cache keyed by `(URL prefix, config hash)` in `prepare_object_store_with_configs()`. Subsequent reads from the same bucket with the same configuration reuse the existing `ObjectStore` instance, enabling HTTP connection pooling via `reqwest` and eliminating redundant DNS lookups. 2. **Region resolution cache** (`s3.rs`): A per-bucket cache in `resolve_bucket_region()` that ensures the `HeadBucket` API call (used to auto-detect bucket region when not explicitly configured) happens only once per bucket per JVM lifetime. Together these changes reduce DNS query volume from O(num_files) to O(num_distinct_buckets) per executor. **Workaround for users on current versions:** Set `fs.s3a.endpoint.region` explicitly to avoid the `HeadBucket` auto-detection entirely. ## How are these changes tested? - Existing unit tests in `parquet_support.rs` continue to pass (they test local filesystem paths which exercise the same `prepare_object_store_with_configs` code path). - Clippy passes with no warnings. - The caching is transparent — it returns the same `ObjectStore` instances that would have been created without caching, so existing S3 integration tests cover correctness. - Manual validation recommended in an EKS environment with S3-backed TPC-DS workloads to confirm DNS query volume reduction. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
