This is an automated email from the ASF dual-hosted git repository. xuanwo pushed a commit to branch main in repository https://gitbox.apache.org/repos/asf/opendal.git
The following commit(s) were added to refs/heads/main by this push:
     new 98d869827 docs: Add performance guide http optimization (#6020)
98d869827 is described below

commit 98d869827159344a68f6ad7ca9cbc61c03bce5f0
Author: Xuanwo <git...@xuanwo.io>
AuthorDate: Mon Apr 14 17:29:58 2025 +0800

    docs: Add performance guide http optimization (#6020)

    Signed-off-by: Xuanwo <git...@xuanwo.io>
---
 core/src/docs/performance/http_optimization.md | 124 +++++++++++++++++++++++++
 core/src/docs/performance/mod.rs               |   3 +
 2 files changed, 127 insertions(+)

diff --git a/core/src/docs/performance/http_optimization.md b/core/src/docs/performance/http_optimization.md
new file mode 100644
index 000000000..ece626423
--- /dev/null
+++ b/core/src/docs/performance/http_optimization.md
@@ -0,0 +1,124 @@
+# HTTP Optimization
+
+All OpenDAL HTTP-based storage services use the same [HttpClient][crate::raw::HttpClient] abstraction. This design offers users a unified interface for configuring HTTP clients. The default HTTP client is [reqwest](https://crates.io/crates/reqwest), a popular and widely used HTTP client library in Rust.
+
+Many of the services supported by OpenDAL are HTTP-based. This guide aims to provide optimization tips for using HTTP-based storage services. While these tips are also applicable to other HTTP clients, the configuration methods may vary.
+
+Please note that the following optimizations are based on experience and may not be suitable for all scenarios. The most effective way to determine the optimal configuration is to test it in your specific environment.
+
+## HTTP/1.1
+
+According to benchmarks from OpenDAL users, `HTTP/1.1` is generally faster than `HTTP/2` for large-scale download and upload operations.
+
+`reqwest` tends to maintain only a single TCP connection for `HTTP/2`, relying on its built-in multiplexing capabilities.
While this works well for small files, such as web page downloads, the design is not ideal for handling large files or OLAP workloads that scan large volumes of files.
+
+When `HTTP/2` is disabled, `reqwest` falls back to `HTTP/1.1` and utilizes its default connection pool. This approach is better suited for large files, as it allows multiple TCP connections to be opened and used concurrently, significantly improving performance for large file downloads and uploads.
+
+If your workloads involve large files or require high throughput, and are not sensitive to latency, consider disabling `HTTP/2` in your configuration.
+
+```rust
+let client = reqwest::ClientBuilder::new()
+    // Disable http2 for better performance.
+    .http1_only()
+    .build()
+    .expect("http client must be created");
+
+// Update the http client in the operator.
+let op = op.update_http_client(|_| HttpClient::with(client));
+```
+
+## DNS Caching
+
+`reqwest` uses the DNS resolver provided by Rust's standard library by default, which is backed by the `getaddrinfo` library call under the hood. `getaddrinfo` does not cache results by default, meaning that a DNS lookup is performed every time a new connection is established.
+
+Under high-throughput workloads, this can cause a significant performance degradation, as each new connection incurs the overhead of a DNS lookup. It can also negatively affect the resolver, potentially overwhelming it with the volume of requests. In extreme cases, the lookup volume can act as an unintentional DoS on the resolver, rendering it unresponsive.
+
+To mitigate this issue, you can enable DNS caching in `reqwest` by using the `hickory-dns` feature. This feature provides a more efficient DNS resolver that caches results.
+
+```rust
+let client = reqwest::ClientBuilder::new()
+    // Enable hickory-dns for DNS caching and async DNS resolution.
+    .hickory_dns(true)
+    .build()
+    .expect("http client must be created");
+
+// Update the http client in the operator.
+let op = op.update_http_client(|_| HttpClient::with(client));
+```
+
+The default DNS cache settings from `hickory_dns` are generally sufficient for most workloads. However, if you have specific requirements, such as sharing the same DNS cache across multiple HTTP clients or configuring the DNS cache size, you can use the `Xuanwo/reqwest-hickory-resolver` crate to set up a custom DNS resolver.
+
+```rust
+/// Global shared hickory resolver.
+static GLOBAL_HICKORY_RESOLVER: LazyLock<Arc<HickoryResolver>> = LazyLock::new(|| {
+    let mut opts = ResolverOpts::default();
+    // Only query for the IPv4 address.
+    opts.ip_strategy = LookupIpStrategy::Ipv4Only;
+    // Use a larger cache size for better performance.
+    opts.cache_size = 1024;
+    // Minimum positive TTL is set to 5 minutes.
+    opts.positive_min_ttl = Some(Duration::from_secs(300));
+    // Minimum negative TTL is set to 1 minute.
+    opts.negative_min_ttl = Some(Duration::from_secs(60));
+
+    Arc::new(
+        HickoryResolver::default()
+            // Always shuffle the DNS results for better performance.
+            .with_shuffle(true)
+            .with_options(opts),
+    )
+});
+
+let client = reqwest::ClientBuilder::new()
+    // Use our global hickory resolver instead.
+    .dns_resolver(GLOBAL_HICKORY_RESOLVER.clone())
+    .build()
+    .expect("http client must be created");
+
+// Update the http client in the operator.
+let op = op.update_http_client(|_| HttpClient::with(client));
+```
+
+`ResolverOpts` has many configurable options. For a complete list, please refer to the [hickory_resolver documentation](https://docs.rs/hickory-resolver/latest/hickory_resolver/config/struct.ResolverOpts.html).
+
+Here is a summary of the most commonly used options:
+
+- `ip_strategy`: `hickory_resolver` defaults to the `Ipv4thenIpv6` strategy, which means it first queries for the IPv4 address and then the IPv6 address. This is generally a good strategy for most workloads.
However, if you only need IPv4 addresses, you can set this option to `Ipv4Only` to avoid unnecessary DNS lookups.
+- `cache_size`: This option controls the size of the DNS cache. A larger cache size can improve performance, but it may also increase memory usage. The default value is `32`.
+- `positive_min_ttl` and `negative_min_ttl`: These options control the minimum TTL for positive and negative DNS responses. A longer TTL can improve performance, but it may also increase the risk of stale DNS records. The default value is `None`. Some misbehaving DNS servers may return a TTL of `0` even when the record is valid. In this case, you can set a longer minimum TTL to avoid unnecessary DNS lookups.
+
+In addition to the options mentioned above, `Xuanwo/reqwest-hickory-resolver` also offers a `shuffle` option. This setting determines whether the DNS results are shuffled before being returned. Shuffling can enhance performance by distributing the load more evenly across multiple IP addresses.
+
+## Timeout
+
+`reqwest` does not set a default timeout for HTTP requests. This means that if a request hangs or takes too long to complete, it can block the entire process, leading to performance degradation or even application crashes.
+
+It's recommended to set a connect timeout for HTTP requests to prevent this issue.
+
+```rust
+let client = reqwest::ClientBuilder::new()
+    // Set a connect timeout of 5 seconds.
+    .connect_timeout(Duration::from_secs(5))
+    .build()
+    .expect("http client must be created");
+
+// Update the http client in the operator.
+let op = op.update_http_client(|_| HttpClient::with(client));
+```
+
+It's also recommended to use OpenDAL's [`TimeoutLayer`][crate::layers::TimeoutLayer] to prevent slow requests from hanging forever. This layer will automatically cancel the request if it takes too long to complete.
+
+```rust
+let op = op.layer(TimeoutLayer::new());
+```
+
+## Connection Pool
+
+`reqwest` uses a connection pool to manage HTTP connections.
This allows multiple requests to share the same connection, reducing the overhead of establishing new connections for each request.
+
+By default, the connection pool is unlimited, allowing `reqwest` to open as many connections as needed. The default keep-alive timeout is 90 seconds, meaning any connection idle for longer than that will be closed.
+
+You can tune those settings via:
+
+- [pool_idle_timeout](https://docs.rs/reqwest/0.12.15/reqwest/struct.ClientBuilder.html#method.pool_idle_timeout): Set an optional timeout for idle sockets being kept-alive.
+- [pool_max_idle_per_host](https://docs.rs/reqwest/0.12.15/reqwest/struct.ClientBuilder.html#method.pool_max_idle_per_host): Sets the maximum idle connection per host allowed in the pool.
diff --git a/core/src/docs/performance/mod.rs b/core/src/docs/performance/mod.rs
index 25f405df3..a817b8479 100644
--- a/core/src/docs/performance/mod.rs
+++ b/core/src/docs/performance/mod.rs
@@ -27,3 +27,6 @@
 #[doc = include_str!("concurrent_write.md")]
 pub mod concurrent_write {}
+
+#[doc = include_str!("http_optimization.md")]
+pub mod http_optimization {}
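
As a rough illustration of the `positive_min_ttl` behaviour described in the DNS Caching section of the added guide, the sketch below models how a minimum TTL floor changes the lifetime of a cached record. `effective_ttl` is a hypothetical helper written for this note; it is not part of `hickory_resolver`, `reqwest`, or OpenDAL, and it assumes the floor simply replaces any smaller TTL reported by the server.

```rust
use std::time::Duration;

/// Hypothetical helper (not a hickory_resolver API): the TTL a cache would
/// apply to a record, given an optional floor in the spirit of
/// `positive_min_ttl`.
fn effective_ttl(record_ttl: Duration, min_ttl: Option<Duration>) -> Duration {
    match min_ttl {
        // With a floor configured, never cache for less than the floor.
        Some(floor) => record_ttl.max(floor),
        // Without a floor, trust the TTL reported by the DNS server.
        None => record_ttl,
    }
}
```

Under this model, with a 300-second floor, a record that arrives with a TTL of `0` would stay cached for 300 seconds, while a record with a one-hour TTL keeps its own value. This is why a minimum TTL may help when a DNS server reports `0` for records that are actually valid.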