crepererum opened a new issue, #7117: URL: https://github.com/apache/arrow-rs/issues/7117
# Problem Description

This is specific to AWS S3. Note that S3 only supports HTTP/1.1, so **no** connection multiplexing will happen. This means that two concurrent requests will use different TCP+TLS connections.

If you issue two or more requests to S3 at the same time (to the **same** region + bucket), all of them will use the **same** S3 IP address, even though S3 advertises multiple addresses in the DNS response (see DNS analysis below). This happens even when these requests are issued from **different** `ObjectStore` instances (see the resolver analysis for why). This behavior was confirmed by network traffic analysis with [Wireshark](https://www.wireshark.org/).

This is bad for the following reasons:

## Performance

You are far more likely to overload a single S3 server.

## Latency Racing (= Racing Reads)

In theory an `object_store` user could race two requests (esp. `GET` requests) for the same object, hoping that one of them will be faster. There's evidence that this works:

- [Performance guidelines for Amazon S3 ⇒ Retry requests for latency-sensitive applications](https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance-guidelines.html#optimizing-performance-guidelines-retry)
- [The Five-Minute Rule for the Cloud: Caching in Analytics Systems ⇒ 5.2 Latency-Sensitive Workloads ⇒ Racing reads](https://vldb.org/cidrdb/papers/2025/p4-duwe.pdf)

Note that this trades cost (via the number of requests) for improved tail latency. However, if all racing requests connect to the same S3 server, this is far less likely to work.

## Fault Tolerance

Since an S3 server might be down, concentrating all requests on one server may aggravate this issue.

## Persistence

Since the HTTP/1.1 connections are kept alive (mostly until the AWS side terminates them), this server pinning can persist long after the first requests are made.

# Technical Analysis

To understand why this is happening, we need to look at different parts of the stack.

## DNS

Resolving the S3 IP looks like this on the DNS layer (captured using [Wireshark](https://www.wireshark.org/)):

<details>

```text
Domain Name System (response)
    Transaction ID: 0x07d6
    Flags: 0x8180 Standard query response, No error
    Questions: 1
    Answer RRs: 8
    Authority RRs: 0
    Additional RRs: 1
    Queries
        s3.us-east-1.amazonaws.com: type A, class IN
            Name: s3.us-east-1.amazonaws.com
            [Name Length: 26]
            [Label Count: 4]
            Type: A (1) (Host Address)
            Class: IN (0x0001)
    Answers
        s3.us-east-1.amazonaws.com: type A, class IN, addr 16.182.97.32
            Name: s3.us-east-1.amazonaws.com
            Type: A (1) (Host Address)
            Class: IN (0x0001)
            Time to live: 5 (5 seconds)
            Data length: 4
            Address: 16.182.97.32
        s3.us-east-1.amazonaws.com: type A, class IN, addr 52.217.46.62
            Name: s3.us-east-1.amazonaws.com
            Type: A (1) (Host Address)
            Class: IN (0x0001)
            Time to live: 5 (5 seconds)
            Data length: 4
            Address: 52.217.46.62
        s3.us-east-1.amazonaws.com: type A, class IN, addr 52.217.4.118
            Name: s3.us-east-1.amazonaws.com
            Type: A (1) (Host Address)
            Class: IN (0x0001)
            Time to live: 5 (5 seconds)
            Data length: 4
            Address: 52.217.4.118
        s3.us-east-1.amazonaws.com: type A, class IN, addr 52.216.36.80
            Name: s3.us-east-1.amazonaws.com
            Type: A (1) (Host Address)
            Class: IN (0x0001)
            Time to live: 5 (5 seconds)
            Data length: 4
            Address: 52.216.36.80
        s3.us-east-1.amazonaws.com: type A, class IN, addr 52.216.38.224
            Name: s3.us-east-1.amazonaws.com
            Type: A (1) (Host Address)
            Class: IN (0x0001)
            Time to live: 5 (5 seconds)
            Data length: 4
            Address: 52.216.38.224
        s3.us-east-1.amazonaws.com: type A, class IN, addr 52.216.51.128
            Name: s3.us-east-1.amazonaws.com
            Type: A (1) (Host Address)
            Class: IN (0x0001)
            Time to live: 5 (5 seconds)
            Data length: 4
            Address: 52.216.51.128
        s3.us-east-1.amazonaws.com: type A, class IN, addr 3.5.1.11
            Name: s3.us-east-1.amazonaws.com
            Type: A (1) (Host Address)
            Class: IN (0x0001)
            Time to live: 5 (5 seconds)
            Data length: 4
            Address: 3.5.1.11
        s3.us-east-1.amazonaws.com: type A, class IN, addr 54.231.204.96
            Name: s3.us-east-1.amazonaws.com
            Type: A (1) (Host Address)
            Class: IN (0x0001)
            Time to live: 5 (5 seconds)
            Data length: 4
            Address: 54.231.204.96
    Additional records
        <Root>: type OPT
            Name: <Root>
            Type: OPT (41)
            UDP payload size: 1232
            Higher bits in extended RCODE: 0x00
            EDNS0 version: 0
            Z: 0x0000
                0... .... .... .... = DO bit: Cannot handle DNSSEC security RRs
                .000 0000 0000 0000 = Reserved: 0x0000
            Data length: 285
            Option: PADDING
    [Request In: 1627]
    [Time: 0.072296785 seconds]
```

</details>

That is 8 different IPs with a 5s TTL. If we ask again later, we'll get a different set of addresses:

<details>

```text
Domain Name System (response)
    Transaction ID: 0x817a
    Flags: 0x8180 Standard query response, No error
    Questions: 1
    Answer RRs: 8
    Authority RRs: 0
    Additional RRs: 1
    Queries
        s3.us-east-1.amazonaws.com: type A, class IN
            Name: s3.us-east-1.amazonaws.com
            [Name Length: 26]
            [Label Count: 4]
            Type: A (1) (Host Address)
            Class: IN (0x0001)
    Answers
        s3.us-east-1.amazonaws.com: type A, class IN, addr 16.182.102.192
            Name: s3.us-east-1.amazonaws.com
            Type: A (1) (Host Address)
            Class: IN (0x0001)
            Time to live: 5 (5 seconds)
            Data length: 4
            Address: 16.182.102.192
        s3.us-east-1.amazonaws.com: type A, class IN, addr 52.216.136.77
            Name: s3.us-east-1.amazonaws.com
            Type: A (1) (Host Address)
            Class: IN (0x0001)
            Time to live: 5 (5 seconds)
            Data length: 4
            Address: 52.216.136.77
        s3.us-east-1.amazonaws.com: type A, class IN, addr 52.217.204.8
            Name: s3.us-east-1.amazonaws.com
            Type: A (1) (Host Address)
            Class: IN (0x0001)
            Time to live: 5 (5 seconds)
            Data length: 4
            Address: 52.217.204.8
        s3.us-east-1.amazonaws.com: type A, class IN, addr 54.231.196.248
            Name: s3.us-east-1.amazonaws.com
            Type: A (1) (Host Address)
            Class: IN (0x0001)
            Time to live: 5 (5 seconds)
            Data length: 4
            Address: 54.231.196.248
        s3.us-east-1.amazonaws.com: type A, class IN, addr 54.231.236.208
            Name: s3.us-east-1.amazonaws.com
            Type: A (1) (Host Address)
            Class: IN (0x0001)
            Time to live: 5 (5 seconds)
            Data length: 4
            Address: 54.231.236.208
        s3.us-east-1.amazonaws.com: type A, class IN, addr 52.217.201.80
            Name: s3.us-east-1.amazonaws.com
            Type: A (1) (Host Address)
            Class: IN (0x0001)
            Time to live: 5 (5 seconds)
            Data length: 4
            Address: 52.217.201.80
        s3.us-east-1.amazonaws.com: type A, class IN, addr 3.5.31.42
            Name: s3.us-east-1.amazonaws.com
            Type: A (1) (Host Address)
            Class: IN (0x0001)
            Time to live: 5 (5 seconds)
            Data length: 4
            Address: 3.5.31.42
        s3.us-east-1.amazonaws.com: type A, class IN, addr 52.217.128.224
            Name: s3.us-east-1.amazonaws.com
            Type: A (1) (Host Address)
            Class: IN (0x0001)
            Time to live: 5 (5 seconds)
            Data length: 4
            Address: 52.217.128.224
    Additional records
        <Root>: type OPT
            Name: <Root>
            Type: OPT (41)
            UDP payload size: 1232
            Higher bits in extended RCODE: 0x00
            EDNS0 version: 0
            Z: 0x0000
                0... .... .... .... = DO bit: Cannot handle DNSSEC security RRs
                .000 0000 0000 0000 = Reserved: 0x0000
            Data length: 285
            Option: PADDING
    [Request In: 67958]
    [Time: 0.140540958 seconds]
```

</details>

I've searched through the DNS-related RFCs but couldn't find any guidance on whether the order is significant. _The internet_ ([1](https://serverfault.com/questions/906900/importance-of-the-answer-order-in-a-dns-lookup), [2](https://serverfault.com/questions/264799/what-is-the-purpose-of-a-dns-server-returning-more-than-1-a-record), [3](https://superuser.com/questions/1736076/when-i-access-a-server-that-has-multiple-dns-ip-addresses-which-one-do-i-use)) suggests that most implementations use the IPs in order (moving on to the next one after a timeout), but that the standard actually makes NO claim on that front.

## Resolver

[`reqwest`](https://github.com/seanmonstar/reqwest), the high-level HTTP client library that `object_store` uses, has an interface called [`Resolve`](https://docs.rs/reqwest/latest/reqwest/dns/trait.Resolve.html) that resolves one host name to **multiple** IP addresses.
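
As a rough illustration of one host name resolving to multiple addresses, the following stdlib-only sketch lists every address the system resolver returns, in the order returned. It does not use `reqwest` or its `Resolve` trait; `resolve_all` is a hypothetical helper name, and the default host `localhost` is only there so the sketch also runs offline.

```rust
use std::io;
use std::net::{IpAddr, ToSocketAddrs};

/// Hypothetical helper: resolve `host` through the standard library
/// (which delegates to the system resolver) and return all IP addresses
/// in the order the resolver reported them.
fn resolve_all(host: &str, port: u16) -> io::Result<Vec<IpAddr>> {
    let addrs = (host, port).to_socket_addrs()?;
    Ok(addrs.map(|sock_addr| sock_addr.ip()).collect())
}

fn main() {
    // Pass e.g. `s3.us-east-1.amazonaws.com` as the first argument
    // (needs network access); defaults to `localhost`.
    let host = std::env::args()
        .nth(1)
        .unwrap_or_else(|| "localhost".to_string());

    match resolve_all(&host, 443) {
        Ok(ips) => {
            for (index, ip) in ips.iter().enumerate() {
                println!("{index}: {ip}");
            }
        }
        Err(err) => eprintln!("resolution of {host} failed: {err}"),
    }
}
```

Running it with `s3.us-east-1.amazonaws.com` as the argument should print several addresses, though the exact set and order depend on the resolver and its cache.
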
By default [`reqwest`](https://github.com/seanmonstar/reqwest) uses [`getaddrinfo`](https://man.archlinux.org/man/getaddrinfo.3) (see [1](https://github.com/seanmonstar/reqwest/blob/37074368012ce42e61e5649c2fffcf8c8a979e1e/src/async_impl/client.rs#L319-L325), [2](https://github.com/seanmonstar/reqwest/blob/37074368012ce42e61e5649c2fffcf8c8a979e1e/src/dns/gai.rs#L8-L32), [3](https://github.com/hyperium/hyper-util/blob/46826ea75836852fac53ff075a12cba7e290946e/src/client/legacy/connect/dns.rs#L43-L47)), i.e. the system resolver. The system resolver will very likely cache the resolution for the 5s TTL (see above). In fact, I can observe that behavior using [Wireshark](https://www.wireshark.org/).

## Address Usage

Now, how are these multiple addresses used? If you search through the code, you'll eventually get [here](https://github.com/hyperium/hyper-util/blob/46826ea75836852fac53ff075a12cba7e290946e/src/client/legacy/connect/http.rs#L697-L722) and see that [`hyper-util`](https://github.com/hyperium/hyper-util) (used by [`reqwest`](https://github.com/seanmonstar/reqwest) for the wiring of low-level components) will try to connect to the IP addresses in order and will only move on to the next one if the connection cannot be established or a timeout occurs. So in the _happy path_ this will always connect to the first address.

# Solutions

I think we should keep using [`reqwest`](https://github.com/seanmonstar/reqwest) since in general it serves us well. So a natural way to change the current behavior would be to use the aforementioned [`Resolve`](https://docs.rs/reqwest/latest/reqwest/dns/trait.Resolve.html) interface. I see two general options, both as extensions to [`ClientOptions`](https://docs.rs/object_store/latest/object_store/struct.ClientOptions.html).

## A: Expose `Resolve`

Add a way for users to specify their own [`Resolve`](https://docs.rs/reqwest/latest/reqwest/dns/trait.Resolve.html) implementation.

**Pros:**

- users can also implement other resolver sources, caching, metrics & logs (e.g. to debug broken DNS setup)

**Cons:**

- users need to write more code to get an arguably "reasonable" behavior

## B: Add `randomize_addrs` flag

Add a flag `randomize_addrs`. If it is set to `true` (by default?), then `object_store` will wrap the default resolver and shuffle the addresses before returning them to [`reqwest`](https://github.com/seanmonstar/reqwest).

**Pros:**

- sensible default

**Cons:**

- less extensible
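
The shuffling step of option B could look roughly like the sketch below. It is stdlib-only and makes several assumptions: the function `randomize_addrs` (named after the proposed flag) and the `XorShift64` generator are hypothetical, the wiring into `reqwest`'s `Resolve` trait is left out, and a real implementation would more likely use a proper RNG crate. The sample addresses are taken from the first DNS capture above.

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};
use std::net::SocketAddr;

/// Tiny xorshift64 generator so the sketch needs no external crate.
/// Not cryptographically secure; only meant to spread connections.
struct XorShift64(u64);

impl XorShift64 {
    /// Seed from the randomly keyed hasher that std uses for `HashMap`.
    fn from_entropy() -> Self {
        let seed = RandomState::new().build_hasher().finish();
        XorShift64(seed | 1) // xorshift state must never be zero
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Hypothetical helper: Fisher-Yates shuffle of the resolved addresses,
/// so that a client connecting "in order" starts at a random address.
/// The modulo introduces a negligible bias for lists this short.
fn randomize_addrs(addrs: &mut [SocketAddr], rng: &mut XorShift64) {
    for i in (1..addrs.len()).rev() {
        let j = (rng.next_u64() % (i as u64 + 1)) as usize;
        addrs.swap(i, j);
    }
}

fn main() {
    let mut addrs: Vec<SocketAddr> = ["16.182.97.32", "52.217.46.62", "52.217.4.118", "52.216.36.80"]
        .iter()
        .map(|ip| format!("{ip}:443").parse::<SocketAddr>().unwrap())
        .collect();

    let mut rng = XorShift64::from_entropy();
    randomize_addrs(&mut addrs, &mut rng);
    println!("connection order: {addrs:?}");
}
```

Because the client tries addresses in order, shuffling the list is enough to change which server the happy path hits; the fallback to the remaining addresses on connection failure is unaffected.
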