crepererum opened a new issue, #7117:
URL: https://github.com/apache/arrow-rs/issues/7117

   # Problem Description
   This is specific to AWS S3. Note that S3 only supports HTTP/1.1, so **no** 
connection multiplexing will happen. This means that two concurrent requests 
will use different TCP+TLS connections.
   
   If you issue two or more requests to S3 at the same time (to the **same** 
region + bucket), all of these will use the **same** S3 IP address, even though 
S3 advertises multiple addresses in the DNS response (see DNS analysis below). 
This happens even when these requests are issued from **different** 
`ObjectStore` instances (see the resolver analysis below for why this happens). 
This behavior was confirmed by analyzing network traffic with 
[Wireshark](https://www.wireshark.org/). This is bad for the following reasons:
   
   ## Performance
   Since all concurrent requests go to one address, it is far more likely that 
you overload a single S3 server.
   
   ## Latency Racing (= Racing Reads)
   In theory an `object_store` user could race two requests (esp. `GET` 
requests) to the same object hoping that one of them will be faster. There's 
evidence that this works:
   
   - [Performance guidelines for Amazon S3 ⇒ Retry requests for 
latency-sensitive 
applications](https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance-guidelines.html#optimizing-performance-guidelines-retry)
   - [The Five-Minute Rule for the Cloud: Caching in Analytics Systems ⇒ 5.2 
Latency-Sensitive Workloads ⇒ Racing 
reads](https://vldb.org/cidrdb/papers/2025/p4-duwe.pdf)
   
   Note that this trades cost (via number of requests) for improved tail 
latency. However, if all racing requests connect to the same S3 server, this 
is far less likely to work.
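   
   The racing idea can be sketched without any `object_store` or S3 specifics. 
This is only a minimal illustration using plain threads and closures: `race` is 
a hypothetical helper (not an `object_store` API) and each closure stands in 
for one `GET` attempt.

```rust
use std::sync::mpsc;
use std::thread;

/// Runs every task on its own thread and returns the first result to arrive.
/// Slower tasks keep running in the background; their results are discarded.
fn race<T, F>(tasks: Vec<F>) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    for task in tasks {
        let tx = tx.clone();
        thread::spawn(move || {
            // The receiver may already be gone if another task won the race.
            let _ = tx.send(task());
        });
    }
    // Drop the original sender so `recv` fails if no task was supplied.
    drop(tx);
    rx.recv().ok()
}
```

   Racing only pays off if the attempts are independent. If both attempts land 
on the same server, a slow server makes both of them slow.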
   
   ## Fault Tolerance
   Since an S3 server might be down, concentrating all requests on one server 
exacerbates this issue.
   
   ## Persistence
   Since the HTTP/1.1 connections are kept alive (mostly until the AWS side 
terminates them), this server pinning can persist long after the first requests 
are made.
   
   # Technical Analysis
   To understand why this is happening, we need to look at different parts of 
the stack.
   ## DNS
   Resolving the S3 IP looks like this on the DNS layer (captured using 
[Wireshark](https://www.wireshark.org/)):
   
   <details>
   
   ```text
   Domain Name System (response)
       Transaction ID: 0x07d6
       Flags: 0x8180 Standard query response, No error
       Questions: 1
       Answer RRs: 8
       Authority RRs: 0
       Additional RRs: 1
       Queries
           s3.us-east-1.amazonaws.com: type A, class IN
               Name: s3.us-east-1.amazonaws.com
               [Name Length: 26]
               [Label Count: 4]
               Type: A (1) (Host Address)
               Class: IN (0x0001)
       Answers
           s3.us-east-1.amazonaws.com: type A, class IN, addr 16.182.97.32
               Name: s3.us-east-1.amazonaws.com
               Type: A (1) (Host Address)
               Class: IN (0x0001)
               Time to live: 5 (5 seconds)
               Data length: 4
               Address: 16.182.97.32
           s3.us-east-1.amazonaws.com: type A, class IN, addr 52.217.46.62
               Name: s3.us-east-1.amazonaws.com
               Type: A (1) (Host Address)
               Class: IN (0x0001)
               Time to live: 5 (5 seconds)
               Data length: 4
               Address: 52.217.46.62
           s3.us-east-1.amazonaws.com: type A, class IN, addr 52.217.4.118
               Name: s3.us-east-1.amazonaws.com
               Type: A (1) (Host Address)
               Class: IN (0x0001)
               Time to live: 5 (5 seconds)
               Data length: 4
               Address: 52.217.4.118
           s3.us-east-1.amazonaws.com: type A, class IN, addr 52.216.36.80
               Name: s3.us-east-1.amazonaws.com
               Type: A (1) (Host Address)
               Class: IN (0x0001)
               Time to live: 5 (5 seconds)
               Data length: 4
               Address: 52.216.36.80
           s3.us-east-1.amazonaws.com: type A, class IN, addr 52.216.38.224
               Name: s3.us-east-1.amazonaws.com
               Type: A (1) (Host Address)
               Class: IN (0x0001)
               Time to live: 5 (5 seconds)
               Data length: 4
               Address: 52.216.38.224
           s3.us-east-1.amazonaws.com: type A, class IN, addr 52.216.51.128
               Name: s3.us-east-1.amazonaws.com
               Type: A (1) (Host Address)
               Class: IN (0x0001)
               Time to live: 5 (5 seconds)
               Data length: 4
               Address: 52.216.51.128
           s3.us-east-1.amazonaws.com: type A, class IN, addr 3.5.1.11
               Name: s3.us-east-1.amazonaws.com
               Type: A (1) (Host Address)
               Class: IN (0x0001)
               Time to live: 5 (5 seconds)
               Data length: 4
               Address: 3.5.1.11
           s3.us-east-1.amazonaws.com: type A, class IN, addr 54.231.204.96
               Name: s3.us-east-1.amazonaws.com
               Type: A (1) (Host Address)
               Class: IN (0x0001)
               Time to live: 5 (5 seconds)
               Data length: 4
               Address: 54.231.204.96
       Additional records
           <Root>: type OPT
               Name: <Root>
               Type: OPT (41) 
               UDP payload size: 1232
               Higher bits in extended RCODE: 0x00
               EDNS0 version: 0
               Z: 0x0000
                   0... .... .... .... = DO bit: Cannot handle DNSSEC security RRs
                   .000 0000 0000 0000 = Reserved: 0x0000
               Data length: 285
               Option: PADDING
       [Request In: 1627]
       [Time: 0.072296785 seconds]
   ```
   
   </details>
   
   i.e. that's 8 different IPs with a 5s TTL.
   
   If we ask again later, we'll get a slightly different response:
   
   <details>
   
   ```text
   Domain Name System (response)
       Transaction ID: 0x817a
       Flags: 0x8180 Standard query response, No error
       Questions: 1
       Answer RRs: 8
       Authority RRs: 0
       Additional RRs: 1
       Queries
           s3.us-east-1.amazonaws.com: type A, class IN
               Name: s3.us-east-1.amazonaws.com
               [Name Length: 26]
               [Label Count: 4]
               Type: A (1) (Host Address)
               Class: IN (0x0001)
       Answers
           s3.us-east-1.amazonaws.com: type A, class IN, addr 16.182.102.192
               Name: s3.us-east-1.amazonaws.com
               Type: A (1) (Host Address)
               Class: IN (0x0001)
               Time to live: 5 (5 seconds)
               Data length: 4
               Address: 16.182.102.192
           s3.us-east-1.amazonaws.com: type A, class IN, addr 52.216.136.77
               Name: s3.us-east-1.amazonaws.com
               Type: A (1) (Host Address)
               Class: IN (0x0001)
               Time to live: 5 (5 seconds)
               Data length: 4
               Address: 52.216.136.77
           s3.us-east-1.amazonaws.com: type A, class IN, addr 52.217.204.8
               Name: s3.us-east-1.amazonaws.com
               Type: A (1) (Host Address)
               Class: IN (0x0001)
               Time to live: 5 (5 seconds)
               Data length: 4
               Address: 52.217.204.8
           s3.us-east-1.amazonaws.com: type A, class IN, addr 54.231.196.248
               Name: s3.us-east-1.amazonaws.com
               Type: A (1) (Host Address)
               Class: IN (0x0001)
               Time to live: 5 (5 seconds)
               Data length: 4
               Address: 54.231.196.248
           s3.us-east-1.amazonaws.com: type A, class IN, addr 54.231.236.208
               Name: s3.us-east-1.amazonaws.com
               Type: A (1) (Host Address)
               Class: IN (0x0001)
               Time to live: 5 (5 seconds)
               Data length: 4
               Address: 54.231.236.208
           s3.us-east-1.amazonaws.com: type A, class IN, addr 52.217.201.80
               Name: s3.us-east-1.amazonaws.com
               Type: A (1) (Host Address)
               Class: IN (0x0001)
               Time to live: 5 (5 seconds)
               Data length: 4
               Address: 52.217.201.80
           s3.us-east-1.amazonaws.com: type A, class IN, addr 3.5.31.42
               Name: s3.us-east-1.amazonaws.com
               Type: A (1) (Host Address)
               Class: IN (0x0001)
               Time to live: 5 (5 seconds)
               Data length: 4
               Address: 3.5.31.42
           s3.us-east-1.amazonaws.com: type A, class IN, addr 52.217.128.224
               Name: s3.us-east-1.amazonaws.com
               Type: A (1) (Host Address)
               Class: IN (0x0001)
               Time to live: 5 (5 seconds)
               Data length: 4
               Address: 52.217.128.224
       Additional records
           <Root>: type OPT
               Name: <Root>
               Type: OPT (41) 
               UDP payload size: 1232
               Higher bits in extended RCODE: 0x00
               EDNS0 version: 0
               Z: 0x0000
                   0... .... .... .... = DO bit: Cannot handle DNSSEC security RRs
                   .000 0000 0000 0000 = Reserved: 0x0000
               Data length: 285
               Option: PADDING
       [Request In: 67958]
       [Time: 0.140540958 seconds]
   ```
   
   </details>
   
   I've searched through the DNS-related RFCs but couldn't find any guidance on 
whether the order of the answers is significant. _The internet_ 
([1](https://serverfault.com/questions/906900/importance-of-the-answer-order-in-a-dns-lookup),
 
[2](https://serverfault.com/questions/264799/what-is-the-purpose-of-a-dns-server-returning-more-than-1-a-record),
 
[3](https://superuser.com/questions/1736076/when-i-access-a-server-that-has-multiple-dns-ip-addresses-which-one-do-i-use))
 suggests that most implementations use the IPs in order (moving on to the 
next one after a timeout), but that the standard actually makes NO claim on 
that front.
   
   ## Resolver
   [`reqwest`](https://github.com/seanmonstar/reqwest) -- which is the 
high-level HTTP client library that `object_store` uses -- has a high-level 
interface called 
[`Resolve`](https://docs.rs/reqwest/latest/reqwest/dns/trait.Resolve.html) 
which resolves one host name to **multiple** IP addresses.
   
   By default [`reqwest`](https://github.com/seanmonstar/reqwest) uses 
[`getaddrinfo`](https://man.archlinux.org/man/getaddrinfo.3) (see 
[1](https://github.com/seanmonstar/reqwest/blob/37074368012ce42e61e5649c2fffcf8c8a979e1e/src/async_impl/client.rs#L319-L325),
 
[2](https://github.com/seanmonstar/reqwest/blob/37074368012ce42e61e5649c2fffcf8c8a979e1e/src/dns/gai.rs#L8-L32),
 
[3](https://github.com/hyperium/hyper-util/blob/46826ea75836852fac53ff075a12cba7e290946e/src/client/legacy/connect/dns.rs#L43-L47)),
 i.e. the system resolver. That one will very likely cache resolution based on 
the 5s TTL (see above). In fact I can see that behavior using 
[Wireshark](https://www.wireshark.org/).
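   
   To look at the order in which the system resolver hands back addresses 
without going through `reqwest`, a small standard-library sketch is enough. 
`resolve_all` is a hypothetical helper name. The assumption here is that 
`ToSocketAddrs` goes through the system resolver for host names (on Unix-like 
systems the standard library does this via `getaddrinfo`); literal IP addresses 
are parsed directly.

```rust
use std::io;
use std::net::{SocketAddr, ToSocketAddrs};

/// Resolves `host` via the standard library and returns the addresses in the
/// order in which they were handed back.
fn resolve_all(host: &str, port: u16) -> io::Result<Vec<SocketAddr>> {
    Ok((host, port).to_socket_addrs()?.collect())
}
```

   Calling `resolve_all("s3.us-east-1.amazonaws.com", 443)` a few times within 
the TTL window is one way to check whether the order changes between calls.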
   
   ## Address Usage
   Now, how are these multiple addresses used? If you search through the code, 
you'll eventually get 
[here](https://github.com/hyperium/hyper-util/blob/46826ea75836852fac53ff075a12cba7e290946e/src/client/legacy/connect/http.rs#L697-L722)
 and see that [`hyper-util`](https://github.com/hyperium/hyper-util) (used by 
[`reqwest`](https://github.com/seanmonstar/reqwest) for the wiring of low-level 
components) will try to connect to the IP addresses in order and will only 
move on to the next one if the connection cannot be established or a timeout 
occurs. So in the _happy path_ this will always connect to the first address.
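   
   The in-order behavior can be modeled roughly as follows. This is a 
simplified sketch, not the actual `hyper-util` code: `connect_in_order` is a 
hypothetical name, and the connection attempt is passed in as a closure so the 
logic stays independent of the network.

```rust
use std::io;
use std::net::SocketAddr;

/// Tries the addresses in the given order and returns the first successful
/// connection together with the address that produced it. Later addresses are
/// only tried when an earlier attempt fails.
fn connect_in_order<C>(
    addrs: &[SocketAddr],
    mut connect: impl FnMut(SocketAddr) -> io::Result<C>,
) -> io::Result<(SocketAddr, C)> {
    let mut last_err = io::Error::new(io::ErrorKind::InvalidInput, "no addresses to connect to");
    for &addr in addrs {
        match connect(addr) {
            Ok(conn) => return Ok((addr, conn)),
            // Remember the failure and fall through to the next address.
            Err(err) => last_err = err,
        }
    }
    Err(last_err)
}
```

   With a real socket the closure could be something like 
`|addr| TcpStream::connect_timeout(&addr, timeout)`. As long as the first 
address accepts connections, the remaining addresses are never touched.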
   
   # Solutions
   I think we should keep using 
[`reqwest`](https://github.com/seanmonstar/reqwest) since in general it serves 
us well. So a natural way to change the current behavior would be using the 
aforementioned 
[`Resolve`](https://docs.rs/reqwest/latest/reqwest/dns/trait.Resolve.html) 
interface. I see two general options, both as extensions to 
[`ClientOptions`](https://docs.rs/object_store/latest/object_store/struct.ClientOptions.html).
   
   ## A: Expose `Resolve`
   Add a way for users to specify their own 
[`Resolve`](https://docs.rs/reqwest/latest/reqwest/dns/trait.Resolve.html) 
implementation.
   
   **Pros:**
   - users can also implement other resolver sources, caching, metrics & logs 
(e.g. to debug broken DNS setup)
   
   **Cons:**
   - users need to write more code to get an arguably "reasonable" behavior
   
   ## B: Add `randomize_addrs` flag
   Add a flag `randomize_addrs`. If it is set to `true` (by default?), then 
`object_store` will wrap the default resolver and shuffle the addresses before 
returning them to [`reqwest`](https://github.com/seanmonstar/reqwest).
   
   **Pros:**
   - sensible default
   
   **Cons:**
   - less extensible
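   
   The shuffling step of option B could look roughly like this. It is a 
dependency-free sketch under stated assumptions: `shuffle_addrs` and 
`random_seed` are hypothetical names, the seed comes from the standard 
library's randomly keyed `RandomState`, and the generator is a plain xorshift 
whose slight modulo bias is ignored. A real implementation would more likely 
use a proper RNG crate inside the resolver wrapper.

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};
use std::net::SocketAddr;

/// Produces a seed without external crates: every `RandomState` is created
/// with different keys, so the hash of an empty input differs between calls.
fn random_seed() -> u64 {
    RandomState::new().build_hasher().finish()
}

/// Fisher-Yates shuffle driven by a small xorshift generator.
fn shuffle_addrs(addrs: &mut [SocketAddr], seed: u64) {
    let mut state = seed | 1; // xorshift must not start at zero
    for i in (1..addrs.len()).rev() {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        // Pick a position in 0..=i and swap it into place.
        let j = (state % (i as u64 + 1)) as usize;
        addrs.swap(i, j);
    }
}
```

   Calling `shuffle_addrs(&mut addrs, random_seed())` on each resolution 
spreads new connections across the advertised addresses, because the connector 
still picks the first entry of whatever list it receives.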

