mayuropensource opened a new pull request #7022: URL: https://github.com/apache/arrow/pull/7022
_(Recreating the PR from a clean repo; sorry about the earlier PR, which was not cleanly merged.)_

**JIRA:** https://issues.apache.org/jira/browse/ARROW-8562

This change is not actually used until #6744 (@lidavidm) is pushed. However, it doesn't need to wait for the other pull request to be merged.

**Description:**

The adaptive I/O coalescing algorithm uses two parameters:

1. `max_io_gap` or `hole_size_limit`: maximum I/O gap/hole size in bytes
2. `ideal_request_size` or `range_size_limit`: ideal I/O request size in bytes

These parameters can be derived from S3 metrics as described below.

In an S3-compatible storage system, there are two main metrics:

1. Seek time or Time-To-First-Byte (TTFB) in seconds: the call setup latency of a new S3 request
2. Transfer bandwidth (BW) for data in bytes/sec

**1. Computing max_io_gap or hole_size_limit:**

`max_io_gap = TTFB * BW`

This is also called the Bandwidth-Delay Product (BDP). Two byte ranges that have a gap can still be mapped to the same read if the gap is less than the bandwidth-delay product [TTFB * TransferBandwidth], i.e. if the Time-To-First-Byte (the call setup latency of a new S3 request) is expected to be greater than the time it takes to read and discard the extra bytes on an existing HTTP request.

**2. Computing ideal_request_size or range_size_limit:**

We want high bandwidth utilization per S3 connection, i.e. we want to transfer large amounts of data per request to amortize the seek overhead. But we also want to leverage parallelism by slicing very large I/O chunks. We define two more config parameters, with suggested default values, that control the slice size and balance these two effects, with the goal of maximizing net data load performance.

- `BW_util` (ideal bandwidth utilization): the fraction of per-connection bandwidth that should be utilized to maximize net data load. A good default value is 90% or 0.9.
- `MAX_IDEAL_REQUEST_SIZE`: the maximum size of a single request (in bytes) to maximize net data load. A good default value is 64 MiB.
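The `max_io_gap` rule described above can be sketched in a few lines of Python. This is only an illustration, not the code in this PR: the function names are hypothetical, the TTFB and bandwidth values are made-up numbers rather than measurements, and treating a gap exactly equal to the limit as coalescible is a choice made for this sketch.

```python
MIB = 1024 * 1024


def max_io_gap_bytes(ttfb_seconds: float, bandwidth_bytes_per_sec: float) -> int:
    """Bandwidth-delay product: bytes that could be transferred during one TTFB."""
    if ttfb_seconds < 0 or bandwidth_bytes_per_sec < 0:
        raise ValueError("TTFB and bandwidth must be non-negative")
    return int(ttfb_seconds * bandwidth_bytes_per_sec)


def should_coalesce(first_end: int, second_start: int, max_io_gap: int) -> bool:
    """Decide whether two byte ranges should be served by a single read.

    first_end is the exclusive end offset of the earlier range and
    second_start is the start offset of the later one. Overlapping ranges
    (negative gap) are always merged; the <= boundary is an assumption.
    """
    gap = second_start - first_end
    return gap <= max_io_gap


# Hypothetical metrics: 125 ms TTFB and 64 MiB/s per-connection bandwidth.
gap_limit = max_io_gap_bytes(0.125, 64 * MIB)  # 8 MiB

print(should_coalesce(100, 200, gap_limit))       # small hole: read through it
print(should_coalesce(0, 10_000_000, gap_limit))  # hole larger than 8 MiB: new request
```

With these example numbers, reading and discarding up to 8 MiB on an open connection is expected to cost no more than setting up a new request.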
The amount of data that needs to be transferred in a single S3 get_object request to achieve an effective bandwidth of `eff_BW = BW_util * BW` follows from:

`eff_BW = ideal_request_size / (TTFB + ideal_request_size / BW)`

Substituting `TTFB = max_io_gap / BW` and `eff_BW = BW_util * BW`, we get the following result:

`ideal_request_size = max_io_gap * BW_util / (1 - BW_util)`

Applying the `MAX_IDEAL_REQUEST_SIZE` cap, we get the following:

`ideal_request_size = min(MAX_IDEAL_REQUEST_SIZE, max_io_gap * BW_util / (1 - BW_util))`
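As a rough illustration of the `ideal_request_size` formula above, here is a minimal Python sketch. The function and parameter names are hypothetical and are not the API added by this PR; the defaults mirror the suggested values of 0.9 and 64 MiB from the description, and the input gaps are made-up example values.

```python
MIB = 1024 * 1024


def ideal_request_size_bytes(
    max_io_gap: int,
    bw_util: float = 0.9,
    max_ideal_request_size: int = 64 * MIB,
) -> int:
    """Compute min(MAX_IDEAL_REQUEST_SIZE, max_io_gap * BW_util / (1 - BW_util))."""
    if not 0.0 <= bw_util < 1.0:
        # bw_util == 1 would need an infinitely large request to hide the TTFB.
        raise ValueError("bw_util must be in the range [0, 1)")
    uncapped = max_io_gap * bw_util / (1.0 - bw_util)
    return int(min(max_ideal_request_size, uncapped))


# At 50% utilization the request size equals the bandwidth-delay product.
print(ideal_request_size_bytes(1 * MIB, bw_util=0.5))

# At 90% utilization the request is roughly 9x the gap (about 9 MiB here).
print(ideal_request_size_bytes(1 * MIB))

# An 8 MiB gap would call for about 72 MiB, so the 64 MiB cap applies.
print(ideal_request_size_bytes(8 * MIB))
```

The factor `BW_util / (1 - BW_util)` grows quickly as the target utilization approaches 1, which is why the cap is needed to keep individual requests small enough to parallelize.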