mayuropensource opened a new pull request #7022:
URL: https://github.com/apache/arrow/pull/7022


   _(Recreating the PR from a clean repo; sorry about the earlier PR, which was 
not cleanly merged.)_
   
   **JIRA:** https://issues.apache.org/jira/browse/ARROW-8562
   
   This change is not actually used until #6744 (@lidavidm) is merged; however, 
it does not need to wait for that pull request.
   
   **Description:**
   The adaptive I/O coalescing algorithm uses two parameters:
   
       max_io_gap or hole_size_limit: Max I/O gap/hole size in bytes
       ideal_request_size or range_size_limit: Ideal I/O Request size in bytes
   
   These parameters can be derived from S3 metrics as described below:
   
   In S3-compatible storage, there are two main metrics:
   
       Seek-time or Time-To-First-Byte (TTFB) in seconds: call setup latency of 
a new S3 request
   
       Transfer Bandwidth (BW) for data in bytes/sec
   
   **Computing max_io_gap or hole_size_limit:**
   
   max_io_gap = TTFB * BW
   
   This is also called Bandwidth-Delay-Product (BDP).
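
   As a rough illustration of the formula above, the following sketch computes 
the bandwidth-delay product. The function name and the metric values (5 ms TTFB, 
50 MiB/s) are assumptions chosen for the example, not measurements or code from 
this change.

```python
MIB = 1024 * 1024


def max_io_gap(ttfb_seconds: float, bandwidth_bytes_per_sec: float) -> int:
    """Bandwidth-delay product: bytes that could be transferred during one TTFB."""
    if ttfb_seconds < 0 or bandwidth_bytes_per_sec < 0:
        raise ValueError("TTFB and bandwidth must be non-negative")
    # Round to the nearest whole byte to absorb floating-point noise.
    return round(ttfb_seconds * bandwidth_bytes_per_sec)


# Hypothetical metrics, for illustration only.
gap = max_io_gap(ttfb_seconds=0.005, bandwidth_bytes_per_sec=50 * MIB)
print(gap)  # 262144 bytes, i.e. 256 KiB
```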
   
   Two byte ranges separated by a gap can still be served by the same read if 
the gap is smaller than the bandwidth-delay product [TTFB * BW], i.e. if the 
Time-To-First-Byte (the call setup latency of a new S3 request) is expected to 
be greater than the time needed to read and discard the extra bytes on an 
existing HTTP request.
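
   The gap rule above can be sketched as a simple merge over sorted byte 
ranges. This is a minimal illustration, not the implementation in this change: 
the function name `coalesce` and the `(offset, length)` tuple representation are 
assumptions, and the sketch deliberately ignores ideal_request_size.

```python
from typing import List, Tuple

ByteRange = Tuple[int, int]  # (offset, length), a hypothetical representation


def coalesce(ranges: List[ByteRange], max_io_gap: int) -> List[ByteRange]:
    """Merge ranges whose gap to the previous range is less than max_io_gap."""
    merged: List[ByteRange] = []
    for offset, length in sorted(ranges):
        if merged:
            prev_offset, prev_length = merged[-1]
            prev_end = prev_offset + prev_length
            # A negative gap means the ranges overlap, which also merges.
            if offset - prev_end < max_io_gap:
                new_end = max(prev_end, offset + length)
                merged[-1] = (prev_offset, new_end - prev_offset)
                continue
        merged.append((offset, length))
    return merged


# The 50-byte hole is read and discarded; the 800-byte hole starts a new read.
print(coalesce([(0, 100), (150, 50), (1000, 10)], max_io_gap=100))
```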
   
   **Computing ideal_request_size or range_size_limit:**
   
   We want high bandwidth utilization per S3 connection, i.e. each request 
should transfer enough data to amortize the seek overhead.
   However, we also want to exploit parallelism by slicing very large I/O 
chunks into smaller requests. We define two more config parameters, with 
suggested default values, that control the slice size and balance these two 
effects with the goal of maximizing net data load performance.
   
   BW_util (ideal bandwidth utilization):
   The fraction of per-connection bandwidth that should be utilized to 
maximize net data load.
   A good default value is 90% or 0.9.
   
   MAX_IDEAL_REQUEST_SIZE:
   The maximum size (in bytes) of a single request, chosen to maximize net 
data load.
   A good default value is 64 MiB.
   
   To achieve an effective bandwidth eff_BW = BW_util * BW, the amount of data 
transferred in a single S3 get_object request (ideal_request_size) must satisfy:
   eff_BW = ideal_request_size / (TTFB + ideal_request_size / BW)
   
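   Substituting TTFB = max_io_gap / BW and eff_BW = BW_util * BW gives:
   BW_util * (max_io_gap + ideal_request_size) = ideal_request_size
   
   Solving for ideal_request_size, we get the following result:
   ideal_request_size = max_io_gap * BW_util / (1 - BW_util)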
   
   Applying the MAX_IDEAL_REQUEST_SIZE, we get the following:
   ideal_request_size = min(MAX_IDEAL_REQUEST_SIZE, max_io_gap * BW_util / (1 - 
BW_util))
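
   A small sketch of the capped formula, assuming the suggested defaults above 
(BW_util = 0.9, MAX_IDEAL_REQUEST_SIZE = 64 MiB). The function names and the 
example metrics (5 ms TTFB, 50 MiB/s) are hypothetical and only serve to check 
the algebra.

```python
MIB = 1024 * 1024
MAX_IDEAL_REQUEST_SIZE = 64 * MIB  # suggested default from the description


def ideal_request_size(max_io_gap: int, bw_util: float = 0.9,
                       max_size: int = MAX_IDEAL_REQUEST_SIZE) -> int:
    """min(max_size, max_io_gap * bw_util / (1 - bw_util)), in bytes."""
    if not 0.0 < bw_util < 1.0:
        raise ValueError("bw_util must be strictly between 0 and 1")
    uncapped = max_io_gap * bw_util / (1.0 - bw_util)
    return min(max_size, round(uncapped))


def effective_bandwidth(request_size: int, ttfb: float, bw: float) -> float:
    """eff_BW = request_size / (TTFB + request_size / BW), in bytes/sec."""
    return request_size / (ttfb + request_size / bw)


# Hypothetical metrics: TTFB = 5 ms, BW = 50 MiB/s, so max_io_gap = 256 KiB.
ttfb, bw = 0.005, 50 * MIB
size = ideal_request_size(round(ttfb * bw))
print(size)                                      # 2359296 bytes (2.25 MiB)
print(effective_bandwidth(size, ttfb, bw) / bw)  # approximately 0.9
```

   With BW_util = 0.9 the uncapped request size is 9 times max_io_gap, so the 
64 MiB cap only takes effect once max_io_gap exceeds roughly 7.1 MiB.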


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

