Anuj Modi created HADOOP-18971:
----------------------------------

             Summary: ABFS: Enable Footer Read Optimizations with Appropriate 
Footer Read Buffer Size
                 Key: HADOOP-18971
                 URL: https://issues.apache.org/jira/browse/HADOOP-18971
             Project: Hadoop Common
          Issue Type: Sub-task
          Components: fs/azure
    Affects Versions: 3.3.6
            Reporter: Anuj Modi


Footer read optimization was introduced to hadoop-azure in HADOOP-17347:
https://issues.apache.org/jira/browse/HADOOP-17347
and was kept disabled by default.
This PR enables footer reads by default, based on the results of the analysis 
described below.

In our scale workload analysis, we found that workloads working with 
Parquet (or, for that matter, ORC etc.) perform a lot of footer reads. A footer 
read here is a read operation the workload issues to fetch the metadata of the 
Parquet file, which it needs in order to locate the actual data within the 
file.
This whole process takes place in 3 steps:
 # The workload reads the last 8 bytes of the Parquet file to get the offset 
and size of the metadata, which sits immediately before these 8 bytes.
 # Using that offset, the workload reads the metadata to get the exact offset 
and length of the data it wants to read.
 # The workload performs the final read operation to fetch the data itself.
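
The first two steps can be sketched in Java. This is only an illustration, not 
ABFS or Parquet library code: the class and method names are hypothetical, and 
it assumes the standard unencrypted Parquet layout, in which the trailing 
8 bytes hold a 4-byte little-endian metadata length followed by the magic 
bytes PAR1, so the metadata offset is derived from the file length.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: locates the Parquet metadata from the file's last 8 bytes. */
final class ParquetTailSketch {
    /** 4-byte little-endian metadata length + 4-byte magic "PAR1". */
    static final int TAIL_LENGTH = 8;

    /**
     * Step 1: interpret the last 8 bytes of the file.
     * Returns {metadataOffset, metadataLength}, which is the range step 2 must read.
     */
    static long[] locateMetadata(byte[] tail, long fileLength) {
        if (tail.length != TAIL_LENGTH) {
            throw new IllegalArgumentException("expected exactly 8 tail bytes");
        }
        String magic = new String(tail, 4, 4, StandardCharsets.US_ASCII);
        if (!"PAR1".equals(magic)) {
            throw new IllegalArgumentException("not an unencrypted Parquet tail: " + magic);
        }
        int metadataLength = ByteBuffer.wrap(tail, 0, 4)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getInt();
        // The metadata sits immediately before the 8-byte tail.
        long metadataOffset = fileLength - TAIL_LENGTH - metadataLength;
        if (metadataLength < 0 || metadataOffset < 0) {
            throw new IllegalArgumentException("metadata length does not fit in file");
        }
        return new long[] {metadataOffset, metadataLength};
    }
}
```

Without the optimization, step 1 and step 2 each cost one remote read; the 
range returned here is what the second read has to fetch.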

The first two steps are metadata reads that can be combined into a single 
footer read. When the workload reads the last few bytes of the file (call this 
range the footer size), the driver also reads some extra bytes preceding the 
footer, so that the next read (step 2) can be served from the same buffer.

Q. What is the footer size of a file?
A. 16KB. Any read request for data within the last 16KB of the file 
qualifies for a whole footer read. This value is large enough to cover all 
types of files, including Parquet, ORC, etc.
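
As a rough sketch of the qualification rule described above: the class, 
method, and constant names are hypothetical and do not come from the ABFS 
driver, and testing the start position of the read is an assumption about how 
"within the last 16KB" is evaluated.

```java
/** Hypothetical sketch of the "is this a footer read?" check. */
final class FooterReadCheckSketch {
    /** Footer size from the description above: 16KB. */
    static final long FOOTER_SIZE_BYTES = 16L * 1024;

    /** True if a read starting at readPosition falls within the last 16KB of the file. */
    static boolean qualifiesForFooterRead(long readPosition, long fileLength) {
        if (readPosition < 0 || readPosition >= fileLength) {
            return false; // out of range, nothing to read
        }
        // For files smaller than 16KB the right-hand side is negative,
        // so every valid position qualifies.
        return readPosition >= fileLength - FOOTER_SIZE_BYTES;
    }
}
```

Under this sketch, the 8-byte tail read of step 1 always qualifies, because it 
starts 8 bytes before the end of the file.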

Q. What is the buffer size to read when reading the footer?
A. Let's call this the footer read buffer size. Prior to this PR, the footer 
read buffer size was the same as the read buffer size (default 4MB). We found 
that for most workloads the required footer size was only 256KB, i.e. for 
almost all Parquet files the metadata was within the last 256KB of the file. 
Given this, it does not make sense to read a whole 4MB buffer as part of a 
footer read. Moreover, reading more data than required incurs additional cost 
in terms of server and network latency. Based on this and extensive 
experimentation, a footer read buffer size of 512KB was observed to be ideal 
for almost all workloads running on Parquet, ORC, etc.
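
To make the cost difference concrete, the following sketch computes the byte 
range a footer read would fetch for a given footer read buffer size. It 
assumes the driver reads the last N bytes of the file, clamped at the start of 
the file; the names are hypothetical and this is not the ABFS implementation.

```java
/** Hypothetical sketch: which byte range does a footer read fetch? */
final class FooterReadRangeSketch {
    /** Footer read buffer size proposed above: 512KB. */
    static final long DEFAULT_FOOTER_READ_BUFFER_SIZE = 512L * 1024;

    /** Returns {startOffset, length} of the range covering the end of the file. */
    static long[] footerReadRange(long fileLength, long footerReadBufferSize) {
        // Clamp at 0 so files smaller than the buffer are read whole.
        long start = Math.max(0L, fileLength - footerReadBufferSize);
        return new long[] {start, fileLength - start};
    }
}
```

For a 100 MiB file, footerReadRange(104857600L, 512 * 1024) yields a 
524288-byte read starting at offset 104333312, versus 4194304 bytes when the 
footer read used the 4MB read buffer size.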

The following configuration was introduced to set the footer read buffer size:
{*}fs.azure.footer.read.request.size{*}: default 512 KB.


