Anuj Modi created HADOOP-18971:
----------------------------------
Summary: ABFS: Enable Footer Read Optimizations with Appropriate
Footer Read Buffer Size
Key: HADOOP-18971
URL: https://issues.apache.org/jira/browse/HADOOP-18971
Project: Hadoop Common
Issue Type: Sub-task
Components: fs/azure
Affects Versions: 3.3.6
Reporter: Anuj Modi
Footer read optimization was introduced to hadoop-azure in this Jira:
https://issues.apache.org/jira/browse/HADOOP-17347
and was kept disabled by default.
This PR enables footer read optimization by default, based on the results of
the analysis described below.
In our scale workload analysis, we found that workloads working with Parquet
(or, for that matter, ORC etc.) perform a lot of footer reads. A footer read
here refers to a read operation done by the workload to get the metadata of
the Parquet file, which is required to locate the actual data within the
file.
This whole process takes place in 3 steps:
# The workload reads the last 8 bytes of the Parquet file to get the offset and
size of the metadata, which sits just above these 8 bytes.
# Using that offset, the workload reads the metadata to get the exact offset
and length of the data it wants to read.
# The workload performs the final read operation to get the data it wants to
use.
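
The three steps above can be sketched on an in-memory toy file. This is only an
illustration of the access pattern, not the driver's or Parquet's code: the
layout built by build_toy_file and the helper names are assumptions. Real
Parquet metadata is Thrift-encoded; here the metadata is simply the payload's
offset and length.

```python
import io
import struct

MAGIC = b"PAR1"  # Parquet files end with a 4-byte length followed by this magic


def build_toy_file(payload: bytes) -> bytes:
    """Hypothetical layout: [payload][metadata][4-byte LE metadata length][magic]."""
    metadata = struct.pack("<QQ", 0, len(payload))  # (data offset, data length)
    tail = struct.pack("<I", len(metadata)) + MAGIC
    return payload + metadata + tail


def read_payload(f, file_size: int):
    """Follow the 3-step pattern and record every (offset, length) read issued."""
    reads = []

    def pread(offset: int, length: int) -> bytes:
        reads.append((offset, length))
        f.seek(offset)
        return f.read(length)

    # Step 1: last 8 bytes give the size of the metadata sitting just above them.
    tail = pread(file_size - 8, 8)
    if tail[4:] != MAGIC:
        raise ValueError("not a toy Parquet-like file")
    meta_len = struct.unpack("<I", tail[:4])[0]

    # Step 2: read the metadata to learn where the data lives.
    meta = pread(file_size - 8 - meta_len, meta_len)
    data_offset, data_length = struct.unpack("<QQ", meta)

    # Step 3: read the data itself.
    return pread(data_offset, data_length), reads


blob = build_toy_file(b"hello footer")
data, reads = read_payload(io.BytesIO(blob), len(blob))
print(data, reads)
```

Steps 1 and 2 both land in the tail of the file, so without any optimization
each of them becomes a separate small remote request.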
Here the first two steps are metadata reads that can be combined into a single
footer read. When a workload tries to read the last few bytes of a file (let's
call this range the footer size), the driver reads some extra bytes above the
footer in the same request to serve the next read that is about to come.
Q. What is the footer size of a file?
A. 16KB. Any read request for data within the last 16KB of the file qualifies
for a whole footer read. This value is enough to cover all types of files,
including Parquet, ORC, etc.
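
A minimal sketch of the qualification rule described in this answer, assuming
the check is made on the starting offset of the request. The function name and
signature are hypothetical and are not taken from the ABFS driver.

```python
FOOTER_SIZE = 16 * 1024  # 16KB, the footer size stated above


def qualifies_for_footer_read(offset: int, file_size: int,
                              footer_size: int = FOOTER_SIZE) -> bool:
    """True when a read starting at `offset` begins inside the last
    `footer_size` bytes of a file of `file_size` bytes."""
    if not 0 <= offset < file_size:
        return False
    footer_start = max(0, file_size - footer_size)
    return offset >= footer_start


one_mb = 1024 * 1024
print(qualifies_for_footer_read(one_mb - 8, one_mb))  # read of the last 8 bytes
print(qualifies_for_footer_read(0, one_mb))           # read from the file start
```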
Q. What is the buffer size to read when reading the footer?
A. Let's call this the footer read buffer size. Prior to this PR, the footer
read buffer size was the same as the read buffer size (default 4MB). It was
found that for most workloads the required footer size was only 256KB, i.e. for
almost all Parquet files the metadata was found within the last 256KB. Keeping
this in mind, it does not make sense to read a whole 4MB buffer as part of a
footer read. Moreover, reading more data than required incurs additional cost
in terms of server and network latency. Based on this and extensive
experimentation, a footer read buffer size of 512KB was observed to be ideal
for almost all workloads running on Parquet, ORC, etc.
The following configuration was introduced to set the footer read buffer size:
{*}fs.azure.footer.read.request.size{*}: default 512 KB.
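
To make the effect of the buffer size concrete, here is a rough sketch of how a
single tail read could be planned and checked against the two metadata reads.
The helper names are hypothetical, and the logic is an assumption based on the
description above rather than the driver's implementation. The 100MB file and
256KB metadata size are example values.

```python
KB = 1024
MB = 1024 * KB


def plan_footer_read(file_size: int, footer_read_buffer_size: int = 512 * KB):
    """Return (offset, length) of one ranged read covering the file's tail."""
    length = min(file_size, footer_read_buffer_size)
    return file_size - length, length


def covers(plan, offset: int, length: int) -> bool:
    """True when the planned range fully contains the requested range."""
    start, size = plan
    return start <= offset and offset + length <= start + size


file_size = 100 * MB
meta_len = 256 * KB  # example metadata size

plan = plan_footer_read(file_size)
tail_ok = covers(plan, file_size - 8, 8)                    # step 1 read
meta_ok = covers(plan, file_size - 8 - meta_len, meta_len)  # step 2 read
print(plan, tail_ok, meta_ok)

# Bytes transferred per footer read with the old and new defaults.
old_len = plan_footer_read(file_size, 4 * MB)[1]
new_len = plan[1]
print(old_len // new_len)  # 4MB / 512KB = 8
```

In this example one 512KB request serves both metadata reads while
transferring one eighth of the bytes that a 4MB footer read would.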