[jira] [Work logged] (HADOOP-18177) document use and architecture design of prefetching s3a input stream

ASF GitHub Bot (Jira) Tue, 26 Apr 2022 04:29:06 -0700


     [ 
https://issues.apache.org/jira/browse/HADOOP-18177?focusedWorklogId=762239&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-762239
 ]


ASF GitHub Bot logged work on HADOOP-18177:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 26/Apr/22 11:28
            Start Date: 26/Apr/22 11:28
    Worklog Time Spent: 10m 
      Work Description: dannycjones commented on code in PR #4205:
URL: https://github.com/apache/hadoop/pull/4205#discussion_r858597582


##########
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/prefetching.md:
##########
@@ -0,0 +1,192 @@
+<!---
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+
+# S3A Prefetching
+
+This document explains the `S3PrefetchingInputStream` and the various 
components it uses.
+
+This input stream implements prefetching and caching to improve read 
performance of the input
+stream.
+A high level overview of this feature was published in
+[Pinterest Engineering's blog post titled "Improving efficiency and reducing 
runtime using S3 read 
optimization"](https://medium.com/pinterest-engineering/improving-efficiency-and-reducing-runtime-using-s3-read-optimization-b31da4b60fa0).
+
+With prefetching, the input stream divides the remote file into blocks of a 
fixed size, associates
+buffers to these blocks and then reads data into these buffers asynchronously. 
+It also potentially caches these blocks.
+
+### Basic Concepts
+
+* **Remote File**: A binary blob of data stored on some storage device.
+* **Block File**: Local file containing a block of the remote file.
+* **Block**: A file is divided into a number of blocks. 
+The size of the first n-1 blocks is same, and the size of the last block may 
be same or smaller.
+* **Block based reading**: The granularity of read is one block. 
+That is, either an entire block is read and returned or none at all. 
+Multiple blocks may be read in parallel.
+
+### Configuring the stream
+
+|Property    |Meaning    |Default    |
+|---   |---    |---    |
+|`fs.s3a.prefetch.enabled`    |Enable the prefetch input stream    |`true` |
+|`fs.s3a.prefetch.block.size`    |Size of a block    |`8M`    |
+|`fs.s3a.prefetch.block.count`    |Number of blocks to prefetch    |`8`    |
+
+### Key Components
+
+`S3PrefetchingInputStream` - When prefetching is enabled, S3AFileSystem will 
return an instance of
+this class as the input stream. 
+Depending on the remote file size, it will either use
+the `S3InMemoryInputStream` or the `S3CachingInputStream` as the underlying 
input stream.
+
+`S3InMemoryInputStream` - Underlying input stream used when the remote file 
size < configured block
+size. 
+Will read the entire remote file into memory.
+
+`S3CachingInputStream` - Underlying input stream used when remote file size > 
configured block size.
+Uses asynchronous prefetching of blocks and caching to improve performance.
+
+`BlockData` - Holds information about the blocks in a remote file, such as:
+
+* Number of blocks in the remote file
+* Block size
+* State of each block (initially all blocks have state *NOT_READY*). 
+Other states are: Queued, Ready, Cached.
+
+`BufferData` - Holds the buffer and additional information about it such as:
+
+* The block number this buffer is for
+* State of the buffer (Unknown, Blank, Prefetching, Caching, Ready, Done). 
+Initial state of a buffer is blank.
+
+`CachingBlockManager` - Implements reading data into the buffer, prefetching 
and caching.
+
+`BufferPool` - Manages a fixed sized pool of buffers. 
+It’s used by `CachingBlockManager` to acquire buffers.
+
+`S3File` - Implements operations to interact with S3 such as opening and 
closing the input stream to
+the remote file in S3.
+
+`S3Reader` - Implements reading from the stream opened by `S3File`. 
+Reads from this input stream in blocks of 64KB.
+
+`FilePosition` - Provides functionality related to tracking the position in 
the file. 
+Also gives access to the current buffer in use.
+
+`SingleFilePerBlockCache` - Responsible for caching blocks to the local file 
system. 
+Each cache block is stored on the local disk as a separate block file.
+
+### Operation
+
+#### S3InMemoryInputStream
+
+For a remote file with size 5MB, and block size = 8MB, since file size is less 
than the block size,
+the `S3InMemoryInputStream` will be used.
+
+If the caller makes the following read calls:
+
+```
+in.read(buffer, 0, 3MB);
+in.read(buffer, 0, 2MB);
+```
+
+When the first read is issued, there is no buffer in use yet. 
+The `S3InMemoryInputStream` gets the data in this remote file by calling the 
`ensureCurrentBuffer()` 
+method, which ensures that a buffer with data is available to be read from.
+
+The `ensureCurrentBuffer()` then:
+
+* Reads data into a buffer by calling `S3Reader.read(ByteBuffer buffer, long 
offset, int size)`.
+* `S3Reader` uses `S3File` to open an input stream to the remote file in S3 by 
making
+  a `getObject()` request with range as `(0, filesize)`.
+* The `S3Reader` reads the entire remote file into the provided buffer, and 
once reading is complete
+  closes the S3 stream and frees all underlying resources.
+* Now the entire remote file is in a buffer, set this data in `FilePosition` 
so it can be accessed
+  by the input stream.
+
+The read operation now just gets the required bytes from the buffer in 
`FilePosition`.
+
+When the second read is issued, there is already a valid buffer which can be 
used. 
+Don’t do anything else, just read the required bytes from this buffer.
+
+#### S3CachingInputStream
+
+If there is a remote file with size 40MB and block size = 8MB, the 
`S3CachingInputStream` will be
+used.
+
+##### Sequential Reads
+
+If the caller makes the following calls:
+
+```
+in.read(buffer, 0, 5MB)
+in.read(buffer, 0, 8MB)
+```
+
+For the first read call, there is no valid buffer yet. 
+`ensureCurrentBuffer()` is called, and for the first `read()`, prefetch count 
is set as 1.
+
+The current block (block 0) is read synchronously, while the blocks to be 
prefetched (block 1) is
+read asynchronously.
+
+The `CachingBlockManager` is responsible for getting buffers from the buffer 
pool and reading data
+into them. This process of acquiring the buffer pool works as follows:
+
+* The buffer pool keeps a map of allocated buffers and a pool of available 
buffers. 
+The size of this pool is = prefetch block count + 1. 
+If the prefetch block count is 8, the buffer pool has a size of 9.
+* If the pool is not yet at capacity, create a new buffer and add it to the 
pool.
+* If it’s at capacity, check if any buffers with state = done can be released. 
+Releasing a buffer means removing it from allocated and returning it back to 
the pool of available 
+buffers.
+* If there are no buffers with state = done currently then nothing will be 
released, so retry the
+  above step at a fixed interval a few times till a buffer becomes available.
+* If after multiple retries there are still no available buffers, release a 
buffer in the ready state. 
+The buffer for the block furthest from the current block is released.
+
+Once a buffer has been acquired by `CachingBlockManager`, if the buffer is in 
a *READY* state, it is
+returned. 
+This means that data was already read into this buffer asynchronously by a 
prefetch. 
+If it’s state is *BLANK,* then data is read into it using 

Review Comment:
   we can move out the comma or drop it here
   
   ```suggestion
   If it’s state is *BLANK* then data is read into it using 
   ```



##########
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/prefetching.md:
##########
@@ -0,0 +1,192 @@
+<!---
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+
+# S3A Prefetching
+
+This document explains the `S3PrefetchingInputStream` and the various 
components it uses.
+
+This input stream implements prefetching and caching to improve read 
performance of the input
+stream.
+A high level overview of this feature was published in
+[Pinterest Engineering's blog post titled "Improving efficiency and reducing 
runtime using S3 read 
optimization"](https://medium.com/pinterest-engineering/improving-efficiency-and-reducing-runtime-using-s3-read-optimization-b31da4b60fa0).
+
+With prefetching, the input stream divides the remote file into blocks of a 
fixed size, associates
+buffers to these blocks and then reads data into these buffers asynchronously. 
+It also potentially caches these blocks.
+
+### Basic Concepts
+
+* **Remote File**: A binary blob of data stored on some storage device.
+* **Block File**: Local file containing a block of the remote file.
+* **Block**: A file is divided into a number of blocks. 
+The size of the first n-1 blocks is same, and the size of the last block may 
be same or smaller.
+* **Block based reading**: The granularity of read is one block. 
+That is, either an entire block is read and returned or none at all. 
+Multiple blocks may be read in parallel.
+
+### Configuring the stream
+
+|Property    |Meaning    |Default    |
+|---   |---    |---    |
+|`fs.s3a.prefetch.enabled`    |Enable the prefetch input stream    |`true` |
+|`fs.s3a.prefetch.block.size`    |Size of a block    |`8M`    |
+|`fs.s3a.prefetch.block.count`    |Number of blocks to prefetch    |`8`    |
+
+### Key Components
+
+`S3PrefetchingInputStream` - When prefetching is enabled, S3AFileSystem will 
return an instance of
+this class as the input stream. 
+Depending on the remote file size, it will either use
+the `S3InMemoryInputStream` or the `S3CachingInputStream` as the underlying 
input stream.
+
+`S3InMemoryInputStream` - Underlying input stream used when the remote file 
size < configured block
+size. 
+Will read the entire remote file into memory.
+
+`S3CachingInputStream` - Underlying input stream used when remote file size > 
configured block size.
+Uses asynchronous prefetching of blocks and caching to improve performance.
+
+`BlockData` - Holds information about the blocks in a remote file, such as:
+
+* Number of blocks in the remote file
+* Block size
+* State of each block (initially all blocks have state *NOT_READY*). 
+Other states are: Queued, Ready, Cached.
+
+`BufferData` - Holds the buffer and additional information about it such as:
+
+* The block number this buffer is for
+* State of the buffer (Unknown, Blank, Prefetching, Caching, Ready, Done). 
+Initial state of a buffer is blank.
+
+`CachingBlockManager` - Implements reading data into the buffer, prefetching 
and caching.
+
+`BufferPool` - Manages a fixed sized pool of buffers. 
+It’s used by `CachingBlockManager` to acquire buffers.
+
+`S3File` - Implements operations to interact with S3 such as opening and 
closing the input stream to
+the remote file in S3.
+
+`S3Reader` - Implements reading from the stream opened by `S3File`. 
+Reads from this input stream in blocks of 64KB.
+
+`FilePosition` - Provides functionality related to tracking the position in 
the file. 
+Also gives access to the current buffer in use.
+
+`SingleFilePerBlockCache` - Responsible for caching blocks to the local file 
system. 
+Each cache block is stored on the local disk as a separate block file.
+
+### Operation
+
+#### S3InMemoryInputStream
+
+For a remote file with size 5MB, and block size = 8MB, since file size is less 
than the block size,
+the `S3InMemoryInputStream` will be used.
+
+If the caller makes the following read calls:
+
+```
+in.read(buffer, 0, 3MB);
+in.read(buffer, 0, 2MB);
+```
+
+When the first read is issued, there is no buffer in use yet. 
+The `S3InMemoryInputStream` gets the data in this remote file by calling the 
`ensureCurrentBuffer()` 
+method, which ensures that a buffer with data is available to be read from.
+
+The `ensureCurrentBuffer()` then:
+
+* Reads data into a buffer by calling `S3Reader.read(ByteBuffer buffer, long 
offset, int size)`.
+* `S3Reader` uses `S3File` to open an input stream to the remote file in S3 by 
making
+  a `getObject()` request with range as `(0, filesize)`.
+* The `S3Reader` reads the entire remote file into the provided buffer, and 
once reading is complete
+  closes the S3 stream and frees all underlying resources.
+* Now the entire remote file is in a buffer, set this data in `FilePosition` 
so it can be accessed
+  by the input stream.
+
+The read operation now just gets the required bytes from the buffer in 
`FilePosition`.
+
+When the second read is issued, there is already a valid buffer which can be 
used. 
+Don’t do anything else, just read the required bytes from this buffer.
+
+#### S3CachingInputStream
+
+If there is a remote file with size 40MB and block size = 8MB, the 
`S3CachingInputStream` will be
+used.
+
+##### Sequential Reads
+
+If the caller makes the following calls:
+
+```
+in.read(buffer, 0, 5MB)
+in.read(buffer, 0, 8MB)
+```
+
+For the first read call, there is no valid buffer yet. 
+`ensureCurrentBuffer()` is called, and for the first `read()`, prefetch count 
is set as 1.
+
+The current block (block 0) is read synchronously, while the blocks to be 
prefetched (block 1) is
+read asynchronously.
+
+The `CachingBlockManager` is responsible for getting buffers from the buffer 
pool and reading data
+into them. This process of acquiring the buffer pool works as follows:
+
+* The buffer pool keeps a map of allocated buffers and a pool of available 
buffers. 
+The size of this pool is = prefetch block count + 1. 
+If the prefetch block count is 8, the buffer pool has a size of 9.
+* If the pool is not yet at capacity, create a new buffer and add it to the 
pool.
+* If it’s at capacity, check if any buffers with state = done can be released. 

Review Comment:
   I noticed Yetus formats these `’` weirdly. Can we switch to standard single 
quote where-ever `’` occurs?





Issue Time Tracking
-------------------

    Worklog Id:     (was: 762239)
    Time Spent: 2h 10m  (was: 2h)

> document use and architecture design of prefetching s3a input stream
> --------------------------------------------------------------------
>
>                 Key: HADOOP-18177
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18177
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: documentation, fs/s3
>    Affects Versions: 3.4.0
>            Reporter: Steve Loughran
>            Assignee: Ahmar Suhail
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Document S3PrefetchingInputStream for users  (including any new failure modes 
> in troubleshooting) and the architecture for maintainers
> there's some markdown in 
> hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/read/README.md 
> already



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org

[jira] [Work logged] (HADOOP-18177) document use and architecture design of prefetching s3a input stream

Reply via email to