Re: [PR] RFC: Buffered reader [incubator-opendal]

via GitHub Sun, 10 Dec 2023 23:04:54 -0800


Xuanwo commented on code in PR #3734:
URL: 
https://github.com/apache/incubator-opendal/pull/3734#discussion_r1422010283



##########
core/src/docs/rfcs/3734_buffered_reader.md:
##########
@@ -0,0 +1,113 @@
+- Proposal Name: `buffered_reader`
+- Start Date: 2023-12-10
+- RFC PR: 
[apache/incubator-opendal#3574](https://github.com/apache/incubator-opendal/pull/3734)
+- Tracking Issue: 
[apache/incubator-opendal#3575](https://github.com/apache/incubator-opendal/issues/3735)
+
+# Summary
+
+Amortize overhead of IO.
+
+# Motivation
+
+The aim is to amortize the overhead of IO. Some scenarios, such as importing 
data to the database, typically require supporting multiple file formats, and 
the size of files varies(very large or very small). Most readers of these 
formats may behave similarly to the `ParquetStream`; they read and decode data 
piece after piece.
+
+```
+File
+┌───┬───┬───┬──┐
+│   │   │   │  │
+└───┴───┴───┘▲─┘
+             │
+             │ 1. SeekFromEnd(8)
+
+               2. ReadToEnd(limit)
+```
+
+However, OpenDAL will yield an IO request after `seek()` and `read()` are 
invoked. For storage services such as S3, The 
[research](https://www.vldb.org/pvldb/vol16/p2769-durner.pdf) shows that the 
8MiB~16MiB is the preferred IO size, If the IO size is too small, the TTFB 
dominates the overall runtime; too large will reach the bandwidth limit.
+
+Therefore, This RFC proposes a buffered reader to amortize the overhead of IO.
+
+# Guide-level explanation
+
+For users who want to buffered reader, they will call the new API `buffer`. 
And the default behavior remains unchanged, so users using `op.reader_with()` 
are not affected. The `buffer` function will take a number as input, and this 
number will represent the maximum buffer capability the reader is able to use. 
+
+```rust
+op.reader_with(path).buffer(32 * 1024 * 1024).await
+```
+
+If a user prefers to customize the `Preferred IO Size`, it provides a 
`read_size` option (default is 8MiB):
+
+```rust
+op.reader_with(path).buffer(32 * 1024 * 1024).read_size(8 * 1024 * 1024).await
+```
+
+# Reference-level explanation
+
+There is an invariant: the `Buffer Capability` is at least greater than or 
equal to the `Preferred IO Size`.

Review Comment:
   Let's focus on how buffer works here. I believe it's different with 
`preferred io size`.



##########
core/src/docs/rfcs/3734_buffered_reader.md:
##########
@@ -0,0 +1,113 @@
+- Proposal Name: `buffered_reader`
+- Start Date: 2023-12-10
+- RFC PR: 
[apache/incubator-opendal#3574](https://github.com/apache/incubator-opendal/pull/3734)
+- Tracking Issue: 
[apache/incubator-opendal#3575](https://github.com/apache/incubator-opendal/issues/3735)
+
+# Summary
+
+Amortize overhead of IO.
+
+# Motivation
+
+The aim is to amortize the overhead of IO. Some scenarios, such as importing 
data to the database, typically require supporting multiple file formats, and 
the size of files varies(very large or very small). Most readers of these 
formats may behave similarly to the `ParquetStream`; they read and decode data 
piece after piece.
+
+```
+File
+┌───┬───┬───┬──┐
+│   │   │   │  │
+└───┴───┴───┘▲─┘
+             │
+             │ 1. SeekFromEnd(8)
+
+               2. ReadToEnd(limit)
+```
+
+However, OpenDAL will yield an IO request after `seek()` and `read()` are 
invoked. For storage services such as S3, The 
[research](https://www.vldb.org/pvldb/vol16/p2769-durner.pdf) shows that the 
8MiB~16MiB is the preferred IO size, If the IO size is too small, the TTFB 
dominates the overall runtime; too large will reach the bandwidth limit.
+
+Therefore, This RFC proposes a buffered reader to amortize the overhead of IO.
+
+# Guide-level explanation
+
+For users who want to buffered reader, they will call the new API `buffer`. 
And the default behavior remains unchanged, so users using `op.reader_with()` 
are not affected. The `buffer` function will take a number as input, and this 
number will represent the maximum buffer capability the reader is able to use. 
+
+```rust
+op.reader_with(path).buffer(32 * 1024 * 1024).await
+```
+
+If a user prefers to customize the `Preferred IO Size`, it provides a 
`read_size` option (default is 8MiB):
+
+```rust
+op.reader_with(path).buffer(32 * 1024 * 1024).read_size(8 * 1024 * 1024).await

Review Comment:
   `read_size` seems not a good API and is not related to `buffer` API, how 
about remove from this RFC?



##########
core/src/docs/rfcs/3734_buffered_reader.md:
##########
@@ -0,0 +1,113 @@
+- Proposal Name: `buffered_reader`
+- Start Date: 2023-12-10
+- RFC PR: 
[apache/incubator-opendal#3574](https://github.com/apache/incubator-opendal/pull/3734)
+- Tracking Issue: 
[apache/incubator-opendal#3575](https://github.com/apache/incubator-opendal/issues/3735)
+
+# Summary
+
+Amortize overhead of IO.
+
+# Motivation
+
+The aim is to amortize the overhead of IO. Some scenarios, such as importing 
data to the database, typically require supporting multiple file formats, and 
the size of files varies(very large or very small). Most readers of these 
formats may behave similarly to the `ParquetStream`; they read and decode data 
piece after piece.
+
+```
+File
+┌───┬───┬───┬──┐
+│   │   │   │  │
+└───┴───┴───┘▲─┘
+             │
+             │ 1. SeekFromEnd(8)
+
+               2. ReadToEnd(limit)
+```
+
+However, OpenDAL will yield an IO request after `seek()` and `read()` are 
invoked. For storage services such as S3, The 
[research](https://www.vldb.org/pvldb/vol16/p2769-durner.pdf) shows that the 
8MiB~16MiB is the preferred IO size, If the IO size is too small, the TTFB 
dominates the overall runtime; too large will reach the bandwidth limit.

Review Comment:
   The default io size is different for different services, let's take `fs` and 
`s3` as examples:
   
   - `fs`: opendal forward the `buf` directly, so the io size is controlled by 
users input.
   - `s3`: opendal can't control the underlying read io size (which is usually 
controlled by  system's TCP stack). OpenDAL sends requests with the whole range.
   
   So there are two problems:
   
   - For `fs`, the io size could be too small. For example, users reading data 
with a `&mut [u8:1]`
   - For `s3`, the io size could be too large. For example, users only want to 
read 1MiB, but we send a request for the whole file size.
   
   `buffer` can give users a way to control the underlying read behavior. 
Besides, buffer allows user to seek inside already fetched data without sending 
new requests.
   



##########
core/src/docs/rfcs/3734_buffered_reader.md:
##########
@@ -0,0 +1,113 @@
+- Proposal Name: `buffered_reader`
+- Start Date: 2023-12-10
+- RFC PR: 
[apache/incubator-opendal#3574](https://github.com/apache/incubator-opendal/pull/3734)
+- Tracking Issue: 
[apache/incubator-opendal#3575](https://github.com/apache/incubator-opendal/issues/3735)
+
+# Summary
+
+Amortize overhead of IO.
+
+# Motivation
+
+The aim is to amortize the overhead of IO. Some scenarios, such as importing 
data to the database, typically require supporting multiple file formats, and 
the size of files varies(very large or very small). Most readers of these 
formats may behave similarly to the `ParquetStream`; they read and decode data 
piece after piece.
+
+```
+File
+┌───┬───┬───┬──┐
+│   │   │   │  │
+└───┴───┴───┘▲─┘
+             │
+             │ 1. SeekFromEnd(8)
+
+               2. ReadToEnd(limit)
+```
+
+However, OpenDAL will yield an IO request after `seek()` and `read()` are 
invoked. For storage services such as S3, The 
[research](https://www.vldb.org/pvldb/vol16/p2769-durner.pdf) shows that the 
8MiB~16MiB is the preferred IO size, If the IO size is too small, the TTFB 
dominates the overall runtime; too large will reach the bandwidth limit.
+
+Therefore, This RFC proposes a buffered reader to amortize the overhead of IO.
+
+# Guide-level explanation
+
+For users who want to buffered reader, they will call the new API `buffer`. 
And the default behavior remains unchanged, so users using `op.reader_with()` 
are not affected. The `buffer` function will take a number as input, and this 
number will represent the maximum buffer capability the reader is able to use. 
+
+```rust
+op.reader_with(path).buffer(32 * 1024 * 1024).await
+```
+
+If a user prefers to customize the `Preferred IO Size`, it provides a 
`read_size` option (default is 8MiB):

Review Comment:
   I believe they should be separate entities and we shouldn't include them in 
the same RFCs.



##########
core/src/docs/rfcs/3734_buffered_reader.md:
##########
@@ -0,0 +1,113 @@
+- Proposal Name: `buffered_reader`
+- Start Date: 2023-12-10
+- RFC PR: 
[apache/incubator-opendal#3574](https://github.com/apache/incubator-opendal/pull/3734)
+- Tracking Issue: 
[apache/incubator-opendal#3575](https://github.com/apache/incubator-opendal/issues/3735)
+
+# Summary
+
+Amortize overhead of IO.
+
+# Motivation
+
+The aim is to amortize the overhead of IO. Some scenarios, such as importing 
data to the database, typically require supporting multiple file formats, and 
the size of files varies(very large or very small). Most readers of these 
formats may behave similarly to the `ParquetStream`; they read and decode data 
piece after piece.
+
+```
+File
+┌───┬───┬───┬──┐
+│   │   │   │  │
+└───┴───┴───┘▲─┘
+             │
+             │ 1. SeekFromEnd(8)
+
+               2. ReadToEnd(limit)
+```
+
+However, OpenDAL will yield an IO request after `seek()` and `read()` are 
invoked. For storage services such as S3, The 
[research](https://www.vldb.org/pvldb/vol16/p2769-durner.pdf) shows that the 
8MiB~16MiB is the preferred IO size, If the IO size is too small, the TTFB 
dominates the overall runtime; too large will reach the bandwidth limit.
+
+Therefore, This RFC proposes a buffered reader to amortize the overhead of IO.
+
+# Guide-level explanation
+
+For users who want to buffered reader, they will call the new API `buffer`. 
And the default behavior remains unchanged, so users using `op.reader_with()` 
are not affected. The `buffer` function will take a number as input, and this 
number will represent the maximum buffer capability the reader is able to use. 
+
+```rust
+op.reader_with(path).buffer(32 * 1024 * 1024).await
+```
+
+If a user prefers to customize the `Preferred IO Size`, it provides a 
`read_size` option (default is 8MiB):
+
+```rust
+op.reader_with(path).buffer(32 * 1024 * 1024).read_size(8 * 1024 * 1024).await
+```
+
+# Reference-level explanation
+
+There is an invariant: the `Buffer Capability` is at least greater than or 
equal to the `Preferred IO Size`.
+
+**Buffering**
+
+The target file will be split into multiple segments to align with the 
`Preferred IO Size`, and the buffered segments also align with the `Preferred 
IO Size` for efficient lookup.

Review Comment:
   Will this RFC cover `multiple segments`? I feel like they should be another 
topic?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] RFC: Buffered reader [incubator-opendal]

Reply via email to