Xuanwo commented on code in PR #3734: URL: https://github.com/apache/incubator-opendal/pull/3734#discussion_r1423943001
########## core/src/docs/rfcs/3734_buffered_reader.md: ########## @@ -0,0 +1,150 @@ +- Proposal Name: `buffered_reader` +- Start Date: 2023-12-10 +- RFC PR: [apache/incubator-opendal#3574](https://github.com/apache/incubator-opendal/pull/3734) +- Tracking Issue: [apache/incubator-opendal#3575](https://github.com/apache/incubator-opendal/issues/3735) + +# Summary + +Allowing the underlying reader to fetch data at the buffer's size to amortize the IO's overhead. + +# Motivation + +The objective is to mitigate the IO overhead. In certain scenarios, the reader processes the data incrementally, meaning that it utilizes the `seek()` function to navigate to a specific position within the file. Subsequently, it invokes the `read()` to reads `limit` bytes into memory and performs the decoding process. + +``` +File +┌───┬───┬───┬──┐ +│ │ │ │ │ +└───┴───┴───┘▲─┘ + │ + │ 1. SeekFromEnd(8) + + 2. ReadToEnd(limit) +``` + +OpenDAL triggers an IO request upon invoking `read()` if the `seek()` has reset the inner state. For storage services like S3, [research](https://www.vldb.org/pvldb/vol16/p2769-durner.pdf) suggests that an optimal IO size falls within the range of 8MiB to 16MiB. If the IO size is too small, the Time To First Byte (TTFB) dominates the overall runtime, resulting in inefficiency. + +Therefore, this RFC proposes the implementation of a buffered reader to amortize the overhead of IO. + +# Guide-level explanation + +For users who want to buffered reader, they will call the new API `buffer`. And the default behavior remains unchanged, so users using `op.reader_with()` are not affected. The `buffer` function will take a number as input, and this number will represent the maximum buffer capability the reader is able to use. + +```rust +op.reader_with(path).buffer(32 * 1024 * 1024).await +``` + +# Reference-level explanation Review Comment: *Reference-level explanation* should go deep into the code to explain how will you implement this feature in general. For example: - Will you need to add a new struct? - Will you finish those work at CompleteLayer? - How will you make it works for all services? ########## core/src/docs/rfcs/3734_buffered_reader.md: ########## @@ -0,0 +1,150 @@ +- Proposal Name: `buffered_reader` +- Start Date: 2023-12-10 +- RFC PR: [apache/incubator-opendal#3574](https://github.com/apache/incubator-opendal/pull/3734) +- Tracking Issue: [apache/incubator-opendal#3575](https://github.com/apache/incubator-opendal/issues/3735) + +# Summary + +Allowing the underlying reader to fetch data at the buffer's size to amortize the IO's overhead. + +# Motivation + +The objective is to mitigate the IO overhead. In certain scenarios, the reader processes the data incrementally, meaning that it utilizes the `seek()` function to navigate to a specific position within the file. Subsequently, it invokes the `read()` to reads `limit` bytes into memory and performs the decoding process. + +``` +File Review Comment: Could you elaborate on how a buffer will be beneficial in this scenario? ```rust let r = op.reader(path).await; r.seek(SeekFrom::End(-8)).await; r.read_to_end(&Vec::new()).await; ``` Seems buffer is not helpful. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
