gaborgsomogyi commented on PR #28112: URL: https://github.com/apache/flink/pull/28112#issuecomment-4390447597
Hi @Samrat002, great work on this PR - the lazy seek and range request
approach is a solid foundation. While reviewing I found one correctness issue
in `lazySeek()` worth fixing.
What works well
A forward seek that lands within the read-ahead buffer is served entirely
from the local buffer with zero underlying stream access. The attached patch
includes `seekWithinBuffer_afterSmallRead_doesNotTouchUnderlyingStream` which
demonstrates and passes this case.
Known trade-off: bulk reads bypass the BufferedInputStream buffer
`BufferedInputStream.read(byte[], off, len)` has an internal fast-path: when
`len >= bufferSize` it reads directly from the underlying stream, bypassing the
local array entirely. After such a read the buffer is empty, so a small forward
seek costs exactly the seek distance in underlying stream bytes.
`seekWithinBuffer_afterLargeRead_touchesUnderlyingStream` in the patch
documents and passes this, asserting `bytesSkippedFromUnderlying == 10`. This
is an acceptable trade-off - bulk reads perform well for large sequential
access.
Actual bug: `forwardSeekLimit` is unbounded
`long forwardSeekLimit = Math.max((long) readBufferSize, (long)
bufferedStream.available());`
`bufferedStream.available()` returns `(localBufferBytes) + in.available()`,
where `in` is the live `ResponseInputStream`. If the underlying HTTP client
reports a large `available()` - in the worst case the full remaining content
length - `forwardSeekLimit` grows proportionally to the file size. A seek that
exceeds readBufferSize but falls under this inflated limit triggers
`skipBytesInBuffer()` instead of reopening with a range request, downloading
and discarding an unbounded number of bytes from the live HTTP connection.
`seekBeyondReadBufferSize_inflatedByUnderlyingAvailable_reopensWithRangeRequest`
in the patch demonstrates this - it currently fails because `getObjectCalls()`
is 1 instead of 2 and `bytesSkippedFromUnderlying` is non-zero.
[buffer-efficiency-tests.patch](https://github.com/user-attachments/files/27450058/buffer-efficiency-tests.patch)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
