gaborgsomogyi commented on PR #28112:
URL: https://github.com/apache/flink/pull/28112#issuecomment-4390447597

   Hi @Samrat002, great work on this PR - the lazy seek and range request 
approach is a solid foundation. While reviewing I found one correctness issue 
in `lazySeek()` worth fixing.          
                                                                                
                                                                                
                                                                                
       
   What works well                                                              
                                                                                
                                                                                
   
                                                                                
                                                                                
                                                                                
       
   A forward seek that lands within the read-ahead buffer is served entirely 
from the local buffer with zero underlying stream access. The attached patch 
includes `seekWithinBuffer_afterSmallRead_doesNotTouchUnderlyingStream` which 
demonstrates and passes this case.                                              
                                                                                
                                                                                
               
                                                                                
                                                                                
                                                                                
       
   Known trade-off: bulk reads bypass the BufferedInputStream buffer            
                                                                                
                                                                                
     
                                                                                
                                                                                
                                                                                
       
   `BufferedInputStream.read(byte[], off, len)` has an internal fast-path: when 
`len >= bufferSize` it reads directly from the underlying stream, bypassing the 
local array entirely. After such a read the buffer is empty, so a small forward 
seek costs exactly the seek distance in underlying stream bytes. 
`seekWithinBuffer_afterLargeRead_touchesUnderlyingStream` in the patch 
documents and passes this, asserting `bytesSkippedFromUnderlying == 10`. This 
is an acceptable trade-off - bulk  reads perform well for large sequential 
access.                                                                         
                                                                                
                                          
   
   Actual bug: `forwardSeekLimit` is unbounded                                  
                                                                                
                                                                                
       
      
   `long forwardSeekLimit = Math.max((long) readBufferSize, (long) 
bufferedStream.available());`
                                                                                
                                                                                
                                                                                
       
   `bufferedStream.available()` returns `(localBufferBytes) + in.available()`, 
where `in` is the live `ResponseInputStream`. If the underlying HTTP client 
reports a large `available()` - in the worst case the full remaining content 
length - `forwardSeekLimit` grows proportionally to the file size. A seek that 
exceeds readBufferSize but falls under this inflated limit triggers 
`skipBytesInBuffer()` instead of reopening with a range request, downloading 
and discarding an unbounded number of bytes from the live HTTP connection.      
                                                                                
                                                                                
                              
                     
   
`seekBeyondReadBufferSize_inflatedByUnderlyingAvailable_reopensWithRangeRequest`
 in the patch demonstrates this - it currently fails because `getObjectCalls()` 
is 1 instead of 2 and `bytesSkippedFromUnderlying` is non-zero.                 
        
   
   
[buffer-efficiency-tests.patch](https://github.com/user-attachments/files/27450058/buffer-efficiency-tests.patch)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to