zilder opened a new issue, #4118:
URL: https://github.com/apache/arrow-rs/issues/4118

   **Describe the bug**
   Not sure that it's a bug, but it seems that `arrow-rs` version `37` performs 
more read operations on parquet files than version `19` (which we have 
been using so far). Some of the byte ranges overlap (see the 
output below). For context, we use a custom implementation of `ChunkReader` 
with `ParquetRecordBatchReader` (and with `SerializedFileReader` in `v19`) to 
access S3 storage. Here's a reduced implementation:
   
   ```rust
   pub struct S3Request {
       client: Client,
       bucket: String,
       key: String,
       len: u64,
       rt: Runtime,
   }
   
   impl ChunkReader for S3Request {
       type T = ByteBuf;
   
       fn get_read(
           &self,
           start: u64,
           length: usize,
       ) -> Result<Self::T, parquet::errors::ParquetError> {
           let end = start + length as u64 - 1;
           println!("S3Request::get_read(): {}, {}", start, end);
   
           let data = self
               .rt
               .block_on(async {
                   let resp = match self
                       .client
                       .get_object()
                       .bucket(&self.bucket)
                       .key(&self.key)
                       .range(format!("bytes={}-{}", start, end))
                       .send()
                       .await
                   {
                       Ok(r) => r,
                       Err(e) => {
                           panic!("{}", e);
                       },
                   };
   
                   resp.body.collect().await
               })
               .unwrap();
   
           Ok(ByteBuf(data))
       }
   }
   ```
   (I added `println!("S3Request::get_read(): {}, {}", start, end);` to track 
each read operation.)
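
For reference, the logged pairs are inclusive byte offsets, because HTTP `Range` headers include both ends. Below is a minimal, std-only sketch of that computation. The helper name `range_header` is hypothetical and not part of the reduced implementation above; unlike the reduced code, it also guards against a zero-length request and against overflow.

```rust
/// Builds an HTTP `Range` header value for `length` bytes starting at `start`.
/// HTTP byte ranges are inclusive on both ends, hence the `- 1`.
/// Hypothetical helper: returns `None` for a zero-length request or on overflow.
fn range_header(start: u64, length: usize) -> Option<String> {
    if length == 0 {
        // `start + 0 - 1` would underflow when `start` is 0.
        return None;
    }
    let end = start.checked_add(length as u64 - 1)?;
    Some(format!("bytes={}-{}", start, end))
}

fn main() {
    // 118 bytes starting at offset 4 cover the inclusive range 4..=121.
    println!("{:?}", range_header(4, 118));
    // A zero-length read has no valid inclusive range.
    println!("{:?}", range_header(0, 0));
}
```

With this convention, a logged line such as `4, 121` corresponds to a call with `start = 4` and `length = 118`.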
   
   In the output we get 8 read operations (`v37`):
   ```
   S3Request::get_read(): 2359, 2366
   S3Request::get_read(): 435, 2358
   S3Request::get_read(): 4, 121
   S3Request::get_read(): 18, 121
   S3Request::get_read(): 43, 121
   S3Request::get_read(): 214, 331
   S3Request::get_read(): 228, 331
   S3Request::get_read(): 253, 331
   +----+-------------+
   | ts | temperature |
   +----+-------------+
   | 1  | 111         |
   | 5  | 555         |
   +----+-------------+
   ```
   With the same implementation, we get only 4 read operations using 
`SerializedFileReader` and `ParquetFileArrowReader` (in `v19`):
   ```
   S3Request::get_read(): 2359, 2366
   S3Request::get_read(): 435, 2358
   S3Request::get_read(): 4, 121
   S3Request::get_read(): 214, 331
   +----+-------------+
   | ts | temperature |
   +----+-------------+
   | 1  | 111         |
   | 5  | 555         |
   +----+-------------+
   ```
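
To make the overlap concrete, here is a small std-only sketch that checks which logged reads are fully covered by an earlier read in the same log. It only analyzes the printed ranges and says nothing about why `arrow-rs` issues them; the names `ByteRange` and `redundant_reads` are hypothetical.

```rust
/// An inclusive byte range as printed by the logging line, e.g. `(4, 121)`.
type ByteRange = (u64, u64);

/// Returns the indices of reads that are fully covered by an earlier read in the log.
fn redundant_reads(reads: &[ByteRange]) -> Vec<usize> {
    let mut redundant = Vec::new();
    for (i, &(start, end)) in reads.iter().enumerate() {
        // A read is redundant if some earlier read starts at or before it
        // and ends at or after it.
        let covered = reads[..i].iter().any(|&(s, e)| s <= start && end <= e);
        if covered {
            redundant.push(i);
        }
    }
    redundant
}

fn main() {
    // Ranges copied from the v37 and v19 outputs above.
    let v37: [ByteRange; 8] = [
        (2359, 2366), (435, 2358), (4, 121), (18, 121),
        (43, 121), (214, 331), (228, 331), (253, 331),
    ];
    let v19: [ByteRange; 4] = [(2359, 2366), (435, 2358), (4, 121), (214, 331)];

    println!("v37 redundant reads: {:?}", redundant_reads(&v37));
    println!("v19 redundant reads: {:?}", redundant_reads(&v19));
}
```

Under this definition, the four extra reads in `v37` (`18, 121`, `43, 121`, `228, 331`, `253, 331`) each fall inside a range that was already fetched, while none of the `v19` reads do.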
   
   Was that an intended change?
   

