zilder opened a new issue, #4118:
URL: https://github.com/apache/arrow-rs/issues/4118
**Describe the bug**
I'm not sure this is a bug, but `arrow-rs` version `37` appears to perform
more read operations on Parquet files than version `19` (which we have
been using so far). Some of the byte ranges overlap (see the
output below). For context, we use a custom implementation of `ChunkReader`
with `ParquetRecordBatchReader` (and with `SerializedFileReader` in `v19`) to
access S3 storage. Here's a reduced implementation:
```rust
pub struct S3Request {
    client: Client,
    bucket: String,
    key: String,
    len: u64,
    rt: Runtime,
}

impl ChunkReader for S3Request {
    type T = ByteBuf;

    fn get_read(
        &self,
        start: u64,
        length: usize,
    ) -> Result<Self::T, parquet::errors::ParquetError> {
        // HTTP byte ranges are inclusive on both ends.
        let end = start + length as u64 - 1;
        println!("S3Request::get_read(): {}, {}", start, end);
        let data = self
            .rt
            .block_on(async {
                let resp = match self
                    .client
                    .get_object()
                    .bucket(&self.bucket)
                    .key(&self.key)
                    .range(format!("bytes={}-{}", start, end))
                    .send()
                    .await
                {
                    Ok(r) => r,
                    Err(e) => {
                        panic!("{}", e);
                    },
                };
                resp.body.collect().await
            })
            .unwrap();
        Ok(ByteBuf(data))
    }
}
```
(I added `println!("S3Request::get_read(): {}, {}", start, end);` to track
each read operation.)
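As a side note on the range arithmetic in `get_read` above: the printed `end` is inclusive (`start + length - 1`), which matches how HTTP `Range` headers work but would underflow for a zero-length read at offset 0. A minimal sketch of a guarded version follows; `range_header` is a hypothetical helper name, not part of `parquet` or the AWS SDK, and returning `None` for an empty range is an assumption about how a caller would want to handle it.

```rust
/// Hypothetical helper: builds an HTTP `Range` header value covering
/// `length` bytes starting at byte offset `start`.
///
/// The end offset is inclusive, so a read of `length` bytes ends at
/// `start + length - 1`. Returns `None` when `length` is zero (there is
/// no valid inclusive range for that) or when the end offset would
/// overflow `u64`.
fn range_header(start: u64, length: usize) -> Option<String> {
    if length == 0 {
        return None;
    }
    // `length as u64 - 1` cannot underflow here because length >= 1.
    let end = start.checked_add(length as u64 - 1)?;
    Some(format!("bytes={}-{}", start, end))
}
```

With this helper, a read of 118 bytes at offset 4 maps to `bytes=4-121`, the same range that appears in the logs below.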
In the output we get 8 read operations (`v37`):
```
S3Request::get_read(): 2359, 2366
S3Request::get_read(): 435, 2358
S3Request::get_read(): 4, 121
S3Request::get_read(): 18, 121
S3Request::get_read(): 43, 121
S3Request::get_read(): 214, 331
S3Request::get_read(): 228, 331
S3Request::get_read(): 253, 331
+----+-------------+
| ts | temperature |
+----+-------------+
| 1  | 111         |
| 5  | 555         |
+----+-------------+
```
With the same implementation, we get only 4 read operations using
`SerializedFileReader` and `ParquetFileArrowReader` (in `v19`):
```
S3Request::get_read(): 2359, 2366
S3Request::get_read(): 435, 2358
S3Request::get_read(): 4, 121
S3Request::get_read(): 214, 331
+----+-------------+
| ts | temperature |
+----+-------------+
| 1  | 111         |
| 5  | 555         |
+----+-------------+
```
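To make the overlap concrete, here is a small standalone sketch that takes the inclusive ranges from the two logs above and reports which requests are fully covered by an earlier one. It is only an analysis aid and does not touch `parquet` at all; `ByteRange`, `redundant_reads`, `total_bytes`, `v37_reads`, and `v19_reads` are names made up for this illustration.

```rust
/// Inclusive byte range, as printed by the `println!` in `get_read`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ByteRange {
    start: u64,
    end: u64,
}

impl ByteRange {
    /// Number of bytes requested (both ends are inclusive).
    fn len(&self) -> u64 {
        self.end - self.start + 1
    }

    /// True if `other` lies entirely inside `self`.
    fn contains(&self, other: &ByteRange) -> bool {
        self.start <= other.start && other.end <= self.end
    }
}

/// Indices of requests whose range was already fully covered by some
/// earlier request in the same log.
fn redundant_reads(reads: &[ByteRange]) -> Vec<usize> {
    (0..reads.len())
        .filter(|&i| reads[..i].iter().any(|prev| prev.contains(&reads[i])))
        .collect()
}

/// Total number of bytes requested across all reads.
fn total_bytes(reads: &[ByteRange]) -> u64 {
    reads.iter().map(ByteRange::len).sum()
}

/// Ranges copied from the `v37` log above.
fn v37_reads() -> Vec<ByteRange> {
    [(2359, 2366), (435, 2358), (4, 121), (18, 121),
     (43, 121), (214, 331), (228, 331), (253, 331)]
        .iter()
        .map(|&(start, end)| ByteRange { start, end })
        .collect()
}

/// Ranges copied from the `v19` log above.
fn v19_reads() -> Vec<ByteRange> {
    [(2359, 2366), (435, 2358), (4, 121), (214, 331)]
        .iter()
        .map(|&(start, end)| ByteRange { start, end })
        .collect()
}
```

Applied to the logs, `redundant_reads(&v37_reads())` flags the reads at `18..121`, `43..121`, `228..331`, and `253..331`, each of which is a suffix of the request issued just before it, while `redundant_reads(&v19_reads())` is empty.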
Was that an intended change?