This is an automated email from the ASF dual-hosted git repository.
dheres pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-rs.git
The following commit(s) were added to refs/heads/main by this push:
new ba3446bb90 [Parquet] perf: reuse seeked File clone in
ChunkReader::get_read() (#9214)
ba3446bb90 is described below
commit ba3446bb90cc652a45e909f1caf6a64c39a57609
Author: Florian Valeye <[email protected]>
AuthorDate: Sun Jan 18 21:09:13 2026 +0100
[Parquet] perf: reuse seeked File clone in ChunkReader::get_read() (#9214)
# Which issue does this PR close?
N/A, it's a minor performance fix.
# Rationale for this change
While reviewing Parquet performance, I observed a duplicate
`try_clone()`. I wasn't able to tell why it was required. After
benchmarking and running tests, it seems there is no reason for the
duplication.
`ChunkReader::get_read()` for `File` calls
[`try_clone()`](https://doc.rust-lang.org/std/fs/struct.File.html#method.try_clone)
twice: once to seek, then again for the `BufReader`, discarding the
first clone. This might be wasteful, as each `try_clone()` duplicates
the file descriptor via a system call. Reusing the first clone saves one
descriptor-duplicating syscall (`dup()` or equivalent) per `get_read()` call.
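For context on why dropping the second clone should not change behavior: per the standard library documentation, `File::try_clone()` returns a handle that shares its cursor with the original, so the second clone in the old code already observed the seek made through the first. Below is a minimal sketch of that equivalence using only `std`. The function names and the `sample_file` helper are hypothetical and not part of the parquet crate.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom, Write};

/// Old pattern: seek through one clone, then read through a second clone.
/// The second clone shares the cursor that the first clone moved.
fn read_via_second_clone(file: &File, start: u64) -> io::Result<Vec<u8>> {
    let mut seeker = file.try_clone()?;
    seeker.seek(SeekFrom::Start(start))?;
    let mut second = file.try_clone()?; // extra descriptor duplication
    let mut out = Vec::new();
    second.read_to_end(&mut out)?;
    Ok(out)
}

/// New pattern: seek and read through the same clone.
fn read_via_seeked_clone(file: &File, start: u64) -> io::Result<Vec<u8>> {
    let mut reader = file.try_clone()?;
    reader.seek(SeekFrom::Start(start))?;
    let mut out = Vec::new();
    reader.read_to_end(&mut out)?;
    Ok(out)
}

/// Hypothetical helper: writes `contents` to a file in the system temp
/// directory and reopens it read-only.
fn sample_file(name: &str, contents: &[u8]) -> io::Result<File> {
    let path = std::env::temp_dir().join(name);
    File::create(&path)?.write_all(contents)?;
    File::open(&path)
}
```

Both functions should return the same bytes for the same `start`, which is consistent with the existing tests continuing to pass.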
# What changes are included in this PR?
Reuse the already-seeked file clone instead of creating a new one.
# Are these changes tested?
Covered by existing tests.
Local benchmarks using [divan](https://github.com/nvzqz/divan) show ~36%
improvement for `get_read()` calls on my laptop.
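The divan benchmark is not part of this commit. As a rough stand-in, the std-only sketch below times both variants with `Instant`. It is not the author's benchmark: the function names, the `time_calls` helper, and the iteration count are assumptions, and the numbers will vary with OS and hardware.

```rust
use std::fs::File;
use std::io::{self, BufReader, Seek, SeekFrom};
use std::time::{Duration, Instant};

/// Pre-patch shape: two `try_clone()` calls per invocation.
fn get_read_two_clones(file: &File, start: u64) -> io::Result<BufReader<File>> {
    let mut reader = file.try_clone()?;
    reader.seek(SeekFrom::Start(start))?;
    Ok(BufReader::new(file.try_clone()?))
}

/// Post-patch shape: the seeked clone is handed to `BufReader` directly.
fn get_read_one_clone(file: &File, start: u64) -> io::Result<BufReader<File>> {
    let mut reader = file.try_clone()?;
    reader.seek(SeekFrom::Start(start))?;
    Ok(BufReader::new(reader))
}

/// Hypothetical timing helper: runs `f` `iters` times and returns the
/// total elapsed wall-clock time.
fn time_calls(
    file: &File,
    iters: u32,
    f: fn(&File, u64) -> io::Result<BufReader<File>>,
) -> io::Result<Duration> {
    let begin = Instant::now();
    for _ in 0..iters {
        // black_box keeps the optimizer from discarding the result.
        std::hint::black_box(f(file, 0)?);
    }
    Ok(begin.elapsed())
}
```

Calling `time_calls(&file, 100_000, get_read_two_clones)` and `time_calls(&file, 100_000, get_read_one_clone)` on the same open file gives a coarse comparison. A harness such as divan adds warm-up and statistical sampling that this sketch lacks.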
# Are there any user-facing changes?
No.
---
parquet/src/file/reader.rs | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/parquet/src/file/reader.rs b/parquet/src/file/reader.rs
index 3adf10fac2..2b3c46f507 100644
--- a/parquet/src/file/reader.rs
+++ b/parquet/src/file/reader.rs
@@ -93,7 +93,7 @@ impl ChunkReader for File {
     fn get_read(&self, start: u64) -> Result<Self::T> {
         let mut reader = self.try_clone()?;
         reader.seek(SeekFrom::Start(start))?;
-        Ok(BufReader::new(self.try_clone()?))
+        Ok(BufReader::new(reader))
     }
 
     fn get_bytes(&self, start: u64, length: usize) -> Result<Bytes> {