This is an automated email from the ASF dual-hosted git repository.
zeroshade pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-go.git
The following commit(s) were added to refs/heads/main by this push:
new 086b8d2c perf(parquet/pqarrow): cap RecordReader batch size to actual row count (#817)
086b8d2c is described below
commit 086b8d2c78a893ebc40c71636451bb3a2c54a95b
Author: Ondřej Pavela <[email protected]>
AuthorDate: Wed May 20 17:58:42 2026 +0200
perf(parquet/pqarrow): cap RecordReader batch size to actual row count (#817)
### Rationale for this change
`GetRecordReader` passes `BatchSize` directly to the internal `recordReader`
without capping it to the actual number of rows. When `BatchSize` is
configured to a large value (e.g. 131072) but the file or requested row
groups contain only a few rows (e.g. 10), `leafReader.LoadBatch` calls
`Reserve(131072)`, which pre-allocates definition/repetition level buffers
and value buffers sized for the full batch. For a 200-column int64 table
with 10 rows this wastes ~250 MB of allocations.
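
As a rough sanity check on the scale of the problem, the sketch below estimates only the value-buffer portion of that reservation. The helper `valueBufferBytes` is hypothetical and not part of arrow-go; it ignores definition/repetition level buffers and allocator overhead, which presumably make up the rest of the ~250 MB figure.

```go
package main

import "fmt"

// valueBufferBytes is a hypothetical helper (not an arrow-go API). It
// estimates the bytes reserved for value buffers alone when every column
// reserves room for batchSize fixed-width values.
func valueBufferBytes(columns, batchSize, bytesPerValue int64) int64 {
	return columns * batchSize * bytesPerValue
}

func main() {
	const columns, int64Width = 200, 8

	// Reservation sized for the configured batch size of 131072.
	fmt.Println(valueBufferBytes(columns, 131072, int64Width))

	// Reservation sized for 16 slots, the next power of two above 10 rows.
	fmt.Println(valueBufferBytes(columns, 16, int64Width))
}
```

Under these assumptions the value buffers alone account for 209,715,200 bytes (200 MiB) at a batch size of 131072, versus 25,600 bytes when sized for 16 rows.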
### What changes are included in this PR?
When a `BatchSize` is explicitly configured, cap `batchSize` at
`NextPowerOf2(nrows)`, i.e. use `min(BatchSize, NextPowerOf2(nrows))`.
When `BatchSize` is unset (`<= 0`), `batchSize` is still `nrows`, as
before. The power-of-2 rounding keeps allocations aligned with the
downstream `updateCapacity` logic that already rounds to powers of two,
avoiding a redundant reallocation on the first read.
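
The capping rule can be sketched in isolation as follows. This is a minimal illustration, not the code in `file_reader.go`: `capBatchSize` is a hypothetical name, and `nextPowerOf2` is a local stand-in written with `math/bits` rather than the real `bitutil.NextPowerOf2`, whose behaviour at exact powers of two is not verified here.

```go
package main

import (
	"fmt"
	"math/bits"
)

// nextPowerOf2 is a local stand-in (assumption), not bitutil.NextPowerOf2.
// It returns the smallest power of two >= x for x >= 1, and 0 for x <= 0.
func nextPowerOf2(x int64) int64 {
	if x <= 0 {
		return 0
	}
	return 1 << bits.Len64(uint64(x-1))
}

// capBatchSize is a hypothetical helper mirroring the rule described above:
// an unset batch size falls back to nrows, and an explicit one is capped at
// the next power of two of nrows.
func capBatchSize(configured, nrows int64) int64 {
	if configured <= 0 {
		return nrows
	}
	limit := nextPowerOf2(nrows)
	if configured < limit {
		return configured
	}
	return limit
}

func main() {
	fmt.Println(capBatchSize(131072, 10)) // large batch size, few rows
	fmt.Println(capBatchSize(8, 10))      // batch size already below the cap
	fmt.Println(capBatchSize(0, 10))      // batch size not configured
}
```

With these inputs the sketch yields 16, 8, and 10 respectively: the oversized batch is reduced to the rounded row count, while a smaller explicit batch size and the unset case pass through unchanged.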
### Are these changes tested?
Existing tests pass. The change is on the allocation-sizing path only —
read correctness is unaffected since `LoadBatch` already stops reading
when rows are exhausted.
### Are there any user-facing changes?
No
---------
Co-authored-by: Ondřej Pavela <[email protected]>
---
parquet/pqarrow/file_reader.go | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/parquet/pqarrow/file_reader.go b/parquet/pqarrow/file_reader.go
index 34992f8a..b4e46008 100644
--- a/parquet/pqarrow/file_reader.go
+++ b/parquet/pqarrow/file_reader.go
@@ -28,6 +28,7 @@ import (
"github.com/apache/arrow-go/v18/arrow"
"github.com/apache/arrow-go/v18/arrow/array"
"github.com/apache/arrow-go/v18/arrow/arrio"
+ "github.com/apache/arrow-go/v18/arrow/bitutil"
"github.com/apache/arrow-go/v18/arrow/memory"
"github.com/apache/arrow-go/v18/internal/utils"
"github.com/apache/arrow-go/v18/parquet"
@@ -516,9 +517,11 @@ func (fr *FileReader) GetRecordReader(ctx context.Context, colIndices, rowGroups
nrows += fr.rdr.MetaData().RowGroup(rg).NumRows()
}
- batchSize := fr.Props.BatchSize
+ var batchSize int64
if fr.Props.BatchSize <= 0 {
batchSize = nrows
+ } else {
+ batchSize = min(fr.Props.BatchSize, int64(bitutil.NextPowerOf2(int(nrows))))
}
rr := &recordReader{
numRows: nrows,