(arrow-go) branch main updated: perf(parquet/pqarrow): cap RecordReader batch size to actual row count (#817)

zeroshade Wed, 20 May 2026 09:00:34 -0700

This is an automated email from the ASF dual-hosted git repository.

zeroshade pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-go.git



The following commit(s) were added to refs/heads/main by this push:
     new 086b8d2c perf(parquet/pqarrow): cap RecordReader batch size to actual row count (#817)
086b8d2c is described below

commit 086b8d2c78a893ebc40c71636451bb3a2c54a95b
Author: Ondřej Pavela <[email protected]>
AuthorDate: Wed May 20 17:58:42 2026 +0200

    perf(parquet/pqarrow): cap RecordReader batch size to actual row count (#817)
    
    ### Rationale for this change
    `GetRecordReader` passes `BatchSize` directly to the internal
    `recordReader` without capping it to the actual number of rows. When
    `BatchSize` is configured to a large value (e.g. 131072) but the file or
    requested row groups contain few rows (e.g. 10), `leafReader.LoadBatch`
    calls `Reserve(131072)`, which pre-allocates definition/repetition level
    buffers and value buffers sized for the full batch. For a 200-column
    int64 table with 10 rows this wastes ~250 MB of allocations.
    
    ### What changes are included in this PR?
    Cap `batchSize` to `NextPowerOf2(nrows)` when a `BatchSize` is explicitly
    configured. The power-of-2 rounding keeps allocations aligned with the
    downstream `updateCapacity` logic that already rounds to powers of two,
    avoiding a redundant reallocation on the first read.
    
    ### Are these changes tested?
    Existing tests pass. The change is on the allocation-sizing path only —
    read correctness is unaffected since `LoadBatch` already stops reading
    when rows are exhausted.
    
    ### Are there any user-facing changes?
    No
    
    ---------
    
    Co-authored-by: Ondřej Pavela <[email protected]>
---
 parquet/pqarrow/file_reader.go | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)
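
The ~250 MB figure in the rationale above can be roughly reproduced with a little arithmetic. The sketch below is only an estimate under stated assumptions: it assumes each reserved slot costs 8 bytes for the int64 value plus 2 bytes for one level entry, and it ignores validity bitmaps, repetition levels, and allocator rounding. `estimateReserveBytes` is a hypothetical helper written for this illustration and is not part of arrow-go.

```go
package main

import "fmt"

// estimateReserveBytes approximates the bytes reserved for one batch across
// all columns. Assumption (not taken from the commit): every slot costs
// valueWidth bytes for the value plus levelWidth bytes for one level buffer.
func estimateReserveBytes(columns, batchSize, valueWidth, levelWidth int64) int64 {
	return columns * batchSize * (valueWidth + levelWidth)
}

func main() {
	const mib = 1 << 20

	// 200 int64 columns, BatchSize = 131072, file holds only 10 rows.
	uncapped := estimateReserveBytes(200, 131072, 8, 2)
	// Same table with the batch size capped to 16, the next power of two
	// at or above 10 rows.
	capped := estimateReserveBytes(200, 16, 8, 2)

	fmt.Printf("uncapped: %d bytes (%d MiB)\n", uncapped, uncapped/mib)
	fmt.Printf("capped:   %d bytes\n", capped)
}
```

Under these assumptions the uncapped reservation comes to 262,144,000 bytes (250 MiB), while the capped one is 32,000 bytes.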

diff --git a/parquet/pqarrow/file_reader.go b/parquet/pqarrow/file_reader.go
index 34992f8a..b4e46008 100644
--- a/parquet/pqarrow/file_reader.go
+++ b/parquet/pqarrow/file_reader.go
@@ -28,6 +28,7 @@ import (
        "github.com/apache/arrow-go/v18/arrow"
        "github.com/apache/arrow-go/v18/arrow/array"
        "github.com/apache/arrow-go/v18/arrow/arrio"
+       "github.com/apache/arrow-go/v18/arrow/bitutil"
        "github.com/apache/arrow-go/v18/arrow/memory"
        "github.com/apache/arrow-go/v18/internal/utils"
        "github.com/apache/arrow-go/v18/parquet"
@@ -516,9 +517,11 @@ func (fr *FileReader) GetRecordReader(ctx context.Context, colIndices, rowGroups
                nrows += fr.rdr.MetaData().RowGroup(rg).NumRows()
        }
 
-       batchSize := fr.Props.BatchSize
+       var batchSize int64
        if fr.Props.BatchSize <= 0 {
                batchSize = nrows
+       } else {
+               batchSize = min(fr.Props.BatchSize, int64(bitutil.NextPowerOf2(int(nrows))))
        }
        rr := &recordReader{
                numRows:      nrows,
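
The selection logic in the hunk above can be sketched as a standalone function to make its three cases easier to see. This is a minimal illustration, not the library code: `capBatchSize` is a hypothetical name, and `nextPowerOf2` is a local stand-in for `bitutil.NextPowerOf2` so the example runs without importing arrow-go. The stand-in rounds up to the nearest power of two and returns the input unchanged when it is already one; the library function's behavior at exact powers of two is not shown in this commit and may differ.

```go
package main

import (
	"fmt"
	"math/bits"
)

// nextPowerOf2 is a local stand-in (an assumption, see the note above) that
// returns the smallest power of two greater than or equal to x.
func nextPowerOf2(x int64) int64 {
	if x <= 1 {
		return 1
	}
	return int64(1) << uint(bits.Len64(uint64(x-1)))
}

// capBatchSize mirrors the branch structure of the diff:
//   - no batch size configured: read everything in one batch;
//   - otherwise: use the configured size, but never more than the
//     power-of-two ceiling of the actual row count.
func capBatchSize(configured, nrows int64) int64 {
	if configured <= 0 {
		return nrows
	}
	limit := nextPowerOf2(nrows)
	if configured < limit {
		return configured
	}
	return limit
}

func main() {
	fmt.Println(capBatchSize(131072, 10)) // large BatchSize, tiny file
	fmt.Println(capBatchSize(0, 10))      // BatchSize unset
	fmt.Println(capBatchSize(8, 10))      // BatchSize already below the cap
}
```

With these inputs the three calls yield 16, 10, and 8 respectively: the cap only takes effect when the configured batch size exceeds what the selected row groups can actually supply.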
