[PR] perf(parquet/pqarrow): cap RecordReader batch size to actual row count [arrow-go]

via GitHub Mon, 18 May 2026 02:25:58 -0700


paveon opened a new pull request, #817:
URL: https://github.com/apache/arrow-go/pull/817


   ### Rationale for this change
   `GetRecordReader` passes `BatchSize` directly to the internal `recordReader`
   without capping it to the actual number of rows. When `BatchSize` is configured
   to a large value (e.g. 131072) but the file or requested row groups contain
   few rows (e.g. 10), `leafReader.LoadBatch` calls `Reserve(131072)` which
   pre-allocates definition/repetition level buffers and value buffers sized for
   the full batch. For a 200-column int64 table with 10 rows this wastes ~250 MB
   of allocations. 
   
   ### What changes are included in this PR?
   Cap `batchSize` to `NextPowerOf2(nrows)` when a `BatchSize` is explicitly
   configured. The power-of-2 rounding keeps allocations aligned with the
   downstream `updateCapacity` logic that already rounds to powers of two,
   avoiding a redundant reallocation on the first read.
   
   ### Are these changes tested?
   Existing tests pass. The change is on the allocation-sizing path only —
   read correctness is unaffected since `LoadBatch` already stops reading
   when rows are exhausted.
   
   ### Are there any user-facing changes?
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
