junyan-ling opened a new issue, #688: URL: https://github.com/apache/arrow-go/issues/688
## Problem

When reading Parquet files with large binary columns, `BinaryBuilder`'s internal data buffer starts empty and grows by **repeated doubling** as values are appended. For large binary payloads this causes `O(log n)` realloc+copy cycles per batch, wasting both time and memory.

This was confirmed by profiling a production workload (`107 row groups × 484 rows/RG × ~105 MB uncompressed/RG`):

- `bufferBuilder.resize` + `memory.Buffer.resize`: **23% of total CPU time**
- `GoAllocator.Allocate` (doubled buffers): **45% of heap** — old and new buffers live simultaneously during resize, inflating peak memory to 2–3× the actual data size
- `runtime.memmove` + `runtime.memclrNoHeapPointers`: another **~40% of CPU** spent copying and zeroing intermediate buffers
- `runtime.gcDrain`: **16% of CPU** — GC pressure from short-lived allocations

## Proposed Fix

Thread `TotalUncompressedSize` and `NumRows` from the column chunk metadata through `columnIterator.NextChunk()` into `leafReader`. At the start of every `LoadBatch`, call `BinaryBuilder.ReserveData()` with a proportional estimate:

```
reserveBytes = TotalUncompressedSize × min(batchSize, numRows) / numRows
```

This pre-allocates the data buffer once per batch before any values are appended, eliminating all intermediate realloc+copy cycles.
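The estimate above can be sketched as a small Go helper. `reserveEstimate` is an illustrative name, not an identifier from the patch; it clamps `batchSize` to the row-group row count (treating `batchSize=0` as "read everything") and scales the uncompressed byte count proportionally:

```go
package main

import "fmt"

// reserveEstimate mirrors the proposed formula: scale the row group's
// uncompressed byte count by the fraction of its rows the next batch will
// consume. Both batchSize == 0 (unbounded reads such as ReadTable) and
// batchSize >= numRows clamp to the full row-group size.
// The function name and layout are illustrative, not taken from the patch.
func reserveEstimate(totalUncompressedSize, numRows, batchSize int64) int64 {
	if numRows <= 0 {
		return 0
	}
	rows := batchSize
	if rows == 0 || rows > numRows {
		rows = numRows
	}
	return totalUncompressedSize * rows / numRows
}

func main() {
	// A ~105 MB row group of 484 rows, read in batches of 121 rows
	// (one quarter of the row group): each batch reserves ~26 MiB up front.
	fmt.Println(reserveEstimate(110_100_480, 484, 121)) // 27525120
}
```

Because the reserve happens once per `LoadBatch` rather than once per row group, later sub-batches of the same row group are covered by the stored metadata as well, which is what the `batchQuarterRG` benchmark below exercises.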
## Files Changed

- `parquet/file/record_reader.go` — add `ReserveData(int64)` to the `BinaryRecordReader` interface; implement on `byteArrayRecordReader`, no-op on `flbaRecordReader`
- `parquet/pqarrow/file_reader.go` — extend `columnIterator.NextChunk()` to return `TotalUncompressedSize` and `NumRows` from the column chunk metadata
- `parquet/pqarrow/column_readers.go` — store per-RG metadata in `leafReader`; call `reserveBinaryData` at the start of every `LoadBatch`

## Edge Cases Handled

| Scenario | Behavior |
|---|---|
| `batchSize >= rowGroupSize` | Clamps to full RG size — pre-allocates the entire RG in one shot |
| `batchSize < rowGroupSize` | Proportional estimate — each batch gets its fair share |
| `batchSize=0` (unbounded `ReadTable`) | Uses full RG size — correct for single-pass reads |
| Multiple batches per row group | Each batch pre-allocated independently using stored RG metadata |
| Mid-batch row group transition | New RG metadata stored on transition; next batch uses the new estimate |
| Dictionary-encoded columns | `ReserveData` is a no-op — dict builder unaffected |
| Fixed-length columns (FLBA) | `ReserveData` is a no-op — no variable-length buffer |
| Null values | `TotalUncompressedSize` excludes nulls — estimate naturally correct |

## Benchmark Results

Environment: Apple M1 Max · Zstd compression · 5% nulls · `count=3` medians
Schema: single binary column, 2 row groups × 484 rows/RG (matching the production structure)

| Sub-benchmark | Metric | Before | After | Delta |
|---|---|---:|---:|---:|
| `large/batchAll` | ns/op | 9,490,360 | 8,683,674 | -8.5% |
| | B/op | 142,322 | 113,081 | -20.5% |
| | allocs/op | 389 | 379 | -2.6% |
| `large/batchPerRG` | ns/op | 9,583,598 | 8,467,172 | -11.6% |
| | B/op | 142,324 | 113,081 | -20.5% |
| | allocs/op | 389 | 378 | -2.8% |
| `large/batchQuarterRG` | ns/op | 9,335,036 | 8,212,938 | -12.0% |
| | B/op | 142,323 | 113,080 | -20.5% |
| | allocs/op | 389 | 378 | -2.8% |
| `small/batchAll` | ns/op | 475,396 | 409,558 | -13.8% |
| | B/op | 2,646 | 2,272 | -14.1% |
| | allocs/op | 436 | 427 | -2.1% |
| `small/batchPerRG` | ns/op | 473,959 | 400,276 | -15.5% |
| | B/op | 2,646 | 2,272 | -14.1% |
| | allocs/op | 436 | 427 | -2.1% |
| `small/batchQuarterRG` | ns/op | 466,216 | 432,741 | -7.2% |
| | B/op | 2,646 | 2,272 | -14.1% |
| | allocs/op | 436 | 427 | -2.1% |

**~20% reduction in `B/op`** across all large cases — the doubled buffers are eliminated.
**8–16% reduction in `ns/op`** on M1 Max with data in memory. Production gains are larger due to GC pressure under S3 read concurrency.
**`batchQuarterRG` matches `batchAll`** — confirms all sub-batches within a row group are pre-allocated, not just the first.
**`small/*` shows no regression** — no overhead for small values.

## Run Benchmarks

```
go test ./parquet/pqarrow/ -run='^$' -bench=BenchmarkReadBinaryColumn -benchmem
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
