junyan-ling opened a new issue, #688:
URL: https://github.com/apache/arrow-go/issues/688

   ## Problem
   
   When reading Parquet files with large binary columns, `BinaryBuilder`'s 
internal data buffer starts empty and grows by **repeated doubling** as values 
are appended. For large binary payloads this causes `O(log n)` realloc+copy 
cycles per batch, wasting both time and memory.
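   The growth pattern can be illustrated with a standalone sketch (not arrow-go code; the constants simply mirror the profiled workload): a buffer that starts empty and doubles on overflow reallocates O(log n) times, and the cumulative bytes copied across those reallocs exceed the payload itself.
   
   ```go
   package main
   
   import "fmt"
   
   func main() {
       const total = 105 << 20 // ~105 MB of appended bytes, as in the profiled row groups
       capBytes, size, reallocs, copied := 0, 0, 0, 0
       for size < total {
           if size == capBytes {
               if capBytes == 0 {
                   capBytes = 1024
               } else {
                   capBytes *= 2 // doubling realloc: old and new buffers live simultaneously
               }
               reallocs++
               copied += size // every realloc copies all data written so far
           }
           size = capBytes
       }
       fmt.Println(reallocs, copied)
   }
   ```
   
   For ~105 MB this sketch performs 18 reallocations and copies ~128 MB of intermediate data, which matches the memmove/memclr cost seen in the profile.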
   
   This was confirmed by profiling a production workload (`107 row groups × 484 
rows/RG × ~105 MB uncompressed/RG`):
   
   - `bufferBuilder.resize` + `memory.Buffer.resize`: **23% of total CPU time**
   - `GoAllocator.Allocate` (doubled buffers): **45% of heap** — old and new 
buffers live simultaneously during resize, inflating peak memory to 2–3× the 
actual data size
   - `runtime.memmove` + `runtime.memclrNoHeapPointers`: another **~40% of 
CPU** spent copying and zeroing intermediate buffers
   - `runtime.gcDrain`: **16% of CPU** — GC pressure from short-lived 
allocations
   
   ## Proposed Fix
   
   Thread `TotalUncompressedSize` and `NumRows` from the column chunk metadata 
through `columnIterator.NextChunk()` into `leafReader`. At the start of every 
`LoadBatch`, call `BinaryBuilder.ReserveData()` with a proportional estimate:
   
   ```
   reserveBytes = TotalUncompressedSize × min(batchSize, numRows) / numRows
   ```
   
   This pre-allocates the data buffer once per batch before any values are 
appended, eliminating all intermediate realloc+copy cycles.
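   A minimal sketch of the estimate, including the clamping behavior described below (the function name is illustrative, not the actual arrow-go API):
   
   ```go
   package main
   
   import "fmt"
   
   // estimateReserveBytes computes the proportional pre-allocation for one batch.
   // batchSize == 0 means an unbounded read (e.g. ReadTable): reserve the whole
   // row group in one shot. batchSize > numRows clamps to the full row group.
   func estimateReserveBytes(totalUncompressedSize, numRows, batchSize int64) int64 {
       if numRows <= 0 {
           return 0
       }
       rows := batchSize
       if rows == 0 || rows > numRows {
           rows = numRows
       }
       return totalUncompressedSize * rows / numRows
   }
   
   func main() {
       // 105 MB row group, 484 rows, reading 121-row batches (a quarter of the RG).
       fmt.Println(estimateReserveBytes(105<<20, 484, 121))
   }
   ```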
   
   ## Files Changed
   
   - `parquet/file/record_reader.go` — add `ReserveData(int64)` to 
`BinaryRecordReader` interface; implement on `byteArrayRecordReader` and no-op 
on `flbaRecordReader`
   - `parquet/pqarrow/file_reader.go` — extend `columnIterator.NextChunk()` to 
return `TotalUncompressedSize` and `NumRows` from column chunk metadata
   - `parquet/pqarrow/column_readers.go` — store per-RG metadata in 
`leafReader`; call `reserveBinaryData` at the start of every `LoadBatch`
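   The interface split can be sketched as follows (hypothetical simplified types, not the real `record_reader.go` definitions; only the forwarding/no-op distinction is the point):
   
   ```go
   package main
   
   import "fmt"
   
   // BinaryRecordReader gains a ReserveData hook for variable-length data.
   type BinaryRecordReader interface {
       ReserveData(nbytes int64)
   }
   
   // byteArrayRecordReader forwards the reservation to its binary builder
   // (tracked here as a plain counter for illustration).
   type byteArrayRecordReader struct{ reserved int64 }
   
   func (r *byteArrayRecordReader) ReserveData(nbytes int64) { r.reserved += nbytes }
   
   // flbaRecordReader holds fixed-length values with no variable-length
   // data buffer, so ReserveData is a no-op.
   type flbaRecordReader struct{}
   
   func (*flbaRecordReader) ReserveData(int64) {}
   
   func main() {
       ba := &byteArrayRecordReader{}
       fl := &flbaRecordReader{}
       for _, r := range []BinaryRecordReader{ba, fl} {
           r.ReserveData(1 << 20)
       }
       fmt.Println(ba.reserved) // only the byte-array reader acts on the reservation
   }
   ```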
   
   ## Edge Cases Handled
   
   | Scenario | Behavior |
   |---|---|
   | `batchSize >= rowGroupSize` | Clamps to full RG size — pre-allocates 
entire RG in one shot |
   | `batchSize < rowGroupSize` | Proportional estimate — each batch gets its 
fair share |
   | `batchSize=0` (unbounded `ReadTable`) | Uses full RG size — correct for 
single-pass reads |
   | Multiple batches per row group | Each batch pre-allocated independently 
using stored RG metadata |
   | Mid-batch row group transition | New RG metadata stored on transition, 
next batch uses new estimate |
   | Dictionary-encoded columns | `ReserveData` is no-op — dict builder 
unaffected |
   | Fixed-length columns (FLBA) | `ReserveData` is no-op — no variable-length 
buffer |
   | Null values | `TotalUncompressedSize` excludes nulls — estimate naturally 
correct |
   
   ## Benchmark Results
   
   Environment: Apple M1 Max · Zstd compression · 5% nulls · `count=3` medians
   Schema: single binary column, 2 row groups × 484 rows/RG (matching 
production structure)
   
   | Sub-benchmark | Metric | Before | After | Delta |
   |---|---|---:|---:|---:|
   | `large/batchAll` | ns/op | 9,490,360 | 8,683,674 | -8.5% |
   | | B/op | 142,322 | 113,081 | -20.5% |
   | | allocs/op | 389 | 379 | -2.6% |
   | `large/batchPerRG` | ns/op | 9,583,598 | 8,467,172 | -11.6% |
   | | B/op | 142,324 | 113,081 | -20.5% |
   | | allocs/op | 389 | 378 | -2.8% |
   | `large/batchQuarterRG` | ns/op | 9,335,036 | 8,212,938 | -12.0% |
   | | B/op | 142,323 | 113,080 | -20.5% |
   | | allocs/op | 389 | 378 | -2.8% |
   | `small/batchAll` | ns/op | 475,396 | 409,558 | -13.8% |
   | | B/op | 2,646 | 2,272 | -14.1% |
   | | allocs/op | 436 | 427 | -2.1% |
   | `small/batchPerRG` | ns/op | 473,959 | 400,276 | -15.5% |
   | | B/op | 2,646 | 2,272 | -14.1% |
   | | allocs/op | 436 | 427 | -2.1% |
   | `small/batchQuarterRG` | ns/op | 466,216 | 432,741 | -7.2% |
   | | B/op | 2,646 | 2,272 | -14.1% |
   | | allocs/op | 436 | 427 | -2.1% |
   
   - **~20% reduction in `B/op`** across all large cases — doubled buffers eliminated.
   - **8–16% reduction in `ns/op`** on M1 Max with data already in memory; production gains are expected to be larger due to GC pressure under S3 read concurrency.
   - **`batchQuarterRG` matches `batchAll`** — confirms all sub-batches within a row group are pre-allocated, not just the first.
   - **`small/*` shows no regression** — no overhead for small values.
   
   ## Run Benchmarks
   
   ```
   go test ./parquet/pqarrow/ -run='^$' -bench=BenchmarkReadBinaryColumn -benchmem
   ```

