paveon opened a new pull request, #831:
URL: https://github.com/apache/arrow-go/pull/831
### Rationale for this change
`determinePageIndexRangesInRowGroup` called `RowGroupMetaData.ColumnChunk()`
for every column in a row group, constructing a full `ColumnChunkMetaData`
(allocating the struct, copying encoding slices, etc.) just to read two
offset/length pairs: the column index location and the offset index location.
In a compaction workload processing hundreds of small Parquet files, this path
alone accounted for 18.3 GB of allocation churn (36% of total allocations).
Each object was discarded immediately after those fields were read.
### What changes are included in this PR?
Added lightweight `ColumnIndexLocation()` and `OffsetIndexLocation()` methods
on `RowGroupMetaData` that read index locations directly from the underlying
thrift struct with zero heap allocations. Updated
`determinePageIndexRangesInRowGroup` to use these instead of constructing full
`ColumnChunkMetaData` objects.
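
A minimal sketch of the accessor pattern described above, not the PR's actual code. The `thriftColumnChunk` and `rowGroupMeta` types, the `locationOf` helper, and the pointer-typed optional fields are hypothetical stand-ins for the thrift-generated metadata; the `IndexLocation` field layout is also an assumption. Only the method names and the `(IndexLocation, bool)` return shape come from this PR.

```go
package main

import "fmt"

// IndexLocation mirrors the return type named in the PR. The field layout
// (int64 offset, int32 length) is an assumption made for this sketch.
type IndexLocation struct {
	Offset int64
	Length int32
}

// thriftColumnChunk is a hypothetical stand-in for the thrift-generated
// column chunk struct. Optional fields are modeled as pointers.
type thriftColumnChunk struct {
	ColumnIndexOffset *int64
	ColumnIndexLength *int32
	OffsetIndexOffset *int64
	OffsetIndexLength *int32
}

// rowGroupMeta is a hypothetical stand-in for RowGroupMetaData.
type rowGroupMeta struct {
	columns []*thriftColumnChunk
}

// locationOf returns the location by value, so nothing escapes to the heap.
// The bool is false when either optional field is unset.
func locationOf(offset *int64, length *int32) (IndexLocation, bool) {
	if offset == nil || length == nil {
		return IndexLocation{}, false
	}
	return IndexLocation{Offset: *offset, Length: *length}, true
}

// ColumnIndexLocation reads the column index location for column i
// straight from the underlying struct, without building a wrapper object.
func (rg *rowGroupMeta) ColumnIndexLocation(i int) (IndexLocation, bool) {
	if i < 0 || i >= len(rg.columns) || rg.columns[i] == nil {
		return IndexLocation{}, false
	}
	c := rg.columns[i]
	return locationOf(c.ColumnIndexOffset, c.ColumnIndexLength)
}

// OffsetIndexLocation does the same for the offset index.
func (rg *rowGroupMeta) OffsetIndexLocation(i int) (IndexLocation, bool) {
	if i < 0 || i >= len(rg.columns) || rg.columns[i] == nil {
		return IndexLocation{}, false
	}
	c := rg.columns[i]
	return locationOf(c.OffsetIndexOffset, c.OffsetIndexLength)
}

// sampleRowGroup builds two columns: one with a column index, one without.
func sampleRowGroup() *rowGroupMeta {
	off, length := int64(4096), int32(128)
	return &rowGroupMeta{columns: []*thriftColumnChunk{
		{ColumnIndexOffset: &off, ColumnIndexLength: &length},
		{},
	}}
}

func main() {
	rg := sampleRowGroup()
	for i := 0; i < 3; i++ {
		loc, ok := rg.ColumnIndexLocation(i)
		fmt.Printf("column %d: location=%+v present=%t\n", i, loc, ok)
	}
}
```

Returning the small struct by value together with a presence flag lets the caller skip columns that have no index without allocating anything per column.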
### Are these changes tested?
Yes. Existing `parquet/metadata` and `parquet/file` tests pass. Verified zero
heap allocations on the hot path via escape analysis (`go build
-gcflags='-m -m'`). Confirmed with heap profiles that `NewColumnChunkMetaData`
allocations through this path dropped from 18.3 GB to 106 MB (99.4%
reduction).
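
As a complement to escape analysis and heap profiles, allocation counts can also be checked at runtime with the standard library's `testing.AllocsPerRun`. This is not the method used in this PR. The sketch below contrasts a wrapper-building accessor with a by-value accessor; all types and functions in it (`column`, `chunkMeta`, `heavy`, `light`, `measure`) are hypothetical and only illustrate the measurement.

```go
package main

import (
	"fmt"
	"testing"
)

// IndexLocation is an assumed layout for an offset/length pair.
type IndexLocation struct {
	Offset int64
	Length int32
}

// column is a hypothetical stand-in for the raw per-column metadata.
type column struct {
	encodings []int32
	loc       IndexLocation
}

// chunkMeta is a hypothetical stand-in for a full wrapper object.
type chunkMeta struct {
	Encodings []int32
	Loc       IndexLocation
}

// heavy builds a full wrapper and copies the encodings slice.
// noinline keeps the compiler from optimizing the allocations away.
//
//go:noinline
func heavy(c *column) *chunkMeta {
	m := &chunkMeta{Encodings: make([]int32, len(c.encodings)), Loc: c.loc}
	copy(m.Encodings, c.encodings)
	return m
}

// light returns the location by value without building a wrapper.
func light(c *column) IndexLocation {
	return c.loc
}

// sink keeps results live so the calls are not eliminated.
var sink IndexLocation

// measure reports the average heap allocations per call for each accessor.
func measure() (heavyAllocs, lightAllocs float64) {
	c := &column{
		encodings: []int32{0, 3, 8},
		loc:       IndexLocation{Offset: 4096, Length: 128},
	}
	heavyAllocs = testing.AllocsPerRun(1000, func() { sink = heavy(c).Loc })
	lightAllocs = testing.AllocsPerRun(1000, func() { sink = light(c) })
	return heavyAllocs, lightAllocs
}

func main() {
	h, l := measure()
	fmt.Printf("wrapper accessor: %.0f allocs/op\n", h)
	fmt.Printf("by-value accessor: %.0f allocs/op\n", l)
}
```

A check like this can be wrapped in a regular unit test so that a later change reintroducing allocations on the hot path fails CI.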
### Are there any user-facing changes?
Two new public methods on `RowGroupMetaData`: `ColumnIndexLocation(i int)
(IndexLocation, bool)` and `OffsetIndexLocation(i int) (IndexLocation, bool)`.
No breaking changes.