paveon opened a new pull request, #831:
URL: https://github.com/apache/arrow-go/pull/831

### Rationale for this change

`determinePageIndexRangesInRowGroup` called `RowGroupMetaData.ColumnChunk()` for every column in a row group, constructing a full `ColumnChunkMetaData` (allocating the struct, copying encoding slices, etc.) just to read two offset/length pairs. In a compaction workload processing hundreds of small Parquet files, this path alone accounted for 18.3 GB of allocation churn (36% of total allocations). The objects were immediately discarded after reading two int fields.
   
### What changes are included in this PR?

Added lightweight `ColumnIndexLocation()` and `OffsetIndexLocation()` methods on `RowGroupMetaData` that read index locations directly from the underlying thrift struct with zero heap allocations. Updated `determinePageIndexRangesInRowGroup` to use these instead of constructing full `ColumnChunkMetaData` objects.
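
As an illustration of the accessor pattern described above, here is a minimal, self-contained sketch. It is not the code from this PR: `columnChunk`, `rowGroup`, the field layout of `IndexLocation`, and the `ptr` helper are hypothetical stand-ins, and the sketch assumes the thrift struct exposes the optional index fields as pointers. Only the two method signatures are taken from the description.

```go
package main

import "fmt"

// columnChunk is a hypothetical stand-in for the generated thrift struct.
// Optional fields are modelled as pointers (an assumption of this sketch).
type columnChunk struct {
	ColumnIndexOffset *int64
	ColumnIndexLength *int32
	OffsetIndexOffset *int64
	OffsetIndexLength *int32
}

// IndexLocation is assumed here to be a plain offset/length pair.
type IndexLocation struct {
	Offset int64
	Length int32
}

// rowGroup is a hypothetical stand-in for RowGroupMetaData.
type rowGroup struct {
	columns []*columnChunk
}

// ColumnIndexLocation returns the column index location of column i by value.
// The bool is false when i is out of range or the index is absent.
func (rg *rowGroup) ColumnIndexLocation(i int) (IndexLocation, bool) {
	if i < 0 || i >= len(rg.columns) {
		return IndexLocation{}, false
	}
	cc := rg.columns[i]
	if cc == nil || cc.ColumnIndexOffset == nil || cc.ColumnIndexLength == nil {
		return IndexLocation{}, false
	}
	return IndexLocation{Offset: *cc.ColumnIndexOffset, Length: *cc.ColumnIndexLength}, true
}

// OffsetIndexLocation is the same lookup for the offset index.
func (rg *rowGroup) OffsetIndexLocation(i int) (IndexLocation, bool) {
	if i < 0 || i >= len(rg.columns) {
		return IndexLocation{}, false
	}
	cc := rg.columns[i]
	if cc == nil || cc.OffsetIndexOffset == nil || cc.OffsetIndexLength == nil {
		return IndexLocation{}, false
	}
	return IndexLocation{Offset: *cc.OffsetIndexOffset, Length: *cc.OffsetIndexLength}, true
}

// ptr is a small helper for building the example data.
func ptr[T any](v T) *T { return &v }

func main() {
	rg := &rowGroup{columns: []*columnChunk{
		{
			ColumnIndexOffset: ptr(int64(4096)), ColumnIndexLength: ptr(int32(128)),
			OffsetIndexOffset: ptr(int64(4224)), OffsetIndexLength: ptr(int32(64)),
		},
		{}, // a column written without a page index
	}}
	for i := range rg.columns {
		ci, hasCI := rg.ColumnIndexLocation(i)
		oi, hasOI := rg.OffsetIndexLocation(i)
		fmt.Println(i, ci, hasCI, oi, hasOI)
	}
}
```

Returning the pair by value together with a presence flag keeps the result off the heap in this sketch, since nothing is boxed and no slices are copied.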
   
### Are these changes tested?

Yes. Existing `parquet/metadata` and `parquet/file` tests pass. Verified zero heap allocations on the hot path via escape analysis (`go build -gcflags='-m -m'`). Confirmed with heap profiles that `NewColumnChunkMetaData` allocations through this path dropped from 18.3 GB to 106 MB (99.4% reduction).
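
A complementary way to check an allocation claim like this one, not the method used in this PR (which relied on escape analysis and heap profiles), is the standard library's `testing.AllocsPerRun`. The sketch below is self-contained and uses hypothetical stand-in types (`columnChunk`, `chunkMeta`, `newChunkMeta`, `columnIndexLocation`) rather than the real arrow-go ones. It contrasts a by-value lookup with a constructor that copies a slice.

```go
package main

import (
	"fmt"
	"testing"
)

// Hypothetical stand-ins for the thrift struct and the full metadata wrapper.
type columnChunk struct {
	Encodings         []int32
	ColumnIndexOffset *int64
	ColumnIndexLength *int32
}

type IndexLocation struct {
	Offset int64
	Length int32
}

type chunkMeta struct {
	encodings []int32
	loc       IndexLocation
}

// newChunkMeta mimics a "full" constructor: it allocates the wrapper
// and copies the encodings slice.
func newChunkMeta(cc *columnChunk) *chunkMeta {
	return &chunkMeta{
		encodings: append([]int32(nil), cc.Encodings...),
		loc:       IndexLocation{Offset: *cc.ColumnIndexOffset, Length: *cc.ColumnIndexLength},
	}
}

// columnIndexLocation mimics the lightweight path: read two fields, return by value.
func columnIndexLocation(cc *columnChunk) (IndexLocation, bool) {
	if cc == nil || cc.ColumnIndexOffset == nil || cc.ColumnIndexLength == nil {
		return IndexLocation{}, false
	}
	return IndexLocation{Offset: *cc.ColumnIndexOffset, Length: *cc.ColumnIndexLength}, true
}

// Package-level sinks keep the compiler from discarding the results.
var (
	sinkMeta *chunkMeta
	sinkLoc  IndexLocation
	sinkOK   bool
)

func sampleChunk() *columnChunk {
	off, length := int64(4096), int32(128)
	return &columnChunk{
		Encodings:         []int32{0, 3, 8},
		ColumnIndexOffset: &off,
		ColumnIndexLength: &length,
	}
}

// heavyAllocs reports average allocations per call of the full constructor.
func heavyAllocs() float64 {
	cc := sampleChunk()
	return testing.AllocsPerRun(1000, func() { sinkMeta = newChunkMeta(cc) })
}

// lightAllocs reports average allocations per call of the by-value lookup.
func lightAllocs() float64 {
	cc := sampleChunk()
	return testing.AllocsPerRun(1000, func() { sinkLoc, sinkOK = columnIndexLocation(cc) })
}

func main() {
	fmt.Println("full constructor allocs/op:", heavyAllocs())
	fmt.Println("lightweight lookup allocs/op:", lightAllocs())
}
```

In a real test suite the same `testing.AllocsPerRun` call would normally sit inside a `_test.go` function and fail the test when the result is non-zero.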
   
### Are there any user-facing changes?

Two new public methods on `RowGroupMetaData`: `ColumnIndexLocation(i int) (IndexLocation, bool)` and `OffsetIndexLocation(i int) (IndexLocation, bool)`. No breaking changes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.