[I] perf(manifest): dataFile.initializeMapData eagerly allocates all column-stats maps on first accessor call [iceberg-go]

via GitHub Sat, 30 May 2026 22:42:38 -0700


paveon opened a new issue, #1142:
URL: https://github.com/apache/iceberg-go/issues/1142


   ### Feature Request / Improvement
   
   ### Description:
   When profiling workloads that read manifest entries (compaction, orphan 
cleanup), map allocations from `dataFile.initializeMapData()` accounted for 
~45% of allocated memory in our use case.
   The root cause is that `initializeMapData()` uses a single `sync.Once` to 
convert all 7 column-stats maps plus the partition map on the first call to 
any accessor, even when the caller only needs `Partition()`.
   
   Most code paths outside of table scanning only need partition data, but 
currently pay the full cost of allocating and populating maps for 
`ColumnSizes`, `ValueCounts`, `NullValueCounts`, `NaNValueCounts`, 
`DistinctValueCounts`, `LowerBoundValues`, and `UpperBoundValues`.
   
   Additionally, the underlying `avroColMapToMap` helper does not preallocate 
the output map, causing repeated rehashing as entries are inserted.
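
To illustrate the preallocation point, here is a minimal sketch. The `colEntry` type and the `colEntriesToMap` function are hypothetical stand-ins, not the actual iceberg-go definitions; the only thing the sketch relies on is that the number of entries is known before the map is built, so it can be passed as a size hint to `make`.

```go
package main

import "fmt"

// colEntry is a hypothetical stand-in for one key/value pair of a
// column-stats list as decoded from a manifest entry.
type colEntry[K comparable, V any] struct {
	Key   K
	Value V
}

// colEntriesToMap is a hypothetical analogue of avroColMapToMap.
// Passing len(entries) as the size hint lets the runtime allocate enough
// buckets up front instead of growing the map while entries are inserted.
func colEntriesToMap[K comparable, V any](entries []colEntry[K, V]) map[K]V {
	if entries == nil {
		return nil
	}
	out := make(map[K]V, len(entries))
	for _, e := range entries {
		out[e.Key] = e.Value
	}
	return out
}

func main() {
	sizes := []colEntry[int, int64]{{Key: 1, Value: 128}, {Key: 2, Value: 256}}
	fmt.Println(colEntriesToMap(sizes))
}
```

The size hint does not change the resulting map; it only affects how much work is done while filling it.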
   
   ### Proposed fix:
     1. Split the single `sync.Once` into two — one for partition data, one for 
column stats — so each accessor only triggers the initialization it needs.
     2. Preallocate the map in `avroColMapToMap` using the known slice length.
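
A rough sketch of what item 1 could look like follows. The struct layout, field names, and the `fieldValue`/`colSize` helper types are assumptions made for illustration and are simplified from whatever the real `dataFile` contains; only one stats map is shown, and the `statsBuilt` flag exists purely to make the lazy behavior observable in the demo.

```go
package main

import (
	"fmt"
	"sync"
)

// fieldValue and colSize are hypothetical raw (decoded) representations.
type fieldValue struct {
	Name  string
	Value any
}

type colSize struct {
	ID   int
	Size int64
}

// dataFile is a simplified, hypothetical stand-in for the real struct.
type dataFile struct {
	rawPartition []fieldValue
	rawColSizes  []colSize

	partitionOnce sync.Once // guards partition only
	statsOnce     sync.Once // guards column-stats maps only

	partition map[string]any
	colSizes  map[int]int64

	statsBuilt bool // demo-only flag, read from a single goroutine here
}

// Partition converts only the partition data on first use.
func (d *dataFile) Partition() map[string]any {
	d.partitionOnce.Do(func() {
		d.partition = make(map[string]any, len(d.rawPartition))
		for _, f := range d.rawPartition {
			d.partition[f.Name] = f.Value
		}
	})
	return d.partition
}

// ColumnSizes converts the column stats on first use, independently
// of whether Partition has been called.
func (d *dataFile) ColumnSizes() map[int]int64 {
	d.statsOnce.Do(func() {
		d.colSizes = make(map[int]int64, len(d.rawColSizes))
		for _, c := range d.rawColSizes {
			d.colSizes[c.ID] = c.Size
		}
		d.statsBuilt = true
	})
	return d.colSizes
}

func main() {
	d := &dataFile{
		rawPartition: []fieldValue{{Name: "day", Value: "2026-05-30"}},
		rawColSizes:  []colSize{{ID: 1, Size: 128}},
	}
	fmt.Println(d.Partition()["day"], d.statsBuilt)
	fmt.Println(d.ColumnSizes()[1], d.statsBuilt)
}
```

With this shape, a caller that only ever reads `Partition()` never pays for the stats maps, while callers that need stats still get one-time, goroutine-safe initialization.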


