Hey everyone, I'm starting a thread to connect folks interested in improving the existing way of collecting column-level statistics (often referred to as *metrics* in the code). I've already started a proposal, which can be found at https://s.apache.org/iceberg-column-stats.
*Motivation* Column statistics are currently stored as a mapping of field id to values across multiple columns (lower/upper bounds, value/nan/null counts, sizes). This storage model has critical limitations as the number of columns increases and as new types are being added to Iceberg: - Inefficient Storage due to map-based structure: - Large memory overhead during planning/processing - Inability to project specific stats (e.g., only null_value_counts for column X) - Type Erasure: Original logical/physical types are lost when stored as binary blobs, causing: - Lossy type inference during reads - Schema evolution challenges (e.g., widening types) - Rigid Schema: Stats are tied to the data_fil entry record, limiting extensibility for new stats. *Goals* Improve the column stats representation to allow for the following: - Projectability: Enable independent access to specific stats (e.g., lower_bounds without loading upper_bounds). - Type Preservation: Store original data types to support accurate reads and schema evolution. - Flexible/Extensible Representation: Allow per-field stats structures (e.g., complex types like Geo/Variant). Thanks Eduard