Hey everyone,

I'm starting a thread to connect folks interested in improving the existing
way of collecting column-level statistics (often referred to as *metrics*
in the code). I've already started a proposal, which can be found at
https://s.apache.org/iceberg-column-stats.

*Motivation*

Column statistics are currently stored as a mapping of field id to values
across multiple columns (lower/upper bounds, value/nan/null counts, sizes).
This storage model has critical limitations as the number of columns grows
and as new types are added to Iceberg (a short sketch of the current access
pattern follows the list below):

   - Inefficient Storage due to the map-based structure:
      - Large memory overhead during planning/processing
      - Inability to project specific stats (e.g., only null_value_counts for
        column X)
   - Type Erasure: Original logical/physical types are lost when stored as
     binary blobs, causing:
      - Lossy type inference during reads
      - Schema evolution challenges (e.g., widening types)
   - Rigid Schema: Stats are tied to the data_file entry record, limiting
     extensibility for new stats.

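To make the first two limitations concrete, here is roughly what reading
stats looks like through the Java API today (a minimal sketch; it assumes a
DataFile obtained from a manifest scan and the table Schema for decoding
bounds):

import java.nio.ByteBuffer;
import java.util.Map;

import org.apache.iceberg.DataFile;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Conversions;
import org.apache.iceberg.types.Type;

class CurrentStatsAccess {
  // Sketch only: 'file' comes from a manifest scan, 'schema' is the table schema.
  static void printLowerBound(DataFile file, Schema schema, int fieldId) {
    // Each stat is a flat map keyed by field id, so reading one stat for one
    // column still materializes the map entries for all columns.
    Map<Integer, ByteBuffer> lowerBounds = file.lowerBounds();
    Map<Integer, Long> nullCounts = file.nullValueCounts();

    // Bounds are opaque binary values; the caller must supply a type from the
    // current schema to decode them, which is where the original type is lost.
    Type type = schema.findType(fieldId);
    Object lower = Conversions.fromByteBuffer(type, lowerBounds.get(fieldId));

    System.out.println("field " + fieldId + ": lower=" + lower
        + ", nulls=" + nullCounts.get(fieldId));
  }
}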

*Goals*

Improve the column stats representation to allow for the following:

   - Projectability: Enable independent access to specific stats (e.g.,
     lower_bounds without loading upper_bounds).
   - Type Preservation: Store original data types to support accurate reads
     and schema evolution.
   - Flexible/Extensible Representation: Allow per-field stats structures
     (e.g., complex types like Geo/Variant).


Thanks
Eduard
