Hi everyone, I would like to propose a table-level-only, Spark-focused follow-up to the existing Spark table statistics work.
Today, Iceberg's Spark integration can compute table-level NDV statistics for Iceberg tables. Iceberg stores those Theta sketches in Puffin-format table statistics files and later exposes the NDV value through `SparkScan#estimateStatistics` when Spark CBO and Iceberg column-stat reporting are enabled. That gives Spark CBO distinct counts, but not histograms, even though Spark's `ColumnStatistics` interface has a `histogram()` field. The proposal is: - Keep the existing Theta NDV path unchanged. - Add table-level quantile sketches for selected numeric columns in `compute_table_stats`. - Store those sketches in Puffin-format table statistics files for the target snapshot. - Teach `SparkScan#estimateStatistics` to read and deserialize the sketch payload and expose an approximate equi-height Spark histogram. There has already been related work and discussion around partition-level statistics, as well as PR 8202, which proposed a KLL Puffin blob type. I see this proposal as separate from those efforts: a Spark reference path that shows how table-level quantile sketches could be written and consumed end to end. I think this should start as Spark-only and table-level only. Partition-level stats are important, but Spark's current CBO integration consumes table-level column statistics, so this gives us an implementation that Spark can actually use first. Initial scope would be intentionally narrow: numeric columns, fixed histogram bin count, no Hive `ColumnStatisticsObj`, no partition-level sketch storage, and no new optimizer behavior beyond Spark's existing `ColumnStatistics` path. Does this seem like a reasonable table-level first step? Thanks, Tamas
