Hi everyone,

I would like to propose a table-level-only, Spark-focused follow-up to the
existing Spark table statistics work.

Today, Iceberg's Spark integration can compute table-level NDV statistics
for Iceberg tables. Iceberg stores those Theta sketches in Puffin-format
table statistics files and later exposes the NDV value through
`SparkScan#estimateStatistics` when Spark CBO and Iceberg column-stat
reporting are enabled. That gives Spark CBO distinct counts, but not
histograms, even though Spark's `ColumnStatistics` interface has a
`histogram()` field.

The proposal is:

- Keep the existing Theta NDV path unchanged.
- Add table-level quantile sketches for selected numeric columns in
`compute_table_stats`.
- Store those sketches in Puffin-format table statistics files for the
target snapshot.
- Teach `SparkScan#estimateStatistics` to read and deserialize the sketch
payload and expose an approximate equi-height Spark histogram.

There has already been related work and discussion around partition-level
statistics, as well as PR 8202, which proposed a KLL Puffin blob type. I
see this proposal as separate from those efforts: a Spark reference path
that shows how table-level quantile sketches could be written and consumed
end to end.

I think this should start as Spark-only and table-level only.
Partition-level stats are important, but Spark's current CBO integration
consumes table-level column statistics, so this gives us an implementation
that Spark can actually use first.

Initial scope would be intentionally narrow: numeric columns, fixed
histogram bin count, no Hive `ColumnStatisticsObj`, no partition-level
sketch storage, and no new optimizer behavior beyond Spark's existing
`ColumnStatistics` path.

Does this seem like a reasonable table-level first step?

Thanks,
Tamas

Reply via email to