Hi all,
I'm working on a feature for Apache Polaris to persist Iceberg metrics
(ScanReport and CommitReport) to a database for observability and analytics
purposes.
While implementing this, I noticed an asymmetry between the two report types:
• CommitReport includes detailed row-level metrics via
CommitMetricsResult: addedRecords, deletedRecords, addedPositionalDeletes,
addedEqualityDeletes, etc.
• ScanReport captures planning metrics via ScanMetricsResult (result
data/delete files, scanned/skipped data manifests, etc.) but doesn't include
the actual number of rows returned by the scan.
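To make the asymmetry concrete, here is a minimal sketch of a reporter that aggregates per-table counters from the two report types. Note the ScanReport and CommitReport records below are simplified stand-ins I wrote for illustration, not the real classes from org.apache.iceberg.metrics (which expose these values through CommitMetricsResult / ScanMetricsResult counters):

```java
import java.util.HashMap;
import java.util.Map;

public class MetricsSink {
    interface MetricsReport {}

    // Stand-in: the real CommitReport carries row-level detail
    // (addedRecords, deletedRecords, positional/equality deletes, ...).
    record CommitReport(String table, long addedRecords, long deletedRecords)
            implements MetricsReport {}

    // Stand-in: the real ScanReport carries planning metrics only --
    // files and manifests touched, but no count of rows actually read.
    record ScanReport(String table, long resultDataFiles, long skippedDataManifests)
            implements MetricsReport {}

    private final Map<String, Long> rowsWrittenByTable = new HashMap<>();
    private final Map<String, Long> filesScannedByTable = new HashMap<>();

    // Mirrors the shape of MetricsReporter.report(MetricsReport):
    // branch on the concrete report type and aggregate what is available.
    public void report(MetricsReport report) {
        if (report instanceof CommitReport c) {
            // Commits give us rows directly.
            rowsWrittenByTable.merge(c.table(), c.addedRecords(), Long::sum);
        } else if (report instanceof ScanReport s) {
            // Scans only tell us which files were planned, not rows returned.
            filesScannedByTable.merge(s.table(), s.resultDataFiles(), Long::sum);
        }
    }

    public long rowsWritten(String table) {
        return rowsWrittenByTable.getOrDefault(table, 0L);
    }

    public long filesScanned(String table) {
        return filesScannedByTable.getOrDefault(table, 0L);
    }
}
```

So on the write path we can answer "how many rows?", but on the read path the best available proxy is file/manifest counts.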
I understand that ScanReport is generated during the planning phase, so the
actual row count isn't known at that point. However, for observability use cases
(query analytics, capacity planning, detecting expensive scans), knowing how
many rows were actually read would be valuable.
A few questions for the community:
1. Is there a technical reason why row counts weren't included in
ScanReport? Is it because the metrics reporter doesn't have visibility into
the execution phase?
2. Would there be interest in extending the metrics API to support
post-execution scan metrics that include row counts? This could potentially be:
• A new ScanCompletionReport sent after execution
• An optional field in ScanReport that could be populated if available
• A callback mechanism for engines to report execution-time metrics
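Of those options, the optional-field variant seems the least invasive. Purely as a sketch (the type and field names below are hypothetical, not an existing Iceberg API):

```java
import java.util.OptionalLong;

// Hypothetical shape: a scan report with an optional post-execution row count.
// An engine that knows the execution result could populate rowsRead;
// a planner-only producer would simply leave it empty, keeping the
// planning-time behavior unchanged.
record ScanReportSketch(String table, long resultDataFiles, OptionalLong rowsRead) {

    boolean hasExecutionMetrics() {
        return rowsRead.isPresent();
    }
}
```

A consumer could then persist the row count when present and fall back to file-level proxies otherwise, without a second report type on the wire.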
3. How do others currently track rows-read metrics for Iceberg tables? Are
there engine-specific solutions (Spark metrics, Trino query stats, etc.) that
people rely on instead?
For context, our use case is building a centralized metrics store where
teams can analyze table access patterns, identify hot tables, and understand
read/write workloads across their data lake.
Any insights or suggestions would be appreciated!
Thanks,
-
Anand K Sankaran
Workday Data Cloud