Thanks for the feedback, Yufei

From my perspective, Polaris should compute the aggregations it needs (like
file size distributions) and serve those metrics in a read-only fashion to
the outside world.  I don't think clients should be able to push values to
Polaris.  Could you elaborate on what you mean by "a new metric write
endpoint"?

Related to current vs. historical data, I would not consider serving
historical data for now.  It could definitely become a use case we want to
support in the future.  But given we are at the beginning of this
initiative, I would defer it to clients for the time being.

Think of the Prometheus Node Exporter: it only serves point-in-time data, and
it is up to the metric consumer (the Prometheus server and its configuration)
to define how often metrics are collected and to deal with storage, retention
policy, etc.  This keeps the overhead of the Node Exporter low, and I think we
could start with a similar approach for Polaris.
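
As a strawman of what that could look like on the Polaris side, here is a
minimal sketch of a read-only, point-in-time endpoint.  The path, the
resource class and the TableMetricsCalculator helper are all hypothetical
(I am only assuming a JAX-RS / jakarta.ws.rs stack); the point is the shape:
compute on demand, return, keep nothing:

import java.util.Map;
import jakarta.ws.rs.GET;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.PathParam;
import jakarta.ws.rs.Produces;
import jakarta.ws.rs.core.MediaType;

@Path("/api/metrics/v1/tables/{namespace}/{table}")
public class TableMetricsResource {

  // Hypothetical helper that loads the table and derives current values
  // (file counts, size distribution, ...) at request time.
  public interface TableMetricsCalculator {
    Map<String, Object> computeCurrent(String namespace, String table);
  }

  private final TableMetricsCalculator calculator;

  public TableMetricsResource(TableMetricsCalculator calculator) {
    this.calculator = calculator;
  }

  @GET
  @Produces(MediaType.APPLICATION_JSON)
  public Map<String, Object> currentMetrics(
      @PathParam("namespace") String namespace,
      @PathParam("table") String table) {
    // Point-in-time only: no storage, no retention inside Polaris.
    return calculator.computeCurrent(namespace, table);
  }
}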

--

Pierre


On Sat, Sep 6, 2025 at 1:46 AM Yufei Gu <[email protected]> wrote:

> Thanks, Pierre, for the proposal. I’m excited about the potential of
> serving these metrics via Polaris. They would be highly valuable for
> multiple use cases, including UI integration, TMS (deciding when and how to
> compact a table), monitoring, cost awareness (through table size trending),
> and query performance debugging (e.g., identifying data skewness).
>
> That said, the approach to collecting metrics isn’t entirely
> straightforward at this point. I see a few possible scenarios:
>
>    1. Many metrics can be derived from the table’s metadata.json (e.g.,
>    number of snapshots, rows, files, etc.). These can be served as
>    point-in-time values without requiring additional collection or
>    persistence—unless we need historical tracking (see the sketch below).
>    2. Some metrics, like partition statistics or file size distributions,
>    would require asynchronous collection pipelines to process and store
> them
>    efficiently.
>    3. Certain metrics can naturally flow through existing metrics
>    endpoints, such as scan reports or commit reports.
>
> Please note that only option 2 may require a new metric write endpoint.
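>
> To illustrate option 1, a minimal sketch (class and metric names are purely
> illustrative), assuming the Iceberg Java API and the standard snapshot
> summary keys:
>
> import java.util.HashMap;
> import java.util.Map;
> import org.apache.iceberg.Snapshot;
> import org.apache.iceberg.Table;
>
> class MetadataDerivedMetrics {
>
>   static Map<String, Long> fromCurrentSnapshot(Table table) {
>     Map<String, Long> metrics = new HashMap<>();
>
>     long snapshotCount = 0;
>     for (Snapshot ignored : table.snapshots()) {
>       snapshotCount++;
>     }
>     metrics.put("snapshot-count", snapshotCount);
>
>     Snapshot current = table.currentSnapshot();
>     if (current != null) {
>       // Standard summary keys; whether they are present depends on the writer.
>       Map<String, String> summary = current.summary();
>       metrics.put("total-data-files",
>           Long.parseLong(summary.getOrDefault("total-data-files", "0")));
>       metrics.put("total-records",
>           Long.parseLong(summary.getOrDefault("total-records", "0")));
>     }
>     return metrics;
>   }
> }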
>
> On current vs. historical data:
>
>    1. Serving point-in-time data is simple and, in some cases, doesn’t even
>    require persistence.
>    2. Serving historical data with aggregation, however, introduces more
>    complexity. For this, existing time-series databases are usually better
>    suited, as they are optimized for retention and rollups.
>
>
> I think we could start small by introducing a metrics read endpoint, and
> then figure out a persistence strategy as we go, for example, leveraging
> time-series databases to better serve historical data.
>
>
> Yufei
>
>
> On Fri, Sep 5, 2025 at 1:16 PM Pierre Laporte <[email protected]>
> wrote:
>
> > I was definitely not aware of that endpoint, so thanks a lot for bringing
> > that up !  I am glad there is appetite for even more metrics :-)
> >
> > One thing that I was trying to be mindful of is the extra load that the
> > MetaStore will have to handle.  Typically, assuming ~10 metrics per table,
> > this could already become quite substantial for large Data Lakes.  And
> > given that some requests imply having a couple of metrics per partition,
> > that scales up even more.
> >
> > That being said, I am definitely in favor of recording the metrics sent
> by
> > Iceberg clients in a database.  If this can add more value to Polaris, by
> > all means I am in.
> >
> > I wonder how we could best anticipate the volume this could result in.
> > For example, considering that metrics have a different lifecycle than
> > table metadata, they should probably not be cached in the EntityCache at
> > all.  Otherwise, they could easily thrash the EntityCache.  And we might
> > also want to store them in a separate table, or even database (?),
> > depending on other constraints (e.g. leveraging a TTL for automatic
> > cleanup).
> >
> > --
> >
> > Pierre
> >
> >
> > On Fri, Sep 5, 2025 at 8:05 PM Prashant Singh <[email protected]>
> > wrote:
> >
> > > Hey Pierre,
> > > Thank you for taking a look at my recommendation.  I think there are
> > > additional benefits to these Iceberg metrics: for example, with
> > > ScanMetrics we literally get the expression that was applied to the
> > > query, which can help us figure out which subset of data is actively
> > > queried and hence run compaction on it.
> > > People already build their telemetry and triggers based on these
> > > reports, since this is something Iceberg natively provides.
> > >
> > > That being said, I am not against the idea of collecting telemetry (I
> > > think we would require an auxiliary compute for doing this, though), but
> > > I wanted to highlight something very obvious that Polaris might be
> > > ignoring while introducing a new one, as I didn't find a reference to it
> > > in the proposal !
> > > Side note: catalogs such as Apache Gravitino already support this [PR
> > > <https://github.com/apache/gravitino/pull/1164/files>]
> > >
> > > >  cannot find anything in the community Slack about people requesting
> > > Polaris to support Iceberg Metrics, since we are on the Free plan
> > >
> > > Unfortunately I don't have access to the message either, but the context
> > > was a Polaris user asking why Polaris isn't persisting the report which
> > > is sent to `/report` and how they can get that report.  I suggested they
> > > write their own custom metric reporter which, rather than hitting the
> > > /report endpoint of Polaris, just dumps the data to a DB that their
> > > downstream maintenance services can use.
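> > >
> > > For illustration, a rough sketch of such a reporter (the class name and
> > > the storage call are hypothetical, the interface is Iceberg's
> > > MetricsReporter): instead of POSTing to the catalog, it hands each
> > > report to the user's own store.
> > >
> > > import org.apache.iceberg.metrics.CommitReport;
> > > import org.apache.iceberg.metrics.MetricsReport;
> > > import org.apache.iceberg.metrics.MetricsReporter;
> > > import org.apache.iceberg.metrics.ScanReport;
> > >
> > > public class DbMetricsReporter implements MetricsReporter {
> > >
> > >   @Override
> > >   public void report(MetricsReport report) {
> > >     if (report instanceof ScanReport) {
> > >       ScanReport scan = (ScanReport) report;
> > >       store("scan_reports", scan.tableName(), scan.toString());
> > >     } else if (report instanceof CommitReport) {
> > >       CommitReport commit = (CommitReport) report;
> > >       store("commit_reports", commit.tableName(), commit.toString());
> > >     }
> > >   }
> > >
> > >   // Hypothetical persistence hook; JDBC, a queue, anything works here.
> > >   private void store(String target, String tableName, String payload) {
> > >     // left out of this sketch
> > >   }
> > > }
> > >
> > > Clients would then point the metrics-reporter-impl catalog property (if
> > > I remember the name correctly) at that class.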
> > >
> > > Looking forward to discussing this more !
> > >
> > > Best,
> > > Prashant Singh
> > >
> > >
> > > On Fri, Sep 5, 2025 at 5:03 AM Pierre Laporte <[email protected]>
> > > wrote:
> > >
> > > > Thanks for the feedback, Prashant
> > > >
> > > > As far as I can tell, we could use the Iceberg Metrics Reporting for
> > > only 3
> > > > operational metrics:
> > > > * Total number of files in a table (using the CommitReport)
> > > > * Total number of reads (the number of ScanReport)
> > > > * Total number of writes (the number of CommitReport)
> > > >
> > > > I don't think the other operational metrics could be computed from
> the
> > > > Iceberg Metrics.  So we would still need to rely on the Events API.
> > And
> > > I
> > > > am wondering whether we should really have two triggers to compute
> > > metrics,
> > > > considering that with the Events API, we would be able to cover all
> > > > documented cases.
> > > >
> > > > That being said, I suspect that there could be other operational
> > metrics
> > > > that are missing from the design document.  Typically metrics that
> > would
> > > > require the use of the Iceberg Metrics Reporting.  Problem: I cannot
> > find
> > > > anything in the community Slack about people requesting Polaris to
> > > support
> > > > Iceberg Metrics, since we are on the Free plan.  Do you happen to
> > > remember
> > > > what was discussed?
> > > >
> > > > --
> > > >
> > > > Pierre
> > > >
> > > >
> > > > On Thu, Sep 4, 2025 at 6:27 PM Prashant Singh
> > > > <[email protected]> wrote:
> > > >
> > > > > Thank you for the proposal Pierre !
> > > > > I think having metrics on the entities in Polaris is really helpful
> > > > > for telemetry as well as for making decisions on when and what
> > > > > partitions to run compactions on.
> > > > > Iceberg already emits metrics from the client end to the REST server
> > > > > via RESTMetricsReporter
> > > > > <https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/rest/RESTMetricsReporter.java#L60>
> > > > > and things like ScanMetrics
> > > > > <https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/metrics/ScanMetrics.java>
> > > > > /
> > > > > CommitMetrics
> > > > > <https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/metrics/CommitMetrics.java>
> > > > > are already available.  But at this point we don't persist them and
> > > > > hence they are lost.  There has been a request for this in the
> > > > > Polaris Slack too !
> > > > > My recommendation would be to start from here !
> > > > >
> > > > > Best,
> > > > > Prashant Singh
> > > > >
> > > > > On Thu, Sep 4, 2025 at 8:41 AM Pierre Laporte <
> [email protected]
> > >
> > > > > wrote:
> > > > >
> > > > > > Hi folks,
> > > > > >
> > > > > > I would like to propose the addition of a component to Polaris
> that
> > > > would
> > > > > > build and maintain operational metrics for the Data Lake tables
> and
> > > > > views.
> > > > > > The main idea is that, if those metrics can be shared across
> > multiple
> > > > > Table
> > > > > > Management Services and/or other external services, then it would
> > > make
> > > > > > sense to have those metrics served by Polaris.
> > > > > >
> > > > > > I believe this feature would not only add value to Polaris but also
> > > > > > further advance it as a central point in the Data Lake.
> > > > > >
> > > > > > The detailed proposal document is here:
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://docs.google.com/document/d/1yHvLwqNVD3Z84KYcc_m3c4M8bMijTXg9iP1CR0JXxCc
> > > > > >
> > > > > > Please let me know if you have any feedback or comment !
> > > > > >
> > > > > > Thanks
> > > > > > --
> > > > > >
> > > > > > Pierre
> > > > > >
> > > > >
> > > >
> > >
> >
>
