Re: [PROPOSAL] Add Data Lake operational metrics to Polaris

Pierre Laporte Fri, 03 Oct 2025 01:38:49 -0700

Thanks for your feedback, folks.

@Yufei there seems to be a misunderstanding.  As we discussed in this
thread and during previous community calls, the goal of the second proposal
is to start small, and build our way up.  It is not about having a perfect
design document before starting implementation.  Has this changed?


The previous proposal included the possibility to store metrics in the
Metastore.  And there were some concerns about whether we should store just
the latest metric values or keep track of historical data, which database
to choose, which retention policies to define, etc.  All very good
points!  And I think those points deserve a separate discussion.

So the current proposal abstracts that behind an SPI.  In other words, the
current proposal defines the necessary parts that will allow us to plug a
database and data model in a second phase.  The goal is to build consensus,
move forward with the bits we already agree on, and continue iterating
while the implementation is in progress.

> This proposal seems to address both. For category (2), we’ll need a clear
> design for how it integrates with an external service. That should cover
> aspects such as workload life cycle management (triggering, state control,
> etc.).


This has to be a separate design document.  The first proposal mentioned an
integration with the Async Tasks framework, eventually.  AFAICT, the
Delegation Service proposal is not merged either.

IMHO, we should not add a dependency between this proposal and other
efforts that are not implemented yet, as it would prevent us from moving
forward on operational metrics until all the pieces are in place.


> That said, I think it would be reasonable to narrow the initial
> scope to category (1). Could we clarify that in the proposal?
>

This is not what the current document says, though.  The updated proposal
defines the backbone that enables Polaris to store and serve metrics.


> On persistence, I believe the most critical part of the design is the
> schema. Once the schema is defined, the SPI details could be derived
> relatively easily. One important factor we shouldn’t overlook is the type
> of database we want to leverage: a time series database (TSDB) or a
> general-purpose OLTP database. This choice will heavily influence schema
> design. For instance, a TSDB schema must include a timestamp, metric name,
> and dimensions, while an OLTP schema is more generic and flexible. The
> choice also affects the SPI design. For example, aggregations, rollups, and
> sliding windows are first-class operations in TSDB, while joins are
> supported better in OLTP. Could we add this consideration to the design
> document?
>

I want to ask, is this really a blocker for the current proposal?  Or can
the current proposal be implemented while we iron out the metrics
persistence details?

I have to ask because you raise very good questions:
* Should we decide now whether the metrics database should be a TSDB or an
OLTP database?
* Should Polaris bundle it in its distribution?
* Or should it include a connector to said database, in which case users
have to provide Polaris with connection parameters?
* The answer to ^ determines whether the retention policy is Polaris'
responsibility

The SPI should not be designed after a single database.  It should be
designed to support the operational metrics service features.  And it
should be abstract enough that it can be extended later to use different
databases.
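
As a rough illustration of what "designed to support the service features"
could mean, here is a minimal Java sketch.  Every name in it (MetricPoint,
MetricsStore, InMemoryMetricsStore) is hypothetical and not taken from the
proposal.  It assumes the service only needs to write a metric, read the
latest value, and read a time range.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical value type: one metric observation for one table.
final class MetricPoint {
  final String table;
  final String name;
  final double value;
  final long timestampMillis;

  MetricPoint(String table, String name, double value, long timestampMillis) {
    this.table = table;
    this.name = name;
    this.value = value;
    this.timestampMillis = timestampMillis;
  }
}

// Hypothetical SPI: it states what the metrics service needs,
// not how a particular database stores it.
interface MetricsStore {
  void write(MetricPoint point);

  Optional<MetricPoint> latest(String table, String name);

  // Returns points with fromMillis <= timestamp < toMillis.
  List<MetricPoint> range(String table, String name, long fromMillis, long toMillis);
}

// Trivial in-memory backend, for illustration only.
final class InMemoryMetricsStore implements MetricsStore {
  private final Map<String, List<MetricPoint>> series = new HashMap<>();

  private static String key(String table, String name) {
    return table + "\u0000" + name;
  }

  @Override
  public void write(MetricPoint point) {
    series.computeIfAbsent(key(point.table, point.name), k -> new ArrayList<>()).add(point);
  }

  @Override
  public Optional<MetricPoint> latest(String table, String name) {
    MetricPoint newest = null;
    for (MetricPoint p : series.getOrDefault(key(table, name), Collections.emptyList())) {
      if (newest == null || p.timestampMillis > newest.timestampMillis) {
        newest = p;
      }
    }
    return Optional.ofNullable(newest);
  }

  @Override
  public List<MetricPoint> range(String table, String name, long fromMillis, long toMillis) {
    List<MetricPoint> out = new ArrayList<>();
    for (MetricPoint p : series.getOrDefault(key(table, name), Collections.emptyList())) {
      if (p.timestampMillis >= fromMillis && p.timestampMillis < toMillis) {
        out.add(p);
      }
    }
    return out;
  }
}
```

A TSDB-backed or OLTP-backed implementation would sit behind the same
interface, which is why the choice of database does not have to be made
before the SPI exists.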

> - Polaris controls which metrics are calculated and when (benefitting from
> event listeners).
> - Polaris delegates computation to external engines if needed (SPI or
> API?).


@Oleg that is also a very interesting point.  To me, those two points would
make Polaris reimplement a message queue, with external engines requesting
Polaris to trigger metric computation after a certain threshold has
been reached (e.g. every x commits on a table or after n minutes at most,
...).  And I wonder whether Polaris' Events could instead be forwarded to
an external message queue that handles this aggregate/dispatch work well.
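
To give an idea of the bookkeeping such a trigger implies, here is a small
Java sketch of the "every x commits or after n minutes" rule.  The class
and method names are hypothetical, and it makes a simplifying assumption:
the time limit is only checked when a commit arrives, whereas a real
"at most n minutes" guarantee would also need a timer.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical trigger policy: fire once maxCommits commits have accumulated
// on a table, or on the first commit seen after maxAgeMillis has elapsed
// since the last firing (or since the first commit, if it never fired).
final class CommitThresholdTrigger {
  private final int maxCommits;
  private final long maxAgeMillis;
  private final Map<String, Integer> pendingCommits = new HashMap<>();
  private final Map<String, Long> lastFiredAt = new HashMap<>();

  CommitThresholdTrigger(int maxCommits, long maxAgeMillis) {
    this.maxCommits = maxCommits;
    this.maxAgeMillis = maxAgeMillis;
  }

  // Records one commit and returns true if metric computation should run now.
  boolean onCommit(String table, long nowMillis) {
    int pending = pendingCommits.merge(table, 1, Integer::sum);
    long last = lastFiredAt.computeIfAbsent(table, t -> nowMillis);
    boolean due = pending >= maxCommits || nowMillis - last >= maxAgeMillis;
    if (due) {
      pendingCommits.put(table, 0);
      lastFiredAt.put(table, nowMillis);
    }
    return due;
  }
}
```

Per-table counters, timestamps, and (in a real system) durable state and
timers are the kind of work a message queue already does.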

But as you can guess, it is going to be quite a discussion.  Because
depending on how this discussion goes, either the Polaris Events interface
would need to be updated to support external message queues, or Polaris
would need to define a triggering system that enables external systems to
define their own triggers.



My main point is that those topics are independent of the parts we already
have consensus on: the metrics REST API definition, the RBAC integration,
...  I.e. start small and iterate.

Wdyt?
