Hello folks

Thanks for your feedback.  I have published a new proposal that I think
will address most of the points raised.  It is available at
https://docs.google.com/document/d/1oFsuI_WKY0QqVqBNS4gtLlDdlZiV9fGmmUG3rLSED4Y/edit?tab=t.0#heading=h.1duembdpfkwi

Notable changes compared to the first proposal:

   1. This proposal describes a collaborative approach where metrics are
   computed exclusively by external services and then pushed to Polaris
   (see the sketch after this list)
   2. Polaris stores and serves those metrics, but it does not compute any.
   3. As discussed during the community call, this approach means that even
   trivial metrics (e.g. metrics that can be derived from the table’s
   metadata.json) are not computed by Polaris.
   4. A reference implementation for certain metric computations may be
   included for illustrative/demo purposes only.  It is not packaged with
   the Polaris runtime.  It may also not be suited for large tables.
   5. This proposal does not cover Iceberg commit and scan reports
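
To illustrate the flow in item 1, here is a rough sketch of how an
external service might push a metric it has computed.  The endpoint path,
payload shape, and authentication shown below are placeholders for
illustration only; the actual API is defined in the proposal document.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class MetricsPushSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical metric computed by an external service, for
            // example derived from the table's metadata.json.
            String payload = """
                {"metrics": [{"name": "total-data-files",
                              "value": 1342,
                              "computed-at": "2025-09-16T10:00:00Z"}]}
                """;

            // Placeholder URI, headers, and schema: the real endpoint and
            // payload format are specified in the proposal, not here.
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("https://polaris.example.com/v1/my-catalog/"
                            + "namespaces/db/tables/events/metrics"))
                    .header("Content-Type", "application/json")
                    .header("Authorization", "Bearer <token>")
                    .POST(HttpRequest.BodyPublishers.ofString(payload))
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println("Polaris responded with HTTP " + response.statusCode());
        }
    }

The point is only that the computation happens outside Polaris; Polaris
persists and serves whatever the external service sends.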

--

Pierre


On Mon, Sep 15, 2025 at 8:58 PM Pierre Laporte <[email protected]>
wrote:

>
>
> On Mon, Sep 15, 2025 at 8:17 PM Yufei Gu <[email protected]> wrote:
>
>> > From my perspective, Polaris should compute the aggregations it needs
>> > (like file size distributions)
>>
>> That’d be a pretty big perf hit on Polaris itself, unless you mean Polaris
>> in the broader sense, including peripheral services like TMS.
>
>
> I am not sure I understand what you mean.  If Polaris does not compute
> metrics and instead stores and serves arbitrary numbers provided by
> external services, then the "operational metrics" part of Polaris is
> nothing more than a wrapper around a database.  It does not add any
> value over that database and, in fact, removes value from it, given
> that it will not expose all of the database's configuration parameters.
>
> Regarding the overhead, you are correct that this computation will be done
> by Polaris.  As a result, the resource usage will likely increase.  See the
> sections "threading model" and "deployment options" of the "Design" tab for
> measures that would prevent this overhead from impacting other Catalog
> workloads.
>
>> > I don't think clients should be able to push values to Polaris.  Could
>> > you elaborate on what you mean by "a new metric write endpoint"?
>>
>> It's in your design doc. Any REST client can update metrics with this
>> design. I'm with you that we shouldn't do that now; that's also not the
>> only option.
>>
>> Endpoint: /v1/{prefix}/namespaces/{namespace}/tables/{table}
>> Method: POST
>> Summary: Request all metrics of the given table to be updated.
>>
>
> This endpoint is there for external services that need fresh data.  Note
> that it does not allow those external services to push data to Polaris.
>
>
>> > I would not consider serving historical data for now.
>>
>> There’s no way to handle only point-in-time data unless all the metrics
>> we’re talking about come directly from the metadata.json file, such as
>> snapshot summaries.
>> Any async or additional metrics collection requires a time dimension;
>> e.g., the snapshot timestamp is essential to indicate the scope of
>> snapshot/commit metrics.
>
>
> How could we serve historical data (i.e. "the value of this metric was
> [...] as of two days ago") if we cannot compute point-in-time data (i.e.
> "the value of this metric is currently [...]") ?
>
>
