Hi Mike,

If you have around a hundred realms or more, imho you are already in
trouble.

The most problematic metrics are those in the
http_server_requests_seconds_* family, e.g.:

http_server_requests_seconds_count{application="Polaris",method="GET",outcome="CLIENT_ERROR",realm_id="POLARIS",status="404",uri="NOT_FOUND",} 1.0

Since metrics in this family already have a high cardinality potential
given the number of tags they carry by default, adding one more dimension
multiplies the number of time series by the number of realms.
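
To make the multiplication concrete, here is a rough sketch in Java. The
per-tag value counts are made-up assumptions, not numbers measured on a
Polaris server, and SeriesEstimate is a hypothetical helper, not part of
Polaris or Micrometer. The product is only an upper bound, since not every
combination of tag values actually occurs.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SeriesEstimate {

    // Upper bound on the number of distinct time series for one meter:
    // the product of the number of distinct values of each tag.
    static long seriesCount(Map<String, Integer> distinctValuesPerTag) {
        long total = 1;
        for (int n : distinctValuesPerTag.values()) {
            total = Math.multiplyExact(total, (long) n);
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical counts of distinct values per tag (assumptions).
        Map<String, Integer> tags = new LinkedHashMap<>();
        tags.put("method", 5);
        tags.put("outcome", 5);
        tags.put("status", 10);
        tags.put("uri", 50);
        System.out.println("without realm_id: " + seriesCount(tags));

        // Adding realm_id multiplies the bound by the number of realms.
        for (int realms : new int[] {10, 100, 1000}) {
            tags.put("realm_id", realms);
            System.out.println(realms + " realms: " + seriesCount(tags));
        }
    }
}
```

Each series also carries its own timer state in memory, so the heap cost
grows along with the count.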

It's very easy to demonstrate that by running a small test [1]. On my
machine, the first 2 iterations (10 and 100 realms) complete, but the 3rd
iteration (1000 realms) runs for about 1 minute and then fails with
java.lang.OutOfMemoryError: Java heap space.

That's why I advocated for removing the tag. If however you really want to
keep it, I'd suggest introducing a configuration flag to disable it in two
problematic metric families: the HTTP one shown above, and the per-endpoint
metrics as well.

Thanks,

Alex

[1]: https://gist.github.com/adutra/414fe773e8727304b34e9249299c988d



On Wed, May 21, 2025 at 7:35 AM Michael Collado <collado.m...@gmail.com>
wrote:

> Hmm, we do use the realm tag in our metric publishing. I understand the
> concern re: cardinality. Maybe we can support filtering metrics that have
> realm and support another metric without realm?
>
> On Mon, May 19, 2025 at 12:24 PM Dmitri Bourlatchkov <di...@apache.org>
> wrote:
>
> > Removing realm_id from metrics tags makes sense to me (to avoid high
> > cardinality).
> >
> > If we need to have insight into load differences from realm to realm, it
> > might be preferable to introduce metrics dedicated to that rather than
> > increasing the cardinality of every endpoint metric.
> >
> > Cheers,
> > Dmitri.
> >
> > On Thu, May 15, 2025 at 3:30 PM Alex Dutra <alex.du...@dremio.com.invalid>
> > wrote:
> >
> > > Hi all,
> > >
> > > I would like to suggest removing the "realm_id" metric tag entirely.
> > >
> > > My concern is that this tag has the potential for high cardinality,
> > > which is generally considered a bad practice when dealing with metrics.
> > > High cardinality can lead to performance issues and increased memory
> > > usage.
> > >
> > > Granted, the default realm resolver in Polaris is tailored for just a
> > > handful of realms, but nothing prevents users from declaring hundreds
> > > of realms.
> > >
> > > I believe we can still effectively monitor Polaris servers without this
> > > specific tag, since the realm ID is also propagated in traces emitted
> > > by Polaris. Tracing is a much better fit for high-cardinality domains.
> > >
> > > I'm open to discussing this further; a potential alternative would be
> > > to introduce a flag to disable this specific metric tag, but I feel like
> > > removing it would be a much cleaner approach.
> > >
> > > Let me know your thoughts.
> > >
> > > Thanks,
> > >
> > > Alex
> > >
> >
>
