Hi Mike, If you have around a hundred realms or more, imho you are already in trouble.
The most problematic metric is the http_server_requests_seconds_* metrics family, e.g.: http_server_requests_seconds_count{application="Polaris",method="GET",outcome="CLIENT_ERROR",realm_id="POLARIS",status="404",uri="NOT_FOUND",} 1.0 Since metrics in this family already have a high cardinality potential given the number of tags it supports by default, adding one more dimension to them makes things (exponentially) worse. It's very easy to demonstrate that by running a small test [1]. On my machine, the first 2 iterations (10 and 100 realms) complete, but the 3rd iteration (1000 realms) runs for about 1 minute then ends up in java.lang.OutOfMemoryError: Java heap space. That's why I advocated for removing the tag. If however you really want to keep it, I'd suggest introducing a configuration flag to disable it in two problematic metric families: the HTTP one shown above, and the per-endpoint metrics as well. Thanks, Alex [1]: https://gist.github.com/adutra/414fe773e8727304b34e9249299c988d On Wed, May 21, 2025 at 7:35 AM Michael Collado <collado.m...@gmail.com> wrote: > Hmm, we do use the realm tag in our metric publishing. I understand the > concern re: cardinality. Maybe we can support filtering metrics that have > realm and support another metric without realm? > > On Mon, May 19, 2025 at 12:24 PM Dmitri Bourlatchkov <di...@apache.org> > wrote: > > > Removing realm_id from metrics tags makes sense to me (to avoid high > > cardinality). > > > > If we need to have insight into load differences from realm to realm, it > > might be preferable to introduce metrics dedicated to that rather than > > increasing the cardinality of every endpoint metric. > > > > Cheers, > > Dmitri. > > > > On Thu, May 15, 2025 at 3:30 PM Alex Dutra <alex.du...@dremio.com.invalid > > > > wrote: > > > > > Hi all, > > > > > > I would like to suggest removing the "realm_id" metric tag entirely. > > > > > > My concern is that this tag has the potential for high cardinality, > which > > > is generally considered a bad practice when dealing with metrics. High > > > cardinality can lead to performance issues and increased memory usage. > > > > > > Granted, the default realm resolver in Polaris is tailored for just a > > > handful of realms, but nothing prevents users from declaring hundreds > of > > > realms. > > > > > > I believe we can still effectively monitor Polaris servers without this > > > specific tag, since the realm ID is also propagated in traces emitted > by > > > Polaris. Tracing is a much better fit for high-cardinality domains. > > > > > > I'm open to discussing this further; a potential alternative would be > to > > > introduce a flag to disable this specific metric tag, but I feel like > > > removing it would be a much cleaner approach. > > > > > > Let me know your thoughts. > > > > > > Thanks, > > > > > > Alex > > > > > >