Re: [DISCUSS] Remove realm_id metric tag

Alex Dutra Wed, 21 May 2025 06:40:13 -0700

Also worth noting: http_server_requests_seconds_* metrics are generated
even for failed requests. So it would be very easy for an attacker to forge
HTTP requests containing invalid realms and take the server down.


Alex

On Wed, May 21, 2025 at 3:04 PM Robert Stupp <sn...@snazy.de> wrote:

> OOMs are bad - it's extremely hard (if not impossible) for any user to
> figure out why that happens. Bug reports would just read like "Polaris
> runs into an OOM - no further information available".
>
> IMHO the system should be stable by default, not the other way around.
> Even if there is a way to enable such huge cardinalities, that flag
> should really be hidden and not documented anywhere - unless there's a
> clear disclaimer saying: "Enabling this can lead to an unresponsive
> system and risk the stability of the system. ..."
>
> I propose to investigate all metrics and reduce the cardinality of all
> of those as that can easily become a serious issue.
>
> On 21.05.25 11:42, Alex Dutra wrote:
> > Hi Mike,
> >
> > If you have around a hundred realms or more, imho you are already in
> > trouble.
> >
> > The most problematic metric is the http_server_requests_seconds_* metrics
> > family, e.g.:
> >
> >
> http_server_requests_seconds_count{application="Polaris",method="GET",outcome="CLIENT_ERROR",realm_id="POLARIS",status="404",uri="NOT_FOUND",}
> > 1.0
> >
> > Since metrics in this family already have a high cardinality potential
> > given the number of tags it supports by default, adding one more
> dimension
> > to them makes things (exponentially) worse.
> >
> > It's very easy to demonstrate that by running a small test [1]. On my
> > machine, the first 2 iterations (10 and 100 realms) complete, but the 3rd
> > iteration (1000 realms) runs for about 1 minute then ends up in
> > java.lang.OutOfMemoryError: Java heap space.
> >
> > That's why I advocated for removing the tag. If however you really want
> to
> > keep it, I'd suggest introducing a configuration flag to disable it in
> two
> > problematic metric families: the HTTP one shown above, and the
> per-endpoint
> > metrics as well.
> >
> > Thanks,
> >
> > Alex
> >
> > [1]: https://gist.github.com/adutra/414fe773e8727304b34e9249299c988d
> >
> >
> >
> > On Wed, May 21, 2025 at 7:35 AM Michael Collado <collado.m...@gmail.com>
> > wrote:
> >
> >> Hmm, we do use the realm tag in our metric publishing. I understand the
> >> concern re: cardinality. Maybe we can support filtering metrics that
> have
> >> realm and support another metric without realm?
> >>
> >> On Mon, May 19, 2025 at 12:24 PM Dmitri Bourlatchkov <di...@apache.org>
> >> wrote:
> >>
> >>> Removing realm_id from metrics tags makes sense to me (to avoid high
> >>> cardinality).
> >>>
> >>> If we need to have insight into load differences from realm to realm,
> it
> >>> might be preferable to introduce metrics dedicated to that rather than
> >>> increasing the cardinality of every endpoint metric.
> >>>
> >>> Cheers,
> >>> Dmitri.
> >>>
> >>> On Thu, May 15, 2025 at 3:30 PM Alex Dutra
> <alex.du...@dremio.com.invalid
> >>>
> >>> wrote:
> >>>
> >>>> Hi all,
> >>>>
> >>>> I would like to suggest removing the "realm_id" metric tag entirely.
> >>>>
> >>>> My concern is that this tag has the potential for high cardinality,
> >> which
> >>>> is generally considered a bad practice when dealing with metrics. High
> >>>> cardinality can lead to performance issues and increased memory usage.
> >>>>
> >>>> Granted, the default realm resolver in Polaris is tailored for just a
> >>>> handful of realms, but nothing prevents users from declaring hundreds
> >> of
> >>>> realms.
> >>>>
> >>>> I believe we can still effectively monitor Polaris servers without
> this
> >>>> specific tag, since the realm ID is also propagated in traces emitted
> >> by
> >>>> Polaris. Tracing is a much better fit for high-cardinality domains.
> >>>>
> >>>> I'm open to discussing this further; a potential alternative would be
> >> to
> >>>> introduce a flag to disable this specific metric tag, but I feel like
> >>>> removing it would be a much cleaner approach.
> >>>>
> >>>> Let me know your thoughts.
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Alex
> >>>>
> --
> Robert Stupp
> @snazy
>
>

Re: [DISCUSS] Remove realm_id metric tag

Reply via email to