Hi again,

Let's go back to the main concern: excessive memory required for metrics.

To move this discussion forward, would you all agree to introduce a
flag that activates the inclusion of realm IDs in the metrics?

It seems like this could be a good middle ground that addresses
everyone's needs.

Let me know your thoughts.

Alex


On Wed, May 21, 2025 at 3:57 PM Robert Stupp <sn...@snazy.de> wrote:
>
> Ouch! That sounds very simple to exploit and quite serious.
>
> On 21.05.25 15:38, Alex Dutra wrote:
> > Also worth noting: http_server_requests_seconds_* metrics are generated
> > even for failed requests. So it would be very easy for an attacker to forge
> > HTTP requests containing invalid realms and take the server down.
> >
> > Alex
> >
> > On Wed, May 21, 2025 at 3:04 PM Robert Stupp <sn...@snazy.de> wrote:
> >
> >> OOMs are bad - it's extremely hard (if not impossible) for any user to
> >> figure out why that happens. Bug reports would just read like "Polaris
> >> runs into an OOM - no further information available".
> >>
> >> IMHO the system should be stable by default, not the other way around.
> >> Even if there is a way to enable such huge cardinalities, that flag
> >> should really be hidden and not documented anywhere - unless there's a
> >> clear disclaimer saying: "Enabling this can lead to an unresponsive
> >> system and risk the stability of the system. ..."
> >>
> >> I propose to investigate all metrics and reduce the cardinality of all
> >> of those as that can easily become a serious issue.
> >>
> >> On 21.05.25 11:42, Alex Dutra wrote:
> >>> Hi Mike,
> >>>
> >>> If you have around a hundred realms or more, imho you are already in
> >>> trouble.
> >>>
> >>> The most problematic metric is the http_server_requests_seconds_* metrics
> >>> family, e.g.:
> >>>
> >>>
> >> http_server_requests_seconds_count{application="Polaris",method="GET",outcome="CLIENT_ERROR",realm_id="POLARIS",status="404",uri="NOT_FOUND",}
> >>> 1.0
> >>>
> >>> Since metrics in this family already have a high cardinality potential
> >>> given the number of tags it supports by default, adding one more
> >> dimension
> >>> to them makes things (exponentially) worse.
> >>>
> >>> It's very easy to demonstrate that by running a small test [1]. On my
> >>> machine, the first 2 iterations (10 and 100 realms) complete, but the 3rd
> >>> iteration (1000 realms) runs for about 1 minute then ends up in
> >>> java.lang.OutOfMemoryError: Java heap space.
> >>>
> >>> That's why I advocated for removing the tag. If however you really want
> >> to
> >>> keep it, I'd suggest introducing a configuration flag to disable it in
> >> two
> >>> problematic metric families: the HTTP one shown above, and the
> >> per-endpoint
> >>> metrics as well.
> >>>
> >>> Thanks,
> >>>
> >>> Alex
> >>>
> >>> [1]: https://gist.github.com/adutra/414fe773e8727304b34e9249299c988d
> >>>
> >>>
> >>>
> >>> On Wed, May 21, 2025 at 7:35 AM Michael Collado <collado.m...@gmail.com>
> >>> wrote:
> >>>
> >>>> Hmm, we do use the realm tag in our metric publishing. I understand the
> >>>> concern re: cardinality. Maybe we can support filtering metrics that
> >> have
> >>>> realm and support another metric without realm?
> >>>>
> >>>> On Mon, May 19, 2025 at 12:24 PM Dmitri Bourlatchkov <di...@apache.org>
> >>>> wrote:
> >>>>
> >>>>> Removing realm_id from metrics tags makes sense to me (to avoid high
> >>>>> cardinality).
> >>>>>
> >>>>> If we need to have insight into load differences from realm to realm,
> >> it
> >>>>> might be preferable to introduce metrics dedicated to that rather than
> >>>>> increasing the cardinality of every endpoint metric.
> >>>>>
> >>>>> Cheers,
> >>>>> Dmitri.
> >>>>>
> >>>>> On Thu, May 15, 2025 at 3:30 PM Alex Dutra
> >> <alex.du...@dremio.com.invalid
> >>>>> wrote:
> >>>>>
> >>>>>> Hi all,
> >>>>>>
> >>>>>> I would like to suggest removing the "realm_id" metric tag entirely.
> >>>>>>
> >>>>>> My concern is that this tag has the potential for high cardinality,
> >>>> which
> >>>>>> is generally considered a bad practice when dealing with metrics. High
> >>>>>> cardinality can lead to performance issues and increased memory usage.
> >>>>>>
> >>>>>> Granted, the default realm resolver in Polaris is tailored for just a
> >>>>>> handful of realms, but nothing prevents users from declaring hundreds
> >>>> of
> >>>>>> realms.
> >>>>>>
> >>>>>> I believe we can still effectively monitor Polaris servers without
> >> this
> >>>>>> specific tag, since the realm ID is also propagated in traces emitted
> >>>> by
> >>>>>> Polaris. Tracing is a much better fit for high-cardinality domains.
> >>>>>>
> >>>>>> I'm open to discussing this further; a potential alternative would be
> >>>> to
> >>>>>> introduce a flag to disable this specific metric tag, but I feel like
> >>>>>> removing it would be a much cleaner approach.
> >>>>>>
> >>>>>> Let me know your thoughts.
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>> Alex
> >>>>>>
> >> --
> >> Robert Stupp
> >> @snazy
> >>
> >>
> --
> Robert Stupp
> @snazy
>

Reply via email to