Hi again, Let's go back to the main concern: excessive memory required for metrics.
To move this discussion forward, would you all agree to introduce a flag that activates the inclusion of realm IDs in the metrics? It seems like this could be a good middle ground that addresses everyone's needs. Let me know your thoughts. Alex On Wed, May 21, 2025 at 3:57 PM Robert Stupp <sn...@snazy.de> wrote: > > Ouch! That sounds very simple to exploit and quite serious. > > On 21.05.25 15:38, Alex Dutra wrote: > > Also worth noting: http_server_requests_seconds_* metrics are generated > > even for failed requests. So it would be very easy for an attacker to forge > > HTTP requests containing invalid realms and take the server down. > > > > Alex > > > > On Wed, May 21, 2025 at 3:04 PM Robert Stupp <sn...@snazy.de> wrote: > > > >> OOMs are bad - it's extremely hard (if not impossible) for any user to > >> figure out why that happens. Bug reports would just read like "Polaris > >> runs into an OOM - no further information available". > >> > >> IMHO the system should be stable by default, not the other way around. > >> Even if there is a way to enable such huge cardinalities, that flag > >> should really be hidden and not documented anywhere - unless there's a > >> clear disclaimer saying: "Enabling this can lead to an unresponsive > >> system and risk the stability of the system. ..." > >> > >> I propose to investigate all metrics and reduce the cardinality of all > >> of those as that can easily become a serious issue. > >> > >> On 21.05.25 11:42, Alex Dutra wrote: > >>> Hi Mike, > >>> > >>> If you have around a hundred realms or more, imho you are already in > >>> trouble. > >>> > >>> The most problematic metric is the http_server_requests_seconds_* metrics > >>> family, e.g.: > >>> > >>> > >> http_server_requests_seconds_count{application="Polaris",method="GET",outcome="CLIENT_ERROR",realm_id="POLARIS",status="404",uri="NOT_FOUND",} > >>> 1.0 > >>> > >>> Since metrics in this family already have a high cardinality potential > >>> given the number of tags it supports by default, adding one more > >> dimension > >>> to them makes things (exponentially) worse. > >>> > >>> It's very easy to demonstrate that by running a small test [1]. On my > >>> machine, the first 2 iterations (10 and 100 realms) complete, but the 3rd > >>> iteration (1000 realms) runs for about 1 minute then ends up in > >>> java.lang.OutOfMemoryError: Java heap space. > >>> > >>> That's why I advocated for removing the tag. If however you really want > >> to > >>> keep it, I'd suggest introducing a configuration flag to disable it in > >> two > >>> problematic metric families: the HTTP one shown above, and the > >> per-endpoint > >>> metrics as well. > >>> > >>> Thanks, > >>> > >>> Alex > >>> > >>> [1]: https://gist.github.com/adutra/414fe773e8727304b34e9249299c988d > >>> > >>> > >>> > >>> On Wed, May 21, 2025 at 7:35 AM Michael Collado <collado.m...@gmail.com> > >>> wrote: > >>> > >>>> Hmm, we do use the realm tag in our metric publishing. I understand the > >>>> concern re: cardinality. Maybe we can support filtering metrics that > >> have > >>>> realm and support another metric without realm? > >>>> > >>>> On Mon, May 19, 2025 at 12:24 PM Dmitri Bourlatchkov <di...@apache.org> > >>>> wrote: > >>>> > >>>>> Removing realm_id from metrics tags makes sense to me (to avoid high > >>>>> cardinality). > >>>>> > >>>>> If we need to have insight into load differences from realm to realm, > >> it > >>>>> might be preferable to introduce metrics dedicated to that rather than > >>>>> increasing the cardinality of every endpoint metric. > >>>>> > >>>>> Cheers, > >>>>> Dmitri. > >>>>> > >>>>> On Thu, May 15, 2025 at 3:30 PM Alex Dutra > >> <alex.du...@dremio.com.invalid > >>>>> wrote: > >>>>> > >>>>>> Hi all, > >>>>>> > >>>>>> I would like to suggest removing the "realm_id" metric tag entirely. > >>>>>> > >>>>>> My concern is that this tag has the potential for high cardinality, > >>>> which > >>>>>> is generally considered a bad practice when dealing with metrics. High > >>>>>> cardinality can lead to performance issues and increased memory usage. > >>>>>> > >>>>>> Granted, the default realm resolver in Polaris is tailored for just a > >>>>>> handful of realms, but nothing prevents users from declaring hundreds > >>>> of > >>>>>> realms. > >>>>>> > >>>>>> I believe we can still effectively monitor Polaris servers without > >> this > >>>>>> specific tag, since the realm ID is also propagated in traces emitted > >>>> by > >>>>>> Polaris. Tracing is a much better fit for high-cardinality domains. > >>>>>> > >>>>>> I'm open to discussing this further; a potential alternative would be > >>>> to > >>>>>> introduce a flag to disable this specific metric tag, but I feel like > >>>>>> removing it would be a much cleaner approach. > >>>>>> > >>>>>> Let me know your thoughts. > >>>>>> > >>>>>> Thanks, > >>>>>> > >>>>>> Alex > >>>>>> > >> -- > >> Robert Stupp > >> @snazy > >> > >> > -- > Robert Stupp > @snazy >