Hi Alex Thanks ! (I will take a look at the PR). Then it looks good to me.
Regards JB On Mon, May 26, 2025 at 5:33 PM Alex Dutra <alex.du...@dremio.com.invalid> wrote: > > Hi JB, hi all, > > My PR already has a safety net similar to what you propose: if the > cardinality of realm IDs in HTTP metrics goes above a configurable > threshold (100 by default), a warning is printed and no more HTTP > metrics will be recorded. > > Thanks, > Alex > > On Mon, May 26, 2025 at 5:22 PM Dmitri Bourlatchkov <di...@apache.org> wrote: > > > > If we really need a safety valve, add a boolean flag like > > "metrics.realm-id.enabled", default true. Operators who don’t want the > > dimension can flip it off; everyone else keeps the out-of-the-box > > visibility that keeps multi-tenant ops sane. > > > > > > This is exactly what PR [1662] proposes, right? > > > > The "breaking change" in this PR is disabling realm_id tags by default. > > However, users can still enable them in their specific configuration, if > > they prefer. > > > > [1662] > > https://github.com/apache/polaris/pull/1662/files#diff-bc38eee76420b33ad39904679f65ff2478cc0d4364674fe26a87561dc014b639R56-R57 > > > > Cheers, > > Dmitri. > > > > On Fri, May 23, 2025 at 6:22 PM Yufei Gu <flyrain...@gmail.com> wrote: > > > > > Hey folks, > > > > > > I’m not sold on dropping realm_id. > > > > > > Why it matters > > > > > > 1. It's pretty useful for first-line triage in multi-tenant clusters. > > > When a latency spike or error burst hits the dashboards, the first SRE > > > question is “Which realm(s)?” Having the tag lets us slice the > > > time-series > > > instantly instead of hopping tools or waiting for sampled traces. > > > 2. Cardinality is bounded in practice. Even big deployments seldom > > > break > > > a few dozen active realms per Polaris instance. If someone configures > > > “hundreds,” that’s an intentional choice we should surface and monitor. > > > Some TSDBs scale just fine in that situation. > > > 3. Metrics and traces serve different loops. Traces are for deep-dive > > > forensics once you have a suspect request; metrics drive real-time > > > alerting > > > and capacity planning. Relying only on traces every on-call to see > > > which > > > tenant is noisy. > > > > > > If we really need a safety valve, add a boolean flag like > > > "metrics.realm-id.enabled", default true. Operators who don’t want the > > > dimension can flip it off; everyone else keeps the out-of-the-box > > > visibility that keeps multi-tenant ops sane. > > > Dropping the tag outright trades day-to-day operability for a theoretical > > > scaling worry most users will never hit. Let’s keep it and offer an > > > opt-out > > > instead. > > > > > > Yufei > > > > > > > > > On Fri, May 23, 2025 at 5:49 AM Alex Dutra <alex.du...@dremio.com.invalid> > > > wrote: > > > > > > > Hi, > > > > > > > > For reference the PR is here: > > > https://github.com/apache/polaris/pull/1662 > > > > > > > > Alex > > > > > > > > On Fri, May 23, 2025 at 2:42 PM Robert Stupp <sn...@snazy.de> wrote: > > > > > > > > > > +1 on adding a config and default to false - and later entirely remove > > > > > the "issue". > > > > > > > > > > Sounds like good plan to me! > > > > > > > > > > On 23.05.25 13:52, Alex Dutra wrote: > > > > > > Hi Mike, > > > > > > > > > > > > That sounds like a plan! And yes, I agree with the slow approach. I > > > > > > will open a PR shortly with the new enablement flag. > > > > > > > > > > > > Alex > > > > > > > > > > > > On Fri, May 23, 2025 at 2:42 AM Michael Collado < > > > > collado.m...@gmail.com> wrote: > > > > > >> My immediate reaction is that a configuration flag would be good > > > > enough for > > > > > >> us. Our setup prevents requests without a valid realm, so we're not > > > > subject > > > > > >> to the DDOS attack you mentioned, but we should address that in the > > > > default > > > > > >> realm context resolver as well. I can do some further investigation > > > > to see > > > > > >> if it's something we could get rid of entirely (we also use > > > log-based > > > > > >> metrics), but I like the slow approach of adding a configuration > > > flag, > > > > > >> disabling it by default, then removing functionality as a best > > > > practice, > > > > > >> anyway. > > > > > >> > > > > > >> Mike > > > > > >> > > > > > >> On Thu, May 22, 2025 at 11:39 AM Alex Dutra > > > > <alex.du...@dremio.com.invalid> > > > > > >> wrote: > > > > > >> > > > > > >>> Hi again, > > > > > >>> > > > > > >>> Let's go back to the main concern: excessive memory required for > > > > metrics. > > > > > >>> > > > > > >>> To move this discussion forward, would you all agree to introduce > > > > > >>> a > > > > > >>> flag that activates the inclusion of realm IDs in the metrics? > > > > > >>> > > > > > >>> It seems like this could be a good middle ground that addresses > > > > > >>> everyone's needs. > > > > > >>> > > > > > >>> Let me know your thoughts. > > > > > >>> > > > > > >>> Alex > > > > > >>> > > > > > >>> > > > > > >>> On Wed, May 21, 2025 at 3:57 PM Robert Stupp <sn...@snazy.de> > > > wrote: > > > > > >>>> Ouch! That sounds very simple to exploit and quite serious. > > > > > >>>> > > > > > >>>> On 21.05.25 15:38, Alex Dutra wrote: > > > > > >>>>> Also worth noting: http_server_requests_seconds_* metrics are > > > > generated > > > > > >>>>> even for failed requests. So it would be very easy for an > > > attacker > > > > to > > > > > >>> forge > > > > > >>>>> HTTP requests containing invalid realms and take the server > > > > > >>>>> down. > > > > > >>>>> > > > > > >>>>> Alex > > > > > >>>>> > > > > > >>>>> On Wed, May 21, 2025 at 3:04 PM Robert Stupp <sn...@snazy.de> > > > > wrote: > > > > > >>>>> > > > > > >>>>>> OOMs are bad - it's extremely hard (if not impossible) for any > > > > user to > > > > > >>>>>> figure out why that happens. Bug reports would just read like > > > > "Polaris > > > > > >>>>>> runs into an OOM - no further information available". > > > > > >>>>>> > > > > > >>>>>> IMHO the system should be stable by default, not the other way > > > > around. > > > > > >>>>>> Even if there is a way to enable such huge cardinalities, that > > > > flag > > > > > >>>>>> should really be hidden and not documented anywhere - unless > > > > there's a > > > > > >>>>>> clear disclaimer saying: "Enabling this can lead to an > > > > unresponsive > > > > > >>>>>> system and risk the stability of the system. ..." > > > > > >>>>>> > > > > > >>>>>> I propose to investigate all metrics and reduce the cardinality > > > > of all > > > > > >>>>>> of those as that can easily become a serious issue. > > > > > >>>>>> > > > > > >>>>>> On 21.05.25 11:42, Alex Dutra wrote: > > > > > >>>>>>> Hi Mike, > > > > > >>>>>>> > > > > > >>>>>>> If you have around a hundred realms or more, imho you are > > > > already in > > > > > >>>>>>> trouble. > > > > > >>>>>>> > > > > > >>>>>>> The most problematic metric is the > > > http_server_requests_seconds_* > > > > > >>> metrics > > > > > >>>>>>> family, e.g.: > > > > > >>>>>>> > > > > > >>>>>>> > > > > > >>> > > > > > > > http_server_requests_seconds_count{application="Polaris",method="GET",outcome="CLIENT_ERROR",realm_id="POLARIS",status="404",uri="NOT_FOUND",} > > > > > >>>>>>> 1.0 > > > > > >>>>>>> > > > > > >>>>>>> Since metrics in this family already have a high cardinality > > > > > >>> potential > > > > > >>>>>>> given the number of tags it supports by default, adding one > > > more > > > > > >>>>>> dimension > > > > > >>>>>>> to them makes things (exponentially) worse. > > > > > >>>>>>> > > > > > >>>>>>> It's very easy to demonstrate that by running a small test > > > > > >>>>>>> [1]. > > > > On my > > > > > >>>>>>> machine, the first 2 iterations (10 and 100 realms) complete, > > > but > > > > > >>> the 3rd > > > > > >>>>>>> iteration (1000 realms) runs for about 1 minute then ends up > > > > > >>>>>>> in > > > > > >>>>>>> java.lang.OutOfMemoryError: Java heap space. > > > > > >>>>>>> > > > > > >>>>>>> That's why I advocated for removing the tag. If however you > > > > really > > > > > >>> want > > > > > >>>>>> to > > > > > >>>>>>> keep it, I'd suggest introducing a configuration flag to > > > disable > > > > it > > > > > >>> in > > > > > >>>>>> two > > > > > >>>>>>> problematic metric families: the HTTP one shown above, and the > > > > > >>>>>> per-endpoint > > > > > >>>>>>> metrics as well. > > > > > >>>>>>> > > > > > >>>>>>> Thanks, > > > > > >>>>>>> > > > > > >>>>>>> Alex > > > > > >>>>>>> > > > > > >>>>>>> [1]: > > > > https://gist.github.com/adutra/414fe773e8727304b34e9249299c988d > > > > > >>>>>>> > > > > > >>>>>>> > > > > > >>>>>>> > > > > > >>>>>>> On Wed, May 21, 2025 at 7:35 AM Michael Collado < > > > > > >>> collado.m...@gmail.com> > > > > > >>>>>>> wrote: > > > > > >>>>>>> > > > > > >>>>>>>> Hmm, we do use the realm tag in our metric publishing. I > > > > understand > > > > > >>> the > > > > > >>>>>>>> concern re: cardinality. Maybe we can support filtering > > > metrics > > > > that > > > > > >>>>>> have > > > > > >>>>>>>> realm and support another metric without realm? > > > > > >>>>>>>> > > > > > >>>>>>>> On Mon, May 19, 2025 at 12:24 PM Dmitri Bourlatchkov < > > > > > >>> di...@apache.org> > > > > > >>>>>>>> wrote: > > > > > >>>>>>>> > > > > > >>>>>>>>> Removing realm_id from metrics tags makes sense to me (to > > > avoid > > > > > >>> high > > > > > >>>>>>>>> cardinality). > > > > > >>>>>>>>> > > > > > >>>>>>>>> If we need to have insight into load differences from realm > > > to > > > > > >>> realm, > > > > > >>>>>> it > > > > > >>>>>>>>> might be preferable to introduce metrics dedicated to that > > > > rather > > > > > >>> than > > > > > >>>>>>>>> increasing the cardinality of every endpoint metric. > > > > > >>>>>>>>> > > > > > >>>>>>>>> Cheers, > > > > > >>>>>>>>> Dmitri. > > > > > >>>>>>>>> > > > > > >>>>>>>>> On Thu, May 15, 2025 at 3:30 PM Alex Dutra > > > > > >>>>>> <alex.du...@dremio.com.invalid > > > > > >>>>>>>>> wrote: > > > > > >>>>>>>>> > > > > > >>>>>>>>>> Hi all, > > > > > >>>>>>>>>> > > > > > >>>>>>>>>> I would like to suggest removing the "realm_id" metric tag > > > > > >>> entirely. > > > > > >>>>>>>>>> My concern is that this tag has the potential for high > > > > > >>> cardinality, > > > > > >>>>>>>> which > > > > > >>>>>>>>>> is generally considered a bad practice when dealing with > > > > metrics. > > > > > >>> High > > > > > >>>>>>>>>> cardinality can lead to performance issues and increased > > > > memory > > > > > >>> usage. > > > > > >>>>>>>>>> Granted, the default realm resolver in Polaris is tailored > > > for > > > > > >>> just a > > > > > >>>>>>>>>> handful of realms, but nothing prevents users from > > > > > >>>>>>>>>> declaring > > > > > >>> hundreds > > > > > >>>>>>>> of > > > > > >>>>>>>>>> realms. > > > > > >>>>>>>>>> > > > > > >>>>>>>>>> I believe we can still effectively monitor Polaris servers > > > > without > > > > > >>>>>> this > > > > > >>>>>>>>>> specific tag, since the realm ID is also propagated in > > > traces > > > > > >>> emitted > > > > > >>>>>>>> by > > > > > >>>>>>>>>> Polaris. Tracing is a much better fit for high-cardinality > > > > > >>> domains. > > > > > >>>>>>>>>> I'm open to discussing this further; a potential > > > > > >>>>>>>>>> alternative > > > > > >>> would be > > > > > >>>>>>>> to > > > > > >>>>>>>>>> introduce a flag to disable this specific metric tag, but I > > > > feel > > > > > >>> like > > > > > >>>>>>>>>> removing it would be a much cleaner approach. > > > > > >>>>>>>>>> > > > > > >>>>>>>>>> Let me know your thoughts. > > > > > >>>>>>>>>> > > > > > >>>>>>>>>> Thanks, > > > > > >>>>>>>>>> > > > > > >>>>>>>>>> Alex > > > > > >>>>>>>>>> > > > > > >>>>>> -- > > > > > >>>>>> Robert Stupp > > > > > >>>>>> @snazy > > > > > >>>>>> > > > > > >>>>>> > > > > > >>>> -- > > > > > >>>> Robert Stupp > > > > > >>>> @snazy > > > > > >>>> > > > > > -- > > > > > Robert Stupp > > > > > @snazy > > > > > > > > > > > >