Hi Alex

Thanks ! (I will take a look at the PR). Then it looks good to me.

Regards
JB

On Mon, May 26, 2025 at 5:33 PM Alex Dutra
<alex.du...@dremio.com.invalid> wrote:
>
> Hi JB, hi all,
>
> My PR already has a safety net similar to what you propose: if the
> cardinality of realm IDs in HTTP metrics goes above a configurable
> threshold (100 by default), a warning is printed and no more HTTP
> metrics will be recorded.
>
> Thanks,
> Alex
>
> On Mon, May 26, 2025 at 5:22 PM Dmitri Bourlatchkov <di...@apache.org> wrote:
> >
> > If we really need a safety valve, add a boolean flag like
> > "metrics.realm-id.enabled", default true. Operators who don’t want the
> > dimension can flip it off; everyone else keeps the out-of-the-box
> > visibility that keeps multi-tenant ops sane.
> >
> >
> > This is exactly what PR [1662] proposes, right?
> >
> > The "breaking change" in this PR is disabling realm_id tags by default.
> > However, users can still enable them in their specific configuration, if
> > they prefer.
> >
> > [1662]
> > https://github.com/apache/polaris/pull/1662/files#diff-bc38eee76420b33ad39904679f65ff2478cc0d4364674fe26a87561dc014b639R56-R57
> >
> > Cheers,
> > Dmitri.
> >
> > On Fri, May 23, 2025 at 6:22 PM Yufei Gu <flyrain...@gmail.com> wrote:
> >
> > > Hey folks,
> > >
> > > I’m not sold on dropping realm_id.
> > >
> > > Why it matters
> > >
> > >    1. It's pretty useful for first-line triage in multi-tenant clusters.
> > >    When a latency spike or error burst hits the dashboards, the first SRE
> > >    question is “Which realm(s)?” Having the tag lets us slice the
> > > time-series
> > >    instantly instead of hopping tools or waiting for sampled traces.
> > >    2. Cardinality is bounded in practice. Even big deployments seldom 
> > > break
> > >    a few dozen active realms per Polaris instance. If someone configures
> > >    “hundreds,” that’s an intentional choice we should surface and monitor.
> > >    Some TSDBs scale just fine in that situation.
> > >    3. Metrics and traces serve different loops. Traces are for deep-dive
> > >    forensics once you have a suspect request; metrics drive real-time
> > > alerting
> > >    and capacity planning. Relying only on traces every on-call to see 
> > > which
> > >    tenant is noisy.
> > >
> > > If we really need a safety valve, add a boolean flag like
> > > "metrics.realm-id.enabled", default true. Operators who don’t want the
> > > dimension can flip it off; everyone else keeps the out-of-the-box
> > > visibility that keeps multi-tenant ops sane.
> > > Dropping the tag outright trades day-to-day operability for a theoretical
> > > scaling worry most users will never hit. Let’s keep it and offer an 
> > > opt-out
> > > instead.
> > >
> > > Yufei
> > >
> > >
> > > On Fri, May 23, 2025 at 5:49 AM Alex Dutra <alex.du...@dremio.com.invalid>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > For reference the PR is here:
> > > https://github.com/apache/polaris/pull/1662
> > > >
> > > > Alex
> > > >
> > > > On Fri, May 23, 2025 at 2:42 PM Robert Stupp <sn...@snazy.de> wrote:
> > > > >
> > > > > +1 on adding a config and default to false - and later entirely remove
> > > > > the "issue".
> > > > >
> > > > > Sounds like good plan to me!
> > > > >
> > > > > On 23.05.25 13:52, Alex Dutra wrote:
> > > > > > Hi Mike,
> > > > > >
> > > > > > That sounds like a plan! And yes, I agree with the slow approach. I
> > > > > > will open a PR shortly with the new enablement flag.
> > > > > >
> > > > > > Alex
> > > > > >
> > > > > > On Fri, May 23, 2025 at 2:42 AM Michael Collado <
> > > > collado.m...@gmail.com> wrote:
> > > > > >> My immediate reaction is that a configuration flag would be good
> > > > enough for
> > > > > >> us. Our setup prevents requests without a valid realm, so we're not
> > > > subject
> > > > > >> to the DDOS attack you mentioned, but we should address that in the
> > > > default
> > > > > >> realm context resolver as well. I can do some further investigation
> > > > to see
> > > > > >> if it's something we could get rid of entirely (we also use
> > > log-based
> > > > > >> metrics), but I like the slow approach of adding a configuration
> > > flag,
> > > > > >> disabling it by default, then removing functionality as a best
> > > > practice,
> > > > > >> anyway.
> > > > > >>
> > > > > >> Mike
> > > > > >>
> > > > > >> On Thu, May 22, 2025 at 11:39 AM Alex Dutra
> > > > <alex.du...@dremio.com.invalid>
> > > > > >> wrote:
> > > > > >>
> > > > > >>> Hi again,
> > > > > >>>
> > > > > >>> Let's go back to the main concern: excessive memory required for
> > > > metrics.
> > > > > >>>
> > > > > >>> To move this discussion forward, would you all agree to introduce 
> > > > > >>> a
> > > > > >>> flag that activates the inclusion of realm IDs in the metrics?
> > > > > >>>
> > > > > >>> It seems like this could be a good middle ground that addresses
> > > > > >>> everyone's needs.
> > > > > >>>
> > > > > >>> Let me know your thoughts.
> > > > > >>>
> > > > > >>> Alex
> > > > > >>>
> > > > > >>>
> > > > > >>> On Wed, May 21, 2025 at 3:57 PM Robert Stupp <sn...@snazy.de>
> > > wrote:
> > > > > >>>> Ouch! That sounds very simple to exploit and quite serious.
> > > > > >>>>
> > > > > >>>> On 21.05.25 15:38, Alex Dutra wrote:
> > > > > >>>>> Also worth noting: http_server_requests_seconds_* metrics are
> > > > generated
> > > > > >>>>> even for failed requests. So it would be very easy for an
> > > attacker
> > > > to
> > > > > >>> forge
> > > > > >>>>> HTTP requests containing invalid realms and take the server 
> > > > > >>>>> down.
> > > > > >>>>>
> > > > > >>>>> Alex
> > > > > >>>>>
> > > > > >>>>> On Wed, May 21, 2025 at 3:04 PM Robert Stupp <sn...@snazy.de>
> > > > wrote:
> > > > > >>>>>
> > > > > >>>>>> OOMs are bad - it's extremely hard (if not impossible) for any
> > > > user to
> > > > > >>>>>> figure out why that happens. Bug reports would just read like
> > > > "Polaris
> > > > > >>>>>> runs into an OOM - no further information available".
> > > > > >>>>>>
> > > > > >>>>>> IMHO the system should be stable by default, not the other way
> > > > around.
> > > > > >>>>>> Even if there is a way to enable such huge cardinalities, that
> > > > flag
> > > > > >>>>>> should really be hidden and not documented anywhere - unless
> > > > there's a
> > > > > >>>>>> clear disclaimer saying: "Enabling this can lead to an
> > > > unresponsive
> > > > > >>>>>> system and risk the stability of the system. ..."
> > > > > >>>>>>
> > > > > >>>>>> I propose to investigate all metrics and reduce the cardinality
> > > > of all
> > > > > >>>>>> of those as that can easily become a serious issue.
> > > > > >>>>>>
> > > > > >>>>>> On 21.05.25 11:42, Alex Dutra wrote:
> > > > > >>>>>>> Hi Mike,
> > > > > >>>>>>>
> > > > > >>>>>>> If you have around a hundred realms or more, imho you are
> > > > already in
> > > > > >>>>>>> trouble.
> > > > > >>>>>>>
> > > > > >>>>>>> The most problematic metric is the
> > > http_server_requests_seconds_*
> > > > > >>> metrics
> > > > > >>>>>>> family, e.g.:
> > > > > >>>>>>>
> > > > > >>>>>>>
> > > > > >>>
> > > >
> > > http_server_requests_seconds_count{application="Polaris",method="GET",outcome="CLIENT_ERROR",realm_id="POLARIS",status="404",uri="NOT_FOUND",}
> > > > > >>>>>>> 1.0
> > > > > >>>>>>>
> > > > > >>>>>>> Since metrics in this family already have a high cardinality
> > > > > >>> potential
> > > > > >>>>>>> given the number of tags it supports by default, adding one
> > > more
> > > > > >>>>>> dimension
> > > > > >>>>>>> to them makes things (exponentially) worse.
> > > > > >>>>>>>
> > > > > >>>>>>> It's very easy to demonstrate that by running a small test 
> > > > > >>>>>>> [1].
> > > > On my
> > > > > >>>>>>> machine, the first 2 iterations (10 and 100 realms) complete,
> > > but
> > > > > >>> the 3rd
> > > > > >>>>>>> iteration (1000 realms) runs for about 1 minute then ends up 
> > > > > >>>>>>> in
> > > > > >>>>>>> java.lang.OutOfMemoryError: Java heap space.
> > > > > >>>>>>>
> > > > > >>>>>>> That's why I advocated for removing the tag. If however you
> > > > really
> > > > > >>> want
> > > > > >>>>>> to
> > > > > >>>>>>> keep it, I'd suggest introducing a configuration flag to
> > > disable
> > > > it
> > > > > >>> in
> > > > > >>>>>> two
> > > > > >>>>>>> problematic metric families: the HTTP one shown above, and the
> > > > > >>>>>> per-endpoint
> > > > > >>>>>>> metrics as well.
> > > > > >>>>>>>
> > > > > >>>>>>> Thanks,
> > > > > >>>>>>>
> > > > > >>>>>>> Alex
> > > > > >>>>>>>
> > > > > >>>>>>> [1]:
> > > > https://gist.github.com/adutra/414fe773e8727304b34e9249299c988d
> > > > > >>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>> On Wed, May 21, 2025 at 7:35 AM Michael Collado <
> > > > > >>> collado.m...@gmail.com>
> > > > > >>>>>>> wrote:
> > > > > >>>>>>>
> > > > > >>>>>>>> Hmm, we do use the realm tag in our metric publishing. I
> > > > understand
> > > > > >>> the
> > > > > >>>>>>>> concern re: cardinality. Maybe we can support filtering
> > > metrics
> > > > that
> > > > > >>>>>> have
> > > > > >>>>>>>> realm and support another metric without realm?
> > > > > >>>>>>>>
> > > > > >>>>>>>> On Mon, May 19, 2025 at 12:24 PM Dmitri Bourlatchkov <
> > > > > >>> di...@apache.org>
> > > > > >>>>>>>> wrote:
> > > > > >>>>>>>>
> > > > > >>>>>>>>> Removing realm_id from metrics tags makes sense to me (to
> > > avoid
> > > > > >>> high
> > > > > >>>>>>>>> cardinality).
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> If we need to have insight into load differences from realm
> > > to
> > > > > >>> realm,
> > > > > >>>>>> it
> > > > > >>>>>>>>> might be preferable to introduce metrics dedicated to that
> > > > rather
> > > > > >>> than
> > > > > >>>>>>>>> increasing the cardinality of every endpoint metric.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> Cheers,
> > > > > >>>>>>>>> Dmitri.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> On Thu, May 15, 2025 at 3:30 PM Alex Dutra
> > > > > >>>>>> <alex.du...@dremio.com.invalid
> > > > > >>>>>>>>> wrote:
> > > > > >>>>>>>>>
> > > > > >>>>>>>>>> Hi all,
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> I would like to suggest removing the "realm_id" metric tag
> > > > > >>> entirely.
> > > > > >>>>>>>>>> My concern is that this tag has the potential for high
> > > > > >>> cardinality,
> > > > > >>>>>>>> which
> > > > > >>>>>>>>>> is generally considered a bad practice when dealing with
> > > > metrics.
> > > > > >>> High
> > > > > >>>>>>>>>> cardinality can lead to performance issues and increased
> > > > memory
> > > > > >>> usage.
> > > > > >>>>>>>>>> Granted, the default realm resolver in Polaris is tailored
> > > for
> > > > > >>> just a
> > > > > >>>>>>>>>> handful of realms, but nothing prevents users from 
> > > > > >>>>>>>>>> declaring
> > > > > >>> hundreds
> > > > > >>>>>>>> of
> > > > > >>>>>>>>>> realms.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> I believe we can still effectively monitor Polaris servers
> > > > without
> > > > > >>>>>> this
> > > > > >>>>>>>>>> specific tag, since the realm ID is also propagated in
> > > traces
> > > > > >>> emitted
> > > > > >>>>>>>> by
> > > > > >>>>>>>>>> Polaris. Tracing is a much better fit for high-cardinality
> > > > > >>> domains.
> > > > > >>>>>>>>>> I'm open to discussing this further; a potential 
> > > > > >>>>>>>>>> alternative
> > > > > >>> would be
> > > > > >>>>>>>> to
> > > > > >>>>>>>>>> introduce a flag to disable this specific metric tag, but I
> > > > feel
> > > > > >>> like
> > > > > >>>>>>>>>> removing it would be a much cleaner approach.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> Let me know your thoughts.
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> Thanks,
> > > > > >>>>>>>>>>
> > > > > >>>>>>>>>> Alex
> > > > > >>>>>>>>>>
> > > > > >>>>>> --
> > > > > >>>>>> Robert Stupp
> > > > > >>>>>> @snazy
> > > > > >>>>>>
> > > > > >>>>>>
> > > > > >>>> --
> > > > > >>>> Robert Stupp
> > > > > >>>> @snazy
> > > > > >>>>
> > > > > --
> > > > > Robert Stupp
> > > > > @snazy
> > > > >
> > > >
> > >

Reply via email to