Hi, this question might be better suited for either a GitHub Discussion/Issue at https://github.com/thanos-io/thanos or the #thanos Slack channel.
Thanks, good idea, but unfortunately it's not that simple for Prometheus. The reason is that the data each Prometheus replica stores is not replicated between instances. This means that if one instance goes down and comes back up, the replicas queried together may show data without gaps, yet querying any single replica can still show gaps (a small illustration of this is at the bottom of this message).

There are a few things one can do:

* Deploy more than two Prometheus replicas (3+) and make the store API / query layer replica-aware, so that not seeing one replica of a group is fine while not seeing two aborts the evaluation (a rough sketch of such logic is also at the bottom of this message). This would need all rollouts to be gradual too. If you are interested in such a flow, it would be a valid feature request on the Thanos GitHub issues, and it wouldn't be hard to implement.
* Don't do global alert rules; evaluate most of them locally, so alerts stay close to the data source.
* If you really want more availability, remote write might be a better approach, as Thanos Receivers can easily do 3x or greater replication (needed for the ingestion part). Since alerting then runs on top of the Receivers, with the data within the same network, the querying path is more reliable too.

You can of course mix those two deployment models (use Receive for some clusters and Prometheus + Sidecar for others).

Kind Regards,
Bartek Płotka (@bwplotka)

On Wed, Jun 2, 2021 at 8:06 PM Karthik J <[email protected]> wrote:

> Hello team,
>
> We are currently evaluating Thanos as a solution for horizontally scaling our Prometheus setup.
>
> For global rule evaluation with Thanos Ruler, one has to make a tradeoff between availability and accuracy. For our use case, we favor accuracy over availability, but we are wondering if the availability side of the tradeoff can be improved.
>
> The Thanos querier declares a response partial when at least one instance exposing the Store API is down. Systems preferring accuracy will "abort" rule evaluations during partial responses. But considering that a typical Prometheus HA setup contains replicas of Prometheus instances, it's very inconvenient to abort alert rule evaluations every time any single replica is down. Any one instance could be down for various reasons (scheduled maintenance, patching, deployment, etc.).
>
> Is there any way to improve the availability of global alert rules?
>
> Does it make sense to enhance the Store APIs to be replica-aware? During partial responses, can the querier indicate whether it failed to retrieve data from all replicas or only from a subset of them?
>
> Thanks
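P.S. To make the "gaps" point above concrete, here is a tiny Go illustration. This is not Thanos code; the sample type and dedupMerge function are made up for this example, and real Thanos deduplication works on replica labels with a penalty algorithm rather than exact timestamp matches. Replica A misses a scrape while it is being restarted; the merged view of both replicas is complete, but A alone is not:

```go
package main

import (
	"fmt"
	"sort"
)

// sample is one scraped value at a millisecond timestamp.
type sample struct {
	ts    int64
	value float64
}

// dedupMerge combines the series of two Prometheus HA replicas,
// filling gaps in one replica with samples from the other.
func dedupMerge(a, b []sample) []sample {
	seen := map[int64]bool{}
	out := append([]sample{}, a...)
	for _, s := range a {
		seen[s.ts] = true
	}
	for _, s := range b {
		if !seen[s.ts] {
			out = append(out, s)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].ts < out[j].ts })
	return out
}

func main() {
	replicaA := []sample{{1000, 1}, {3000, 3}}            // missed the scrape at t=2000
	replicaB := []sample{{1000, 1}, {2000, 2}, {3000, 3}} // was up the whole time
	fmt.Println(dedupMerge(replicaA, replicaB))           // complete: [{1000 1} {2000 2} {3000 3}]
	fmt.Println(replicaA)                                 // on its own, A still has the gap
}
```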
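And here is a very rough sketch of the replica-aware abort logic from the first bullet. Nothing like this exists in Thanos today; storeStatus, replicaGroup and shouldAbort are hypothetical names used only to show the idea of "one missing replica per HA group is fine, two are not":

```go
package main

import "fmt"

// storeStatus is a hypothetical view of one StoreAPI endpoint as the querier
// could see it: which HA replica group it belongs to (e.g. derived from
// external labels) and whether it is currently reachable.
type storeStatus struct {
	replicaGroup string
	healthy      bool
}

// shouldAbort applies the rule from the first bullet: not seeing one replica
// of a group is tolerated, not seeing two (or more) aborts the evaluation
// instead of silently acting on partial data.
func shouldAbort(stores []storeStatus) bool {
	total := map[string]int{}
	healthy := map[string]int{}
	for _, s := range stores {
		total[s.replicaGroup]++
		if s.healthy {
			healthy[s.replicaGroup]++
		}
	}
	for group, n := range total {
		if n-healthy[group] >= 2 {
			return true // two or more replicas of one HA group are gone
		}
	}
	return false // every group is missing at most one replica
}

func main() {
	stores := []storeStatus{
		{replicaGroup: "cluster-a", healthy: true},
		{replicaGroup: "cluster-a", healthy: true},
		{replicaGroup: "cluster-a", healthy: false}, // one of three down: tolerated
		{replicaGroup: "cluster-b", healthy: true},
		{replicaGroup: "cluster-b", healthy: true},
	}
	fmt.Println(shouldAbort(stores)) // false
}
```

The point is only that such a decision needs to know which endpoints are replicas of the same data; today the querier only knows that some store was unreachable.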

