Hello team,

We are currently evaluating Thanos as a solution for horizontally scaling 
our Prometheus setup. 

For global rule evaluation with Thanos Ruler, one has to make a tradeoff 
between availability and accuracy. For our use case, we favor accuracy 
over availability, but we are wondering whether the availability side of 
that tradeoff can be improved.

Thanos Querier declares a response partial when at least one instance 
exposing the Store API is down. Systems that prefer accuracy will “abort” 
rule evaluation on partial responses. But since a typical Prometheus HA 
setup runs replicas of each Prometheus instance, it is very inconvenient 
to abort alert rule evaluation every time a single replica is down. Any 
one instance could be down for various reasons (scheduled maintenance, 
patching, deployment, etc.).

Is there any way to improve the availability of global alert rules? 

Does it make sense to enhance the Store API to be replica-aware? For a 
partial response, could the querier indicate whether the error affected 
all replicas of a Prometheus instance or only a subset of them?
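
To make the idea concrete, here is a rough sketch of the classification we 
have in mind. It is not Thanos code and does not use any Thanos API: the 
function name, the three result strings, and the "replica" label name are 
all our own assumptions for illustration. Stores are grouped by their 
external labels with the replica label removed, and a response only counts 
as missing data when every replica in some group is down.

```python
from collections import defaultdict


def classify_partial_response(stores, replica_label="replica"):
    """Classify a query result from per-store health (hypothetical helper).

    stores: iterable of (external_labels: dict, healthy: bool) pairs.
    replica_label: label that distinguishes HA replicas (assumed name).
    """
    # Group stores that differ only in the replica label.
    groups = defaultdict(list)
    for labels, healthy in stores:
        key = frozenset(
            (name, value)
            for name, value in labels.items()
            if name != replica_label
        )
        groups[key].append(healthy)

    # Every store answered: nothing is partial.
    if all(all(health) for health in groups.values()):
        return "complete"
    # Some group has no healthy replica left: data may really be missing.
    if any(not any(health) for health in groups.values()):
        return "data_missing"
    # Each group still has at least one healthy replica.
    return "replica_degraded"


stores = [
    ({"cluster": "a", "replica": "0"}, True),
    ({"cluster": "a", "replica": "1"}, False),
    ({"cluster": "b", "replica": "0"}, True),
]
print(classify_partial_response(stores))  # replica_degraded
```

With something like this, the ruler could abort only on "data_missing" and 
keep evaluating on "replica_degraded". One caveat we are aware of: a 
replica that is up can still have gaps of its own, so this only narrows 
the tradeoff rather than removing it.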

Thanks
