Hi, this question might be better suited for either a GitHub Discussion/Issue at https://github.com/thanos-io/thanos or the #thanos Slack channel.
Thanks, good idea, but unfortunately it's not that simple for Prometheus. The reason is that the data each Prometheus replica stores is not replicated between instances. This means that if one instance goes down and comes back up, the replicas queried together may show data without gaps, yet querying any single replica can still show gaps (a small illustration of this is at the bottom of this message).

There are a few things one can do:

* Deploy more than two Prometheus replicas (3+) and make the store API / query layer replica-aware, so that not seeing one replica of a group is fine while not seeing two aborts the evaluation (a rough sketch of such logic is also at the bottom of this message). This would need all rollouts to be gradual too. If you are interested in such a flow, it would be a valid feature request on the Thanos GitHub issues, and it wouldn't be hard to implement.
* Don't do global alert rules; evaluate most of them locally, so alerts stay close to the data source.
* If you really want more availability, remote write might be a better approach, as Thanos Receivers can easily do 3x or greater replication (needed for the ingestion part). Since alerting then runs on top of the Receivers, with the data within the same network, the querying path is more reliable too.

You can of course mix those two deployment models (use Receive for some clusters and Prometheus + Sidecar for others).

Kind Regards,
Bartek Płotka (@bwplotka)

On Wed, Jun 2, 2021 at 8:06 PM Karthik J <[email protected]> wrote:

> Hello team,
>
> We are currently evaluating Thanos as a solution for horizontally scaling our Prometheus setup.
>
> For global rule evaluation with Thanos Ruler, one has to make a tradeoff between availability and accuracy. For our use case, we favor accuracy over availability, but we are wondering if the availability side of the tradeoff can be improved.
>
> The Thanos querier declares a response partial when at least one instance exposing the Store API is down. Systems preferring accuracy will "abort" rule evaluations during partial responses. But considering that a typical Prometheus HA setup contains replicas of Prometheus instances, it's very inconvenient to abort alert rule evaluations every time any single replica is down. Any one instance could be down for various reasons (scheduled maintenance, patching, deployment, etc.).
>
> Is there any way to improve the availability of global alert rules?
>
> Does it make sense to enhance the Store APIs to be replica-aware? During partial responses, can the querier indicate whether it failed to retrieve data from all replicas or only from a subset of them?
>
> Thanks
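P.S. To make the "gaps" point above concrete, here is a tiny Go illustration. This is not Thanos code; the sample type and dedupMerge function are made up for this example, and real Thanos deduplication works on replica labels with a penalty algorithm rather than exact timestamp matches. Replica A misses a scrape while it is being restarted; the merged view of both replicas is complete, but A alone is not:

```go
package main

import (
	"fmt"
	"sort"
)

// sample is one scraped value at a millisecond timestamp.
type sample struct {
	ts    int64
	value float64
}

// dedupMerge combines the series of two Prometheus HA replicas,
// filling gaps in one replica with samples from the other.
func dedupMerge(a, b []sample) []sample {
	seen := map[int64]bool{}
	out := append([]sample{}, a...)
	for _, s := range a {
		seen[s.ts] = true
	}
	for _, s := range b {
		if !seen[s.ts] {
			out = append(out, s)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].ts < out[j].ts })
	return out
}

func main() {
	replicaA := []sample{{1000, 1}, {3000, 3}}            // missed the scrape at t=2000
	replicaB := []sample{{1000, 1}, {2000, 2}, {3000, 3}} // was up the whole time
	fmt.Println(dedupMerge(replicaA, replicaB))           // complete: [{1000 1} {2000 2} {3000 3}]
	fmt.Println(replicaA)                                 // on its own, A still has the gap
}
```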
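And here is a very rough sketch of the replica-aware abort logic from the first bullet. Nothing like this exists in Thanos today; storeStatus, replicaGroup and shouldAbort are hypothetical names used only to show the idea of "one missing replica per HA group is fine, two are not":

```go
package main

import "fmt"

// storeStatus is a hypothetical view of one StoreAPI endpoint as the querier
// could see it: which HA replica group it belongs to (e.g. derived from
// external labels) and whether it is currently reachable.
type storeStatus struct {
	replicaGroup string
	healthy      bool
}

// shouldAbort applies the rule from the first bullet: not seeing one replica
// of a group is tolerated, not seeing two (or more) aborts the evaluation
// instead of silently acting on partial data.
func shouldAbort(stores []storeStatus) bool {
	total := map[string]int{}
	healthy := map[string]int{}
	for _, s := range stores {
		total[s.replicaGroup]++
		if s.healthy {
			healthy[s.replicaGroup]++
		}
	}
	for group, n := range total {
		if n-healthy[group] >= 2 {
			return true // two or more replicas of one HA group are gone
		}
	}
	return false // every group is missing at most one replica
}

func main() {
	stores := []storeStatus{
		{replicaGroup: "cluster-a", healthy: true},
		{replicaGroup: "cluster-a", healthy: true},
		{replicaGroup: "cluster-a", healthy: false}, // one of three down: tolerated
		{replicaGroup: "cluster-b", healthy: true},
		{replicaGroup: "cluster-b", healthy: true},
	}
	fmt.Println(shouldAbort(stores)) // false
}
```

The point is only that such a decision needs to know which endpoints are replicas of the same data; today the querier only knows that some store was unreachable.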

