> So really it boils down to, what's a "node" and how do you count them?
> Is a single "node" a whole cluster, or is a cluster a collection of nodes?

A node is a redis service that is part of a cluster (identified by the
`group` label), so a cluster is a collection of nodes. The total number of
nodes is known in advance and, under normal circumstances, static. However,
redis never forgets a node unless it is explicitly told to, and I want to
alert on that case because it can skew the interpretation of other metrics.

> In particular, what do these metrics mean?
> 
> redis_cluster_known_nodes{group="group-a", instance="node-1", job="redis-cluster", service="exporter-redis-6379", team="sre"} 10
> redis_cluster_known_nodes{group="group-a", instance="node-2", job="redis-cluster", service="exporter-redis-6379"} 11
> redis_cluster_known_nodes{group="group-a", instance="node-3", job="redis-cluster", service="exporter-redis-6379"} 16
> redis_cluster_known_nodes{group="group-a", instance="node-4", job="redis-cluster", service="exporter-redis-6379"} 16
> redis_cluster_known_nodes{group="group-a", instance="node-5", job="redis-cluster", service="exporter-redis-6379"} 16

Each sample is the number of redis nodes that the reporting instance
currently knows about in its cluster, i.e. the cluster membership as seen
from that one running node.
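
To see how far the per-instance views diverge, one option is to compare the
largest and smallest count reported within each cluster. This is only a
sketch: it assumes every exporter carries the `group` and `service` labels
shown in the samples above.

```promql
# Spread between the highest and lowest node count seen per cluster/service.
# 0 means every instance agrees; anything else means the views differ.
  max by (group, service) (redis_cluster_known_nodes)
- min by (group, service) (redis_cluster_known_nodes)
```

For the five samples quoted above this would come out as 16 - 10 = 6.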

> They are all the same "service", but how come instance "node-1" contains
> or sees 10 "nodes", but instance "node-2" contains or sees 11 "nodes", and
> the other instances contain or see 16 "nodes"?  Perhaps this inconsistency
> is the error you're trying to detect - in which case, what do you think is
> the correct number of nodes?

This is indeed the scenario I'm trying to query for. When a node has joined
the cluster and later becomes unreachable for any reason (e.g. redis is
uninstalled and re-installed, and the node rejoins the cluster), the node's
ID changes: the new ID is valid and reachable, while the old ID is no longer
valid and is unreachable. The old ID still counts as a known node until it
is explicitly forgotten.

The correct value is 10: 5 instances x 2 services.
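
As a rough sketch, that expected total can be derived from `up` by counting
per `group` only, so that the exporters for both services land in the same
bucket. It assumes the `up` series carry the `group` label and that every
redis exporter matches the `exporter-redis-.*` pattern.

```promql
# Expected nodes per cluster: one per scraped redis exporter target,
# across all services in the group (5 instances x 2 services here).
count by (group) (up{service=~"exporter-redis-.*"})
```

With 5 targets for each of the two services in group-a, this should return
10 for `group="group-a"`.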

> Let's say 16 is the correct answer for group="group-a" and
> service="exporter-redis-6379".  Perhaps you didn't show the full set of
> "up" metrics.  In which case, I'd first try to build an "up" query which
> gives the expected answer 16 on the right-hand side.  Maybe something like
> this:
>
>     count by (service, group) (up{service=~"exporter-redis-.*"})
>
> What does that expression show?

{group="group-a", service="exporter-redis-6379"} 5
{group="group-a", service="exporter-redis-6380"} 5

> When you have that part working, then we can work on matching the LHS.
> Since each *instance* seems to have its own distinct idea of the total
> number of nodes, then I expect this requires an N:1 match on
> (group,service).  That is, there is 1 "should be" value for a given
> (service,group) on the RHS, and multiple nodes each with their own count of
> (service,group) on the LHS.

That sounds accurate.

> If that's the case, it might end up something like this:
> 
>     redis_cluster_known_nodes != on (service, group) group left() count by (service, group) (up{service=~"exporter-redis-.*"})
> 
> but at this point I'm just speculating.

This gives the same result as before.
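
My guess at why: the right-hand side is counted per (service, group), so it
is 5, and every instance reports more than 5 known nodes, so every series
passes the `!=` filter. A variant I may try next matches on `group` alone,
so the right-hand side becomes the total across both services. This is an
untested sketch and assumes both metrics carry the same `group` label.

```promql
# Return each instance whose known-node count differs from the number of
# redis exporter targets in its cluster (many instances : one expected value).
redis_cluster_known_nodes
  != on (group) group_left()
count by (group) (up{service=~"exporter-redis-.*"})
```

With the sample values above and an expected total of 10, node-1 (10) would
drop out, while node-2 (11) and node-3 to node-5 (16) would be returned.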

I'll keep plugging away at this to see what I can come up with.

On Tuesday, October 18, 2022 at 3:36:49 AM UTC-4 Brian Candler wrote:

> Sorry, I missed an underscore there.
>
>    redis_cluster_known_nodes != on (service, group) *group_left*() count 
> by (service, group) (up{service=~"exporter-redis-.*"})
>
>
