Or even:
redis_cluster_known_nodes != redis_cluster_known_nodes offset 5m
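
For reference, a rough sketch of how these comparisons behave, assuming the label sets shown further down the thread. The first expression returns only the series whose value differs from that same series' value 5 minutes earlier, so an alert on it stays active for about 5 minutes after each change and then clears. The second is an untested variant that compares against an expected total; it assumes every redis exporter target carries the `group` label on `up`, as the `count by` output quoted below suggests.

```promql
# Active for roughly 5m after a node's known-node count changes.
# Series with no sample 5m ago produce no result.
redis_cluster_known_nodes != redis_cluster_known_nodes offset 5m

# Assumption: counting "up" per group (rather than per service and group)
# yields the expected total, e.g. 5 instances x 2 services = 10 for group-a.
redis_cluster_known_nodes
  != on (group) group_left()
  count by (group) (up{service=~"exporter-redis-.*"})
```

If the first expression is used in an alerting rule, a `for:` duration of 5m or more may keep it from ever firing, since the condition stops being true once the offset has moved past the change.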

On Tuesday, 18 October 2022 at 20:12:27 UTC+1 marc.k...@gmail.com wrote:

> Perhaps an easier option would be to compare redis_cluster_known_nodes
> against what it was some time interval ago:
> redis_cluster_known_nodes != avg_over_time(redis_cluster_known_nodes[1d:4h])
>
> It's less than ideal, since it doesn't use a static, expected value for the
> total number of cluster nodes, and it would also match when the node count
> returns to the expected value, but I can deal with that for now.
>
> Thanks for your help! 
>
> On Tuesday, October 18, 2022 at 7:27:10 AM UTC-4 marc koser wrote:
>
>> > So really it boils down to, what's a "node" and how do you count them?
>> > Is a single "node" a whole cluster, or is a cluster a collection of nodes?
>>
>> A node is a redis service that is part of a cluster (identified by the
>> `group` label), so a cluster is a collection of nodes. The total number of
>> nodes is determinate and, under normal circumstances, static. But since a
>> redis node is never forgotten unless the cluster is told to forget it, I
>> want to alert on this case, because it can skew the interpolation of other
>> metrics.
>>
>> > In particular, what do these metrics mean?
>> > 
>> > redis_cluster_known_nodes{group="group-a", instance="node-1", job="redis-cluster", service="exporter-redis-6379", team="sre"} 10
>> > redis_cluster_known_nodes{group="group-a", instance="node-2", job="redis-cluster", service="exporter-redis-6379"} 11
>> > redis_cluster_known_nodes{group="group-a", instance="node-3", job="redis-cluster", service="exporter-redis-6379"} 16
>> > redis_cluster_known_nodes{group="group-a", instance="node-4", job="redis-cluster", service="exporter-redis-6379"} 16
>> > redis_cluster_known_nodes{group="group-a", instance="node-5", job="redis-cluster", service="exporter-redis-6379"} 16
>>
>> This represents the state of all known redis nodes belonging to a single 
>> cluster relative to a running node.
>>
>> > They are all the same "service", but how come instance "node-1"
>> > contains or sees 10 "nodes", but instance "node-2" contains or sees 11
>> > "nodes", and the other instances contain or see 16 "nodes"?  Perhaps this
>> > inconsistency is the error you're trying to detect - in which case, what
>> > do you think is the correct number of nodes?
>>
>> This is indeed the scenario I'm attempting to query for. In this case,
>> when a node is joined to the cluster but becomes unreachable for any reason
>> (e.g. redis is uninstalled and re-installed, and the node rejoins the
>> cluster), the node's ID changes: the new ID is valid and reachable, while
>> the old ID is no longer valid and is unreachable.
>>
>> The correct value is 10: 5 `instance`s x 2 `service`s
>>
>> > Let's say 16 is the correct answer for group="group-a" and
>> > service="exporter-redis-6379".  Perhaps you didn't show the full set of
>> > "up" metrics.  In which case, I'd first try to build an "up" query which
>> > gives the expected answer 16 on the right-hand side.  Maybe something
>> > like this:
>> >
>> >     count by (service, group) (up{service=~"exporter-redis-.*"})
>> >
>> > What does that expression show?
>>
>> {group="group-a", service="exporter-redis-6379"} 5
>> {group="group-a", service="exporter-redis-6380"} 5
>>
>> > When you have that part working, then we can work on matching the LHS.
>> > Since each *instance* seems to have its own distinct idea of the total
>> > number of nodes, then I expect this requires an N:1 match on
>> > (group,service).  That is, there is 1 "should be" value for a given
>> > (service,group) on the RHS, and multiple nodes each with their own count
>> > of (service,group) on the LHS.
>>
>> That sounds accurate
>>
>> > If that's the case, it might end up something like this:
>> > 
>> >     redis_cluster_known_nodes != on (service, group) group left() count by (service, group) (up{service=~"exporter-redis-.*"})
>> > 
>> > but at this point I'm just speculating.
>>
>> This gives the same result as before.
>>
>> I'll keep plugging away at this to see what I can come up with.
>>
>> On Tuesday, October 18, 2022 at 3:36:49 AM UTC-4 Brian Candler wrote:
>>
>>> Sorry, I missed an underscore there.
>>>
>>>    redis_cluster_known_nodes != on (service, group) *group_left*() count by (service, group) (up{service=~"exporter-redis-.*"})
>>>
>>>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/0280feae-1f5f-427f-9000-6803626be449n%40googlegroups.com.
