GitHub user Duansg added a comment to the discussion: Discussion on fingerprint 
generation logic

After some thought, perhaps we could collect this kind of cardinality statistic for alert labels more appropriately through HertzBeat's own metrics 
(https://github.com/apache/hertzbeat/pull/3641), by sampling label cardinality 
before each alert evaluation. For example:

```
# HELP hertzbeat_alert_cardinality Estimated label cardinality per alert rule or rule group
# TYPE hertzbeat_alert_cardinality gauge
hertzbeat_alert_cardinality{define="alert_define_1"} 15
hertzbeat_alert_cardinality{define="alert_define_2"} 48000
hertzbeat_alert_cardinality{define="alert_define_3"} 2
hertzbeat_alert_cardinality 48017
```
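As a rough illustration (class and method names below are hypothetical, not HertzBeat's actual API), the exposition above could be rendered from a per-rule cardinality map like this:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: render per-rule cardinality counts in the
// Prometheus text exposition format shown above.
public class CardinalityExposition {

    static String render(Map<String, Long> perRule) {
        StringBuilder sb = new StringBuilder();
        sb.append("# HELP hertzbeat_alert_cardinality Estimated label cardinality per alert rule\n");
        sb.append("# TYPE hertzbeat_alert_cardinality gauge\n");
        long total = 0;
        for (Map.Entry<String, Long> e : perRule.entrySet()) {
            sb.append("hertzbeat_alert_cardinality{define=\"")
              .append(e.getKey()).append("\"} ").append(e.getValue()).append('\n');
            total += e.getValue();
        }
        // Unlabeled aggregate series covering the whole rule group.
        sb.append("hertzbeat_alert_cardinality ").append(total).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new LinkedHashMap<>();
        counts.put("alert_define_1", 15L);
        counts.put("alert_define_2", 48000L);
        counts.put("alert_define_3", 2L);
        System.out.print(render(counts)); // ends with hertzbeat_alert_cardinality 48017
    }
}
```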

With this implementation approach:
1. Since alerts are fundamentally triggered by a set of labeled time series, 
collecting per-rule (dimension-based) statistics avoids emitting excessive 
metrics and creating high cardinality in the instrumentation itself.
2. It keeps the focus on the HZB monitoring system itself, rather than 
performing additional statistical analysis and visualization on input metrics.
3. The implementation can be efficient, low-overhead, and visualizable, and 
users could even configure `base alerts` on these self-metrics.
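The sampling step itself could be as simple as counting distinct label sets per rule before evaluation. The sketch below uses assumed names, not HertzBeat's actual code:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch: exact per-rule cardinality by tracking distinct label sets.
// For very large rule groups, a probabilistic estimator (e.g. HyperLogLog)
// could replace the sets to bound memory.
public class CardinalitySampler {
    private final Map<String, Set<Map<String, String>>> seen = new HashMap<>();

    // Record one labeled series observed while evaluating a rule.
    void observe(String defineId, Map<String, String> labels) {
        seen.computeIfAbsent(defineId, k -> new HashSet<>()).add(Map.copyOf(labels));
    }

    int cardinality(String defineId) {
        return seen.getOrDefault(defineId, Set.of()).size();
    }

    public static void main(String[] args) {
        CardinalitySampler s = new CardinalitySampler();
        s.observe("alert_define_1", Map.of("env", "prod", "job", "a"));
        s.observe("alert_define_1", Map.of("env", "prod", "job", "b"));
        s.observe("alert_define_1", Map.of("job", "a", "env", "prod")); // duplicate set
        System.out.println(s.cardinality("alert_define_1")); // prints 2
    }
}
```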

BTW, why does this issue deserve discussion on its own? Here's a real example:

If I currently have the following label dimensions (10,000 REST APIs, 20 jobs, 
10 environments), then a rule over them could theoretically generate 
10,000 × 20 × 10 = 2,000,000 label combinations.
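The arithmetic is just the product of the distinct values per label dimension:

```java
// The theoretical combination count is the product of the number of
// distinct values in each label dimension.
public class LabelCombinations {
    static long combinations(long... distinctValuesPerLabel) {
        long product = 1;
        for (long n : distinctValuesPerLabel) {
            product = Math.multiplyExact(product, n); // fail loudly on overflow
        }
        return product;
    }

    public static void main(String[] args) {
        // 10,000 REST APIs x 20 jobs x 10 environments
        System.out.println(combinations(10_000, 20, 10)); // prints 2000000
    }
}
```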

When the number of label combinations becomes extremely large, we hit the 
so-called **high cardinality problem**. An alert system, or an internal alert 
module, maintains a state (firing/resolved) for each unique label combination, 
and that per-combination state is the source of high-cardinality cost in 
real-time alerting.
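Concretely, each unique label combination is usually identified by a fingerprint over its sorted labels, and the alert module keeps one state entry per fingerprint. The sketch below (FNV-1a over sorted labels) is illustrative only, not HertzBeat's actual fingerprint logic:

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative fingerprint: hash the sorted label set so the same
// combination always maps to the same state entry. (FNV-1a, 64-bit.)
public class LabelFingerprint {
    static long fingerprint(Map<String, String> labels) {
        long hash = 0xcbf29ce484222325L;
        // TreeMap sorts keys, so label insertion order never changes the result.
        for (Map.Entry<String, String> e : new TreeMap<>(labels).entrySet()) {
            for (char c : (e.getKey() + "=" + e.getValue() + ";").toCharArray()) {
                hash ^= c;
                hash *= 0x100000001b3L;
            }
        }
        return hash;
    }

    public static void main(String[] args) {
        long a = fingerprint(Map.of("api", "/item/query/{itemId}", "env", "prod"));
        long b = fingerprint(Map.of("env", "prod", "api", "/item/query/{itemId}"));
        System.out.println(a == b); // true: order does not matter
    }
}
```

Every distinct combination produces a distinct fingerprint (modulo hash collisions), which is exactly why 2,000,000 combinations mean 2,000,000 state entries.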

Although most of these interfaces follow a template such as 
/item/query/{itemId}, such issues typically stem from problems in the 
underlying metrics, e.g. raw IDs leaking into label values instead of the 
template. While the root cause is not difficult to identify after the fact, 
giving users analytical, predictive capabilities can effectively prevent these 
issues in advance.
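A typical mitigation, once the leak is spotted, is to normalize raw IDs back into the path template before they become label values. The regex below is a hypothetical example, not a rule HertzBeat ships:

```java
// Hypothetical mitigation: collapse numeric path segments back into a
// template placeholder, so 10,000 distinct paths become one label value.
public class PathNormalizer {
    static String normalize(String path) {
        return path.replaceAll("/\\d+(?=/|$)", "/{id}");
    }

    public static void main(String[] args) {
        System.out.println(normalize("/item/query/123"));  // /item/query/{id}
        System.out.println(normalize("/item/query/9876")); // /item/query/{id}
        System.out.println(normalize("/health"));          // /health (unchanged)
    }
}
```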

Therefore, I believe this is an easily overlooked yet highly impactful issue 
for monitoring system performance and maintainability (state explosion, 
push/aggregation pressure, query and display degradation, aggregation 
analysis, storage pressure, etc.). With this capability in place, users could 
even apply additional downsampling, throttling, or dropping of series when 
`cardinality alerts` fire later on.

FYI, @tomsun28 @bigcyy @zqr10159 


GitHub link: 
https://github.com/apache/hertzbeat/discussions/3845#discussioncomment-14905034

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]

