GitHub user Duansg added a comment to the discussion: Discussion on fingerprint generation logic
After some thought, perhaps we could collect this kind of cardinality statistic for alert labels more appropriately by using HertzBeat's own metrics (https://github.com/apache/hertzbeat/pull/3641), via cardinality sampling before each alert evaluation. For example:

```
# HELP hertzbeat_alert_cardinality Estimation of the cardinality of labels for alert rules or rule groups
# TYPE hertzbeat_alert_cardinality gauge
hertzbeat_alert_cardinality{define="alert_define_1"} 15
hertzbeat_alert_cardinality{define="alert_define_2"} 48000
hertzbeat_alert_cardinality{define="alert_define_3"} 2
hertzbeat_alert_cardinality{} 48017
```

With this implementation approach:

1. Since alerts are fundamentally triggered by a set of labeled time series, collecting cardinality per alert rule (dimension-based collection) avoids emitting an excessive number of metrics and driving up the cardinality of the metrics themselves.
2. It focuses on the HertzBeat monitoring system itself, rather than performing additional statistical analysis and visualization on the input metrics.
3. The implementation can be efficient, low-overhead, and visualizable, and users could even configure `base alerts` on these self-metrics.

BTW, why are we discussing this issue at all? Here's a real example. Suppose my metrics carry the following label dimensions: 10,000 REST APIs, 20 jobs, and 10 environments. A rule over these labels could theoretically generate 10,000 × 20 × 10 = 2,000,000 label combinations. When the number of label combinations becomes extremely large, we hit the so-called **high cardinality problem**: an alert system (or internal alert module) maintains a state (firing/resolved) for each unique label combination, which is the source of high-cardinality real-time alerting. Although most of these endpoints follow the pattern /item/query/{itemId}, such issues typically stem from problems in the underlying metrics. While the root cause is not difficult to identify, giving users analytical capabilities as a preventive measure can effectively catch these issues in advance.
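To make the sampling idea above concrete, here is a minimal, self-contained sketch in plain Java (no HertzBeat internals; the metric name and `define` label follow the example above, everything else is an illustrative assumption). It counts distinct label combinations per alert rule with an exact set, standing in for a real cardinality estimator, and renders the result in Prometheus exposition format:

```java
import java.util.*;

public class AlertCardinalitySampler {
    // define -> set of distinct label combinations seen for that alert rule.
    // An exact HashSet stands in for a bounded-memory sketch like HyperLogLog.
    private final Map<String, Set<String>> seen = new LinkedHashMap<>();

    // Record one labeled series evaluated for an alert rule; a sorted,
    // canonical rendering of the labels is used as the set key.
    public void sample(String define, Map<String, String> labels) {
        String key = new TreeMap<>(labels).toString();
        seen.computeIfAbsent(define, d -> new HashSet<>()).add(key);
    }

    // Render per-rule gauges plus an overall total, mirroring the
    // exposition-format example above.
    public String render() {
        StringBuilder sb = new StringBuilder();
        sb.append("# HELP hertzbeat_alert_cardinality Estimation of the cardinality of labels for alert rules or rule groups\n");
        sb.append("# TYPE hertzbeat_alert_cardinality gauge\n");
        long total = 0;
        for (Map.Entry<String, Set<String>> e : seen.entrySet()) {
            sb.append("hertzbeat_alert_cardinality{define=\"").append(e.getKey())
              .append("\"} ").append(e.getValue().size()).append('\n');
            total += e.getValue().size();
        }
        sb.append("hertzbeat_alert_cardinality{} ").append(total).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        AlertCardinalitySampler sampler = new AlertCardinalitySampler();
        sampler.sample("alert_define_1", Map.of("api", "/item/query/1", "env", "prod"));
        sampler.sample("alert_define_1", Map.of("api", "/item/query/2", "env", "prod"));
        sampler.sample("alert_define_1", Map.of("api", "/item/query/1", "env", "prod")); // duplicate
        System.out.print(sampler.render());
    }
}
```

In a real implementation the exact set would likely be replaced by a sketch such as HyperLogLog to keep memory bounded, and the gauge would be registered with the project's existing metrics registry from the PR linked above rather than hand-rendered.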
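The /item/query/{itemId} observation also points at one concrete mitigation users could apply once they spot a runaway rule: normalize raw paths back to their route template before they become label values, collapsing many combinations into one. A minimal sketch in plain Java (the `{id}` placeholder and the numeric-segment heuristic are illustrative assumptions, not HertzBeat behavior):

```java
import java.util.*;
import java.util.stream.*;

public class PathTemplating {
    // Collapse numeric path segments into a placeholder, so that
    // /item/query/1, /item/query/2, ... all map to /item/query/{id}.
    static String template(String path) {
        return Arrays.stream(path.split("/"))
                .map(seg -> seg.matches("\\d+") ? "{id}" : seg)
                .collect(Collectors.joining("/"));
    }

    public static void main(String[] args) {
        List<String> raw = List.of("/item/query/1", "/item/query/2", "/item/query/30007");
        Set<String> rawPaths = new HashSet<>(raw);                  // cardinality 3
        Set<String> templated = raw.stream().map(PathTemplating::template)
                .collect(Collectors.toSet());                       // cardinality 1
        System.out.println(rawPaths.size() + " -> " + templated.size());
    }
}
```

Templating like this is exactly the kind of downstream action a `cardinality alert` could prompt, alongside downsampling or discarding.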
Therefore, I believe this is an easily overlooked yet highly impactful issue for monitoring-system performance and maintainability (state explosion, push/aggregation pressure, query and display degradation, aggregation analysis, storage pressure, etc.). With this capability in place, users can even perform additional downsampling, throttling, or discarding when `cardinality alerts` fire later on.

FYI @tomsun28 @bigcyy @zqr10159

GitHub link: https://github.com/apache/hertzbeat/discussions/3845#discussioncomment-14905034

----
This is an automatically sent email for [email protected]. To unsubscribe, please send an email to: [email protected]
