Re: [PR] Introduce `Improving Alert Accuracy with Dynamic Baselines` blog [skywalking-website]

via GitHub Mon, 24 Feb 2025 05:11:10 -0800


wu-sheng commented on code in PR #772:
URL: 
https://github.com/apache/skywalking-website/pull/772#discussion_r1967613370



##########
content/blog/2025-02-24-improving-alert-accuracy-with-dynamic-baselines/index.md:
##########
@@ -0,0 +1,199 @@
+---
+title: "Improving Alert Accuracy with Dynamic Baselines"
+date: 2025-02-24
+author: "Han Liu"
+description: "This article explores how to leverage history metrics to 
generate dynamic baselines for a future period, thereby enhancing the accuracy 
of alerts."
+---
+
+## Background
+
+[Apache SkyWalking](https://skywalking.apache.org/) is an open-source 
application performance monitoring (APM) system
+that collects various data from business applications, including metrics, 
logs, and distributed tracing information,
+and visualizes them through its UI.
+It also allows users to configure alerting rules by setting threshold values 
for specific metrics in the configuration file.
+When a metric associated with a particular service exceeds the predefined 
threshold within a given period, an alert is triggered.
+
+However, in real-world scenarios, traffic patterns and invocation behaviors 
vary across different time periods.
+For example, in a shopping system, the number of purchases is significantly 
lower during late-night hours compared to daytime.
+As a result, system metrics fluctuate within different ranges depending on the 
time of day.
+This makes it challenging to rely solely on static threshold values for 
accurate alerting.
+
+Therefore, dynamically generating thresholds for each time period based on 
historical data becomes crucial.
+
+## SkyWalking Predictor with Alarm system
+
+Based on the above scenario, we developed the [SkyWalking 
Predictor](https://github.com/SkyAPM/SkyPredictor/) project to address this problem.
+SkyWalking Predictor periodically collects data from SkyWalking and generates 
dynamic baselines.
+SkyWalking can then query the Predictor system to obtain predicted metric 
values for the recent period, enabling more precise and adaptive alerting.
+
+### Architecture diagram
+
+![Architecture](./architecture.png)
+
+As shown in the diagram, the process consists of two steps:
+
+1. **Data Collection & Prediction**: The Predictor queries history metrics 
from SkyWalking's OAP via its HTTP service,
+   then processes this data to generate predicted values for a future 
time period.
+2. **Baseline Query & Alerting**: The OAP periodically queries the 
Predictor to fetch the predicted dynamic baseline,
+   then compares the current metric values against the prediction result using 
**MQE**. If the deviation exceeds a certain threshold, an alert is triggered.
+
+### Data Collection
+
+The Predictor utilizes the following three APIs to query data:
+
+1. [**Status 
API**](https://skywalking.apache.org/docs/main/next/en/status/query_ttl_setup/):
 Retrieves the TTL (Time-to-Live) of history data stored in OAP, helping to 
determine the available time range for exporting all history metrics.
+2. [**Metadata 
API**](https://skywalking.apache.org/docs/main/next/en/api/query-protocol/#v2-apis):
 Fetches the list of services within a specified Layer from OAP, providing 
insights into which services are generating data.
+3. [**MQE 
API**](https://skywalking.apache.org/docs/main/next/en/api/metrics-query-expression/):
 Iterates through the required metrics and the list of services to fetch all 
history metrics values for each metric associated with each service.
+
+These APIs collectively enable the Predictor to gather history metrics data, 
which is then used to compute dynamic baselines for future alerting.
+
+### Prediction
+
+Once the Prediction service collects data from OAP, it proceeds with 
forecasting using the [open-source Prophet 
library](https://github.com/facebook/prophet).
+The prediction process consists of the following steps:
+
+1. **Data Preparation**: The collected metric data is split into multiple 
[DataFrames](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html),
 each corresponding to a unique combination of service + metric name.
+2. **Data Sufficiency Check**: If a DataFrame contains less than **two days** 
(configurable) of data, the prediction is skipped. This is to ensure accuracy, 
as an insufficient data volume may lead to unreliable forecasts.
+3. **Forecasting**: Using Prophet, the Predictor estimates the metric values 
for **each hour over the next day** (configurable).
+4. **Result Storage**: The generated predictions are stored in local files, 
enabling querying from external services.
+
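The steps above can be sketched roughly as follows. This is a minimal illustration, not the Predictor's actual code: `Sample`, `MIN_HISTORY`, and the function names are hypothetical, and a plain hour-of-day average stands in for Prophet so that the sketch runs with only the Python standard library. Result storage (step 4) is left out.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean
from typing import NamedTuple


class Sample(NamedTuple):
    service: str
    metric: str
    time: datetime
    value: float


# Hypothetical name for the configurable minimum history (two days in the post).
MIN_HISTORY = timedelta(days=2)


def split_by_series(samples):
    """Step 1: group samples by (service, metric name)."""
    groups = defaultdict(list)
    for s in samples:
        groups[(s.service, s.metric)].append(s)
    return groups


def has_enough_history(series):
    """Step 2: require at least MIN_HISTORY between first and last sample."""
    times = [s.time for s in series]
    return max(times) - min(times) >= MIN_HISTORY


def forecast_next_day(series):
    """Step 3 (stand-in for Prophet): average value per hour of day."""
    by_hour = defaultdict(list)
    for s in series:
        by_hour[s.time.hour].append(s.value)
    return {hour: mean(values) for hour, values in sorted(by_hour.items())}


def predict_all(samples):
    """Run steps 1-3 and return {(service, metric): {hour: predicted value}}."""
    results = {}
    for key, series in split_by_series(samples).items():
        if not has_enough_history(series):
            continue  # too little data, skip this series
        results[key] = forecast_next_day(series)
    return results


# Illustrative data: three days for svc-a, only twelve hours for svc-b.
start = datetime(2025, 2, 17)
history = [Sample("svc-a", "service_resp_time", start + timedelta(hours=h), 100.0 + (h % 24))
           for h in range(72)]
history += [Sample("svc-b", "service_resp_time", start + timedelta(hours=h), 50.0)
            for h in range(12)]
baselines = predict_all(history)
print(sorted(baselines))  # only svc-a has enough history
```

In this sketch `svc-b` is skipped because its samples span less than two days, which mirrors the sufficiency check described above.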
+#### Predicted Value
+
+The Prediction service supports calculating the following two types of values:
+1. **Predicted Value**: Computes the expected metric value for the next hour 
based on history metrics data.
+2. **Prediction Range**: Determines the possible **upper and lower bounds** 
for the metric in the next hour, representing its expected fluctuation range.
+
+These values help establish a dynamic baseline, allowing the alerting system 
to account for natural variations while accurately detecting anomalies.
+
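As a rough illustration of how the two kinds of values could drive an alert decision, the sketch below models one hour's baseline and checks an observed value against it. The `Baseline` class, both helper functions, and all numbers are hypothetical and are not taken from the Predictor or OAP code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Baseline:
    value: float  # predicted value for the hour
    lower: float  # lower bound of the prediction range
    upper: float  # upper bound of the prediction range


def outside_range(current: float, baseline: Baseline) -> bool:
    """True when the observed value falls outside the predicted range."""
    return current < baseline.lower or current > baseline.upper


def relative_deviation(current: float, baseline: Baseline) -> float:
    """Deviation from the predicted value, as a fraction of that value.

    Assumes a non-zero predicted value.
    """
    return abs(current - baseline.value) / baseline.value


# Illustrative night-time baseline for a response-time metric (milliseconds).
night = Baseline(value=120.0, lower=80.0, upper=160.0)
print(outside_range(150.0, night))       # inside the predicted range
print(outside_range(400.0, night))       # above the upper bound
print(relative_deviation(150.0, night))  # 30 / 120
```

A range check tolerates natural variation inside the predicted bounds, while a deviation check reacts to how far the metric drifts from the single predicted value.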
+### Baseline MQE with Alarm
+
+In OAP, predicted values can be queried directly using the `baseline` 
operation in MQE. This operation retrieves forecasted values for a future 
time period.
+
+Since SkyWalking's alerting system already supports queries through MQE, 
users can configure alerts directly in the alerting configuration 
file using MQE.
+
+For more details, please refer to the [official 
documentation](https://skywalking.apache.org/docs/main/next/en/api/metrics-query-expression/#baseline-operation).
+
+### Impact of Data Collection on Prediction Accuracy
+
+The Predictor service supports two data collection and prediction 
granularities, each with its own trade-offs in accuracy and resource consumption.
+
+1. Minute Level: Collects minute-level metrics data.
+   1. More effective for metrics with high fluctuations, as it captures finer 
details.
+   2. Consumes more resources (OAP, DB CPU and System Load resources, 
Predictor CPU and Memory resources).
+   3. Alerts are configured based on predicted range values.
+2. Hour Level: Collects hourly metrics data.
+   1. Less resource-intensive than minute-level collection: lower data volume and processing cost.
+   2. Alerts are configured based on current value comparisons.
+
+| Granularity | Data Fluctuation   | Data Volume | Current Value Prediction Accuracy | Range Prediction Accuracy | Best Use Case |
+|-------------|--------------------|-------------|-----------------------------------|---------------------------|---------------|
+| Minute      | Higher fluctuation | Large       | Less accurate                     | More accurate             | Ideal for highly fluctuating metrics, using range-based alerting rules |
+| Hour        | Lower fluctuation  | Small       | More accurate                     | Relatively accurate       | Suitable for stable metrics, using current value-based predictions |
+
+
+Choosing the appropriate granularity depends on the nature of the metric and 
the desired alerting method.
+For metrics with high volatility, minute-level collection provides better 
accuracy when using range-based alerts.
+For stable metrics, hourly aggregation is sufficient and allows for efficient 
predictions using current-value comparisons.
+
+The Predictor uses hour-level granularity by default.
+
+## Demo
+
+In this section, I will demonstrate how to preview the predicted values of a 
metric by deploying a SkyWalking cluster along with the Predictor service
+in a Kubernetes cluster. This hands-on example will help you understand how to 
use these components effectively.
+
+### Deploy SkyWalking Showcase
+
+SkyWalking Showcase contains a complete set of example services that can be 
monitored using SkyWalking.
+For more information, please check the [official 
documentation](https://skywalking.apache.org/docs/skywalking-showcase/next/readme/).
+
+In this demo, we only deploy the predictor service, SkyWalking OAP, and UI.
+
+```shell
+export FEATURE_FLAGS=single-node,banyandb,baseline
+make deploy.kubernetes
+```
+
+### Import History Data
+
+Since a newly deployed cluster does not contain history data,
+I have created a Python script to simulate data. This allows the Predictor 
service to import data and generate baseline predictions for a future period.
+
+Before importing data, you must expose the `11800` port of the OAP service in 
your Kubernetes cluster.
+You can achieve this using kubectl by running the following command:
+
+```shell
+kubectl port-forward -n skywalking-showcase   service/demo-oap 11800:11800
+```
+
+Then, you can download and run the demo script using the following command:
+
+```shell
+# clone and get into the demo repository
+git clone https://github.com/mrproliu/SkyPredictorDemo && cd SkyPredictorDemo
+# install dependencies
+make install
+# import 7 days of simulated data
+python3 -m client.generate localhost:11800 7
+```
+
+Finally, you can see the output in the console: **Metrics send success!**.
+
+### Predict Metrics
+
+Since the Predictor service runs on a **cron schedule**, it does not 
run immediately after the data import.
+To force it to collect data and perform a prediction, you can manually delete 
the Predictor pod, prompting Kubernetes to restart it:
+
+```shell
+kubectl delete pod -n skywalking-showcase $(kubectl get pods -n skywalking-showcase --no-headers -o custom-columns=":metadata.name" | grep "skywalking-predictor")
+```
+
+Once the Predictor pod restarts, you can check its logs to confirm that the 
prediction process has been completed.
+
+```
+Predicted for e2e-test-dest-service of service_xxx to xxxx-xx-xx xx:xx:xx.
+```
+
+### View in SkyWalking UI
+
+Once the prediction process is complete, you can visualize the predicted 
values in the SkyWalking UI by configuring the appropriate metric widgets.
+
+First, run the following command to forward the UI service port to your local 
machine:
+
+```shell
+kubectl port-forward svc/demo-ui 8080:80 --namespace skywalking-showcase
+```
+
+Then, you can access this page to view the service traffic that was generated 
using the Python script earlier:
+http://localhost:8080/dashboard/MESH/Service/ZTJlLXRlc3QtZGVzdC1zZXJ2aWNl.1/Mesh-Service
+
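The path segment `ZTJlLXRlc3QtZGVzdC1zZXJ2aWNl.1` in the URL above looks like a service ID built from the Base64-encoded service name plus a suffix. The sketch below decodes it with the Python standard library. `decode_service_id` is a hypothetical helper, and reading the `.1` suffix as a flag is an assumption here, not something stated in this post.

```python
import base64


def decode_service_id(service_id: str) -> tuple[str, str]:
    """Split '<base64 name>.<suffix>' and decode the name part."""
    encoded_name, _, suffix = service_id.rpartition(".")
    name = base64.b64decode(encoded_name).decode("utf-8")
    return name, suffix


print(decode_service_id("ZTJlLXRlc3QtZGVzdC1zZXJ2aWNl.1"))
# ('e2e-test-dest-service', '1')
```

The decoded name matches the `e2e-test-dest-service` service that appears in the Predictor log line shown earlier.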
+To display predicted values, edit the Service Avg Resp Time Widget and add the 
following MQE:
+
+```
+# The maximum predicted response time.
+baseline(service_resp_time, upper)
+# The predicted response time.
+baseline(service_resp_time, value)
+# The minimum predicted response time.
+baseline(service_resp_time, lower)
+```
+
+Finally, you can see the predicted values displayed in the widget.
+
+![Predicted Widget](predicted_widget.png)
+
+Since the default data collection is hourly and the metric has significant 
fluctuations,
+the predicted values are derived from hourly averages rather than minute-level 
granularity.
+This approach smooths out fluctuations and provides a more stable baseline for 
monitoring.
+
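The smoothing effect of hourly aggregation can be seen in a small sketch. The data is made up for illustration, and `hourly_averages` is a hypothetical helper rather than part of the Predictor.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean


def hourly_averages(points):
    """Average (timestamp, value) points into one value per hour bucket."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts.replace(minute=0, second=0, microsecond=0)].append(value)
    return {hour: mean(values) for hour, values in sorted(buckets.items())}


start = datetime(2025, 2, 24, 9, 0)
# Illustrative minute-level response times alternating between 50 ms and 150 ms.
minute_points = [(start + timedelta(minutes=m), 50.0 if m % 2 == 0 else 150.0)
                 for m in range(120)]
hourly = hourly_averages(minute_points)
for hour, value in hourly.items():
    print(hour, value)
```

Two hours of values that swing between 50 and 150 collapse into two hourly points of 100.0 each, which is why an hourly baseline looks much flatter than the raw minute-level metric.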
+Now, you should see the predicted response times visualized alongside actual 
values, helping you analyze trends and configure dynamic alerting thresholds 
effectively.

Review Comment:
   @mrproliu I discussed with Kai, OAP is querying T-24H to T+24H time range 
baselines. So, if the predictor generates all baselines every 8 hours, then the 
range would be T-8H to T+16H, or T-1H to T+23H.
   We should state this clearly.
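
One way to read the ranges above, as a sketch: assume each Predictor run produces a 24-hour forecast starting at the time of the run, and that with an 8-hour schedule the latest run is between about 1 and 8 hours old when OAP queries. `HORIZON` and `baseline_window` are hypothetical names, and both assumptions are an interpretation of this comment, not confirmed behavior.

```python
from datetime import datetime, timedelta

HORIZON = timedelta(hours=24)  # assumed forecast length ("next day" in the post)


def baseline_window(generated_at: datetime, now: datetime):
    """Return the covered range as hour offsets relative to `now`."""
    one_hour = timedelta(hours=1)
    start = (generated_at - now) / one_hour
    end = (generated_at + HORIZON - now) / one_hour
    return start, end


now = datetime(2025, 2, 24, 16, 0)
print(baseline_window(now - timedelta(hours=8), now))  # (-8.0, 16.0)
print(baseline_window(now - timedelta(hours=1), now))  # (-1.0, 23.0)
```

Under these assumptions the baselines actually available cover only part of the T-24H to T+24H range that OAP queries.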



