This is an automated email from the ASF dual-hosted git repository.
wusheng pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/skywalking-website.git
The following commit(s) were added to refs/heads/master by this push:
new 2fd4fc9e51f Introduce `Improving Alert Accuracy with Dynamic
Baselines` blog (#772)
2fd4fc9e51f is described below
commit 2fd4fc9e51f22f65792fa60a0bbd13b39ff00b1a
Author: mrproliu <[email protected]>
AuthorDate: Tue Feb 25 10:56:41 2025 +0800
Introduce `Improving Alert Accuracy with Dynamic Baselines` blog (#772)
---
.../architecture.png | Bin 0 -> 23734 bytes
.../index.md | 217 +++++++++++++++++++++
.../predicted_widget.png | Bin 0 -> 143602 bytes
3 files changed, 217 insertions(+)
diff --git
a/content/blog/2025-02-24-improving-alert-accuracy-with-dynamic-baselines/architecture.png
b/content/blog/2025-02-24-improving-alert-accuracy-with-dynamic-baselines/architecture.png
new file mode 100644
index 00000000000..de128b0ce5a
Binary files /dev/null and
b/content/blog/2025-02-24-improving-alert-accuracy-with-dynamic-baselines/architecture.png
differ
diff --git
a/content/blog/2025-02-24-improving-alert-accuracy-with-dynamic-baselines/index.md
b/content/blog/2025-02-24-improving-alert-accuracy-with-dynamic-baselines/index.md
new file mode 100644
index 00000000000..a3aacc613fe
--- /dev/null
+++
b/content/blog/2025-02-24-improving-alert-accuracy-with-dynamic-baselines/index.md
@@ -0,0 +1,217 @@
+---
+title: "Improving Alert Accuracy with Dynamic Baselines"
+date: 2025-02-24
+author: "Han Liu"
+description: "This article explores how to leverage history metrics to
generate dynamic baselines for a future period, thereby enhancing the accuracy
of alerts."
+---
+
+## Background
+
+[Apache SkyWalking](https://skywalking.apache.org/) is an open-source
application performance monitoring (APM) system
+that collects various data from business applications, including metrics,
logs, and distributed tracing information,
+and visualizes them through its UI.
+It also allows users to configure alerting rules by setting threshold values
for specific metrics in the configuration file.
+When a metric associated with a particular service exceeds the predefined
threshold within a given period, an alert is triggered.
+
+However, in real-world scenarios, traffic patterns and invocation behaviors
vary across different time periods.
+For example, in a shopping system, the number of purchases is significantly
lower during late-night hours compared to daytime.
+As a result, system metrics fluctuate within different ranges depending on the
time of day.
+This makes it challenging to rely solely on static threshold values for
accurate alerting.
+
+Therefore, dynamically generating thresholds for each time period based on
historical data becomes crucial.
+
+## Introducing SkyAPM SkyPredictor
+
+Based on the above scenario, we developed the [SkyAPM SkyPredictor](https://github.com/SkyAPM/SkyPredictor/) project to address this problem.
+SkyAPM SkyPredictor periodically collects data from SkyWalking and generates
dynamic baselines.
+Meanwhile, SkyWalking queries SkyPredictor to obtain predicted metric values for the recent period, enabling more precise and adaptive alerting.
+
+NOTE: SkyWalking does not have a hard dependency on the SkyPredictor service.
+If SkyPredictor is not configured, no predicted values are retrieved, and SkyWalking continues to run without any failures.
+Additionally, you can use your own AI engine to build a custom prediction
system. Simply implement the required protocol as outlined in the official
documentation:
+https://skywalking.apache.org/docs/main/next/en/setup/ai-pipeline/metrics-baseline-integration/
+
+### Architecture diagram
+
+
+
+As shown in the diagram, the process consists of two steps:
+
+1. **Data Collection & Prediction**: The Predictor queries history metrics from SkyWalking's OAP via its HTTP service.
+   It then processes this data to generate dynamic predicted values for a future time period.
+2. **Baseline Query & Alerting**: The OAP periodically queries the Predictor to fetch the predicted dynamic baseline.
+   It then compares the current metric values against the predicted results using **MQE**. If the deviation exceeds a certain threshold, an alert is triggered.
+
+### Data Collection
+
+The Predictor utilizes the following three APIs to query data:
+
+1. [**Status
API**](https://skywalking.apache.org/docs/main/next/en/status/query_ttl_setup/):
Retrieves the TTL (Time-to-Live) of history data stored in OAP, helping to
determine the available time range for exporting all history metrics.
+2. [**Metadata
API**](https://skywalking.apache.org/docs/main/next/en/api/query-protocol/#v2-apis):
Fetches the list of services within a specified Layer from OAP, providing
insights into which services are generating data.
+3. [**MQE
API**](https://skywalking.apache.org/docs/main/next/en/api/metrics-query-expression/):
Iterates through the required metrics and the list of services to fetch all
history metrics values for each metric associated with each service.
+
+These APIs collectively enable the Predictor to gather history metrics data,
which is then used to compute dynamic baselines for future alerting.
+
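The collection loop described above can be sketched as follows. This is a minimal illustration, not the Predictor's actual code: `plan_queries` and its parameters are hypothetical names, the TTL is assumed to be expressed in days, and `service_cpm` is only used as a second example metric. It shows how the TTL bounds the time range and how every metric is paired with every service.

```python
from datetime import datetime, timedelta
from itertools import product


def plan_queries(ttl_days, services, metrics, now):
    """Build one query descriptor per (service, metric) pair.

    ttl_days -- retention of history metrics, as reported by the Status API
    services -- service names returned by the Metadata API
    metrics  -- metric names to fetch through the MQE API
    (all three are hypothetical inputs for this sketch)
    """
    # Align the range to whole hours and reach back as far as the TTL allows.
    end = now.replace(minute=0, second=0, microsecond=0)
    start = end - timedelta(days=ttl_days)
    return [
        {"service": service, "expression": metric, "start": start, "end": end}
        for service, metric in product(services, metrics)
    ]


queries = plan_queries(
    ttl_days=7,
    services=["e2e-test-dest-service"],
    metrics=["service_resp_time", "service_cpm"],
    now=datetime(2025, 2, 24, 10, 56),
)
for q in queries:
    print(q["service"], q["expression"], q["start"], "->", q["end"])
```

Each descriptor would then be turned into one MQE request against OAP; the request itself is left out because its exact shape depends on the query protocol.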
+### Prediction
+
+Once the Predictor has collected data from OAP, it forecasts future values using the open-source [Prophet library](https://github.com/facebook/prophet).
+The prediction process consists of the following steps:
+
+1. **Data Preparation**: The collected metric data is split into multiple
[DataFrames](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html),
each corresponding to a unique combination of service + metric name.
+2. **Data Sufficiency Check**: If a DataFrame contains less than **two days**
(configurable) of data, the prediction is skipped. This is to ensure accuracy,
as an insufficient data volume may lead to unreliable forecasts.
+3. **Forecasting**: Using Prophet, the Predictor estimates the metric values
for **each hour over the next day** (configurable).
+4. **Result Storage**: The generated predictions are stored in local files,
enabling querying from external services.
+
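The first three steps can be sketched with the standard library alone. Note the substitution: the Predictor uses Prophet, whereas this sketch stands in a much simpler model (the mean of the same hour of day, with a band of two standard deviations) so that it runs without extra dependencies. The function names, the row layout, and the two-sigma band are assumptions made for illustration; writing the results to local files is omitted.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, pstdev

MIN_HISTORY = timedelta(days=2)  # sufficiency threshold from step 2
FORECAST_HOURS = 24              # "each hour over the next day" from step 3


def group_series(samples):
    """Step 1: split (service, metric, timestamp, value) rows per service + metric."""
    series = defaultdict(list)
    for service, metric, ts, value in samples:
        series[(service, metric)].append((ts, value))
    return series


def forecast(points, now):
    """Steps 2-3: skip short series, otherwise forecast each of the next hours.

    Stand-in model (NOT Prophet): mean and spread of the same hour of day.
    """
    points = sorted(points)
    if points[-1][0] - points[0][0] < MIN_HISTORY:
        return None  # insufficient data, prediction skipped
    by_hour = defaultdict(list)
    for ts, value in points:
        by_hour[ts.hour].append(value)
    result = []
    for offset in range(1, FORECAST_HOURS + 1):
        target = now + timedelta(hours=offset)
        values = by_hour.get(target.hour)
        if not values:
            continue  # no history for this hour of day
        centre, spread = mean(values), pstdev(values)
        result.append({"time": target, "value": centre,
                       "lower": centre - 2 * spread, "upper": centre + 2 * spread})
    return result


now = datetime(2025, 2, 24, 0, 0)
# Three days of synthetic hourly samples for one service + metric.
history = [
    ("e2e-test-dest-service", "service_resp_time",
     now - timedelta(hours=h), 100 + 10 * ((now - timedelta(hours=h)).hour % 12))
    for h in range(1, 73)
]
for key, points in group_series(history).items():
    predicted = forecast(points, now)
    print(key, "skipped" if predicted is None else f"{len(predicted)} hourly predictions")
```

Each forecast entry carries a predicted value together with a lower and an upper bound for one future hour.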
+#### Predicted Value
+
+The Prediction service supports calculating the following two types of values:
+1. **Predicted Value**: Computes the expected metric value for the next hour
based on history metrics data.
+2. **Prediction Range**: Determines the possible **upper and lower bounds**
for the metric in the next hour, representing its expected fluctuation range.
+
+These values help establish a dynamic baseline, allowing the alerting system
to account for natural variations while accurately detecting anomalies.
+
+### Baseline MQE with Alarm
+
+In OAP, predicted values can be queried directly with the `baseline` operation in MQE. This operation retrieves forecasted values for a future time period.
+
+Since SkyWalking's alerting system already supports MQE expressions, users can configure baseline-based alerts directly in the alerting configuration file.
+
+For more details, please refer to the [official
documentation](https://skywalking.apache.org/docs/main/next/en/api/metrics-query-expression/#baseline-operation).
+
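To make the comparison concrete, here is a small Python sketch of the kind of check such a rule expresses: count how many recent values exceed the predicted upper bound and alert once that count reaches a limit. The function name `should_alert` and the limit of three breaches are assumptions for illustration; the real evaluation happens inside OAP through MQE.

```python
def should_alert(current, upper, min_breaches=3):
    """Return True when at least `min_breaches` values exceed the predicted upper bound.

    current -- recent metric values, one per time bucket
    upper   -- predicted upper bounds for the same time buckets
    """
    if len(current) != len(upper):
        raise ValueError("both series must cover the same time buckets")
    count = sum(1 for value, bound in zip(current, upper) if value > bound)
    return count >= min_breaches


upper_bound = [120, 120, 120, 120, 120]
print(should_alert([110, 130, 125, 118, 140], upper_bound))  # three values exceed 120
print(should_alert([110, 130, 115, 118, 119], upper_bound))  # only one value exceeds 120
```

Because the upper bound comes from the prediction instead of a fixed number, the same rule tolerates a busy daytime hour and still fires on an unusual late-night spike.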
+### Impact of Data Collection on Prediction Accuracy
+
+The Predictor service supports two data collection and prediction granularities, each with its own trade-offs in accuracy and resource consumption.
+
+1. Minute Level: Collects minute-level metrics data.
+   1. More effective for metrics with high fluctuations, as it captures finer details.
+   2. Consumes more resources (OAP and DB CPU and system load, Predictor CPU and memory).
+   3. Alerts are configured based on predicted range values.
+2. Hour Level: Collects hourly metrics data.
+   1. Less resource-intensive than minute-level collection.
+   2. Smaller data volume and lower processing cost.
+   3. Alerts are configured based on current value comparisons.
+
+| Granularity | Data Fluctuation | Data Volume | Current Value Prediction Accuracy | Range Prediction Accuracy | Best Use Case |
+|-------------|--------------------|-------------|-----------------------------------|---------------------------|------------------------------------------------------------------------|
+| Minute | Higher fluctuation | Large | Less accurate | More accurate | Ideal for highly fluctuating metrics, using range-based alerting rules |
+| Hour | Lower fluctuation | Small | More accurate | Relatively accurate | Suitable for stable metrics, using current value-based predictions |
+
+
+Choosing the appropriate granularity depends on the nature of the metric and
the desired alerting method.
+For metrics with high volatility, minute-level collection provides better
accuracy when using range-based alerts.
+For stable metrics, hourly aggregation is sufficient and allows for efficient
predictions using current-value comparisons.
+
+The Predictor uses hour-level granularity by default.
+
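The difference between the two granularities is easiest to see on a small example. The sketch below averages synthetic minute-level samples into hourly buckets; `to_hourly` is a hypothetical helper, and the alternating 80/120 values are made-up data chosen to show how hourly aggregation reduces both data volume and fluctuation.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean


def to_hourly(minute_points):
    """Average (timestamp, value) minute samples into one value per hour."""
    buckets = defaultdict(list)
    for ts, value in minute_points:
        buckets[ts.replace(minute=0, second=0, microsecond=0)].append(value)
    return {hour: mean(values) for hour, values in sorted(buckets.items())}


start = datetime(2025, 2, 24, 10, 0)
# Two hours of synthetic minute data that alternates between 80 and 120.
minute_points = [
    (start + timedelta(minutes=m), 80 if m % 2 == 0 else 120) for m in range(120)
]
hourly = to_hourly(minute_points)
print(len(minute_points), "minute points ->", len(hourly), "hourly points")
print(hourly)
```

The 120 fluctuating minute samples collapse into two flat hourly values, which is cheaper to forecast but hides the swings that a range-based rule would rely on.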
+### OAP and Predictor Scheduling & Caching
+
+Both SkyWalking OAP and SkyAPM Predictor implement caching strategies to
prevent excessive execution and optimize resource usage.
+
+By default, Predictor runs at 00:10, 08:10, and 16:10 every day. It forecasts
the next 24 hours and stores the results locally.
+Updating predictions every 8 hours balances resource efficiency and real-time
accuracy. The 10-minute delay
+(instead of running at exactly 00:00, 08:00, etc.) ensures historical data is
fully written to the database before querying.
+
+OAP queries Predictor for all required predicted metrics of a single service.
The query covers a ±24-hour time range from the current moment.
+Results are cached for one hour to reduce redundant queries and improve
efficiency.
+
+These mechanisms ensure that predictions remain up-to-date, while minimizing
unnecessary processing and system load.
+
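Both mechanisms can be sketched in a few lines of Python. The schedule and the one-hour cache lifetime are taken from the text above; the names `next_run` and `BaselineCache`, and the shape of the fetch callback, are hypothetical and only serve the illustration.

```python
from datetime import datetime, timedelta

RUN_TIMES = [(0, 10), (8, 10), (16, 10)]  # default Predictor schedule
CACHE_TTL = timedelta(hours=1)            # OAP-side cache lifetime


def next_run(now):
    """Return the first scheduled Predictor run strictly after `now`."""
    for day_offset in (0, 1):
        day = now + timedelta(days=day_offset)
        for hour, minute in RUN_TIMES:
            candidate = day.replace(hour=hour, minute=minute, second=0, microsecond=0)
            if candidate > now:
                return candidate
    raise AssertionError("unreachable: tomorrow's 00:10 is always later than now")


class BaselineCache:
    """Per-service cache that queries the Predictor again once an entry expires."""

    def __init__(self, fetch):
        self._fetch = fetch   # callable(service) -> predictions (hypothetical)
        self._entries = {}    # service -> (fetched_at, predictions)

    def get(self, service, now):
        entry = self._entries.get(service)
        if entry is None or now - entry[0] >= CACHE_TTL:
            entry = (now, self._fetch(service))
            self._entries[service] = entry
        return entry[1]


calls = []
cache = BaselineCache(lambda service: calls.append(service) or {"service": service})
start = datetime(2025, 2, 24, 9, 0)
cache.get("e2e-test-dest-service", start)
cache.get("e2e-test-dest-service", start + timedelta(minutes=30))  # served from cache
cache.get("e2e-test-dest-service", start + timedelta(hours=1))     # expired, fetched again
print("next run:", next_run(start), "| predictor queries:", len(calls))
```

In this run the Predictor is queried twice for three lookups, and a run requested at 09:00 waits until 16:10.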
+## Demo
+
+In this section, I will demonstrate how to preview the predicted values of a
metric by deploying a SkyWalking cluster along with the Predictor service
+in a Kubernetes cluster. This hands-on example will help you understand how to
use these components effectively.
+
+### Deploy SkyWalking Showcase
+
+SkyWalking Showcase contains a complete set of example services that can be monitored using SkyWalking.
+For more information, please check the [official
documentation](https://skywalking.apache.org/docs/skywalking-showcase/next/readme/).
+
+In this demo, we only deploy the predictor service, SkyWalking OAP, and UI.
+
+```shell
+export FEATURE_FLAGS=single-node,banyandb,baseline
+make deploy.kubernetes
+```
+
+### Import History Data
+
+Since a newly deployed cluster does not contain history data,
+I have created a Python script to simulate data. This allows the Predictor
service to import data and generate baseline predictions for a future period.
+
+Before importing data, you must expose the `11800` port of the OAP service in
your Kubernetes cluster.
+You can achieve this using kubectl by running the following command:
+
+```shell
+kubectl port-forward -n skywalking-showcase service/demo-oap 11800:11800
+```
+
+Then, you can download and run the demo script using the following command:
+
+```shell
+# clone the demo repository and enter it
+git clone https://github.com/mrproliu/SkyPredictorDemo && cd SkyPredictorDemo
+# install dependencies
+make install
+# import 7 days of simulated data through the forwarded OAP port
+python3 -m client.generate localhost:11800 7
+```
+
+When the import finishes, the console prints **Metrics send success!**.
+
+### Prediction metrics
+
+Since the Predictor service runs based on a **cron schedule**, it does not
automatically execute immediately after data import.
+To force it to collect data and perform a prediction, you can manually delete
the Predictor pod, prompting Kubernetes to restart it:
+
+```shell
+kubectl delete pod -n skywalking-showcase $(kubectl get pods -n skywalking-showcase --no-headers -o custom-columns=":metadata.name" | grep "skywalking-predictor")
+```
+
+Once the Predictor pod restarts, you can check its logs to confirm that the
prediction process has been completed.
+
+```
+Predicted for e2e-test-dest-service of service_xxx to xxxx-xx-xx xx:xx:xx.
+```
+
+### View in SkyWalking UI
+
+Once the prediction process is complete, you can visualize the predicted
values in the SkyWalking UI by configuring the appropriate metric widgets.
+
+First, run the following command to forward the UI service port to your local machine:
+
+```shell
+kubectl port-forward svc/demo-ui 8080:80 --namespace skywalking-showcase
+```
+
+Then, you can access this page to view the service traffic that was generated
using the Python script earlier:
+http://localhost:8080/dashboard/MESH/Service/ZTJlLXRlc3QtZGVzdC1zZXJ2aWNl.1/Mesh-Service
+
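The service segment of that URL is the Base64-encoded service name followed by `.1`. If you want to open the same dashboard for another service, the following sketch builds the segment; `service_id` is a hypothetical helper, and reading the trailing `1` as a "normal service" flag is an assumption rather than something stated in this article.

```python
import base64


def service_id(name, normal=True):
    """Encode a service name the way it appears in the dashboard URL."""
    encoded = base64.b64encode(name.encode("utf-8")).decode("ascii")
    # Assumption: the suffix is 1 for a normal service and 0 otherwise.
    return f"{encoded}.{1 if normal else 0}"


print(service_id("e2e-test-dest-service"))
```

For `e2e-test-dest-service` this reproduces the `ZTJlLXRlc3QtZGVzdC1zZXJ2aWNl.1` segment used in the URL.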
+To display predicted values, edit the Service Avg Resp Time Widget and add the
following MQE:
+
+```
+# The maximum predicted response time.
+baseline(service_resp_time, upper)
+# The predicted response time.
+baseline(service_resp_time, value)
+# The minimum predicted response time.
+baseline(service_resp_time, lower)
+```
+
+Finally, you can see the predicted values displayed in the widget.
+
+
+
+Since the default data collection is hourly and the metric has significant
fluctuations,
+the predicted values are derived from hourly averages rather than minute-level
granularity.
+This approach smooths out fluctuations and provides a more stable baseline for
monitoring.
+
+Now, you should see the predicted response times visualized alongside actual
values, helping you analyze trends and configure dynamic alerting thresholds
effectively.
+
+## Conclusion
+
+SkyAPM SkyPredictor enhances alert accuracy by using dynamic baselines instead
of static thresholds.
+It collects history metrics data, forecasts future values with Prophet, and
supports minute or hour-level collection for better precision.
+With predictions integrated into the SkyWalking UI, users can preview the baseline and tune their alerting rules against it.
+
+By integrating dynamic thresholds, SkyWalking can adapt to traffic patterns
and detect anomalies more effectively,
+reducing false positives and improving system observability.
diff --git
a/content/blog/2025-02-24-improving-alert-accuracy-with-dynamic-baselines/predicted_widget.png
b/content/blog/2025-02-24-improving-alert-accuracy-with-dynamic-baselines/predicted_widget.png
new file mode 100644
index 00000000000..6cc27778b8b
Binary files /dev/null and
b/content/blog/2025-02-24-improving-alert-accuracy-with-dynamic-baselines/predicted_widget.png
differ