This is an automated email from the ASF dual-hosted git repository. hanahmily pushed a commit to branch doc-metrics in repository https://gitbox.apache.org/repos/asf/skywalking-banyandb.git
commit 993c39a086db3feb380aa7090c89687821b950c6
Author: Gao Hongtao <[email protected]>
AuthorDate: Tue Sep 24 14:18:14 2024 +0800

    Add metrics section in ob doc

    Signed-off-by: Gao Hongtao <[email protected]>
---
 docs/operation/observability.md | 234 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 233 insertions(+), 1 deletion(-)

diff --git a/docs/operation/observability.md b/docs/operation/observability.md
index 7119b4cf..8a90b180 100644
--- a/docs/operation/observability.md
+++ b/docs/operation/observability.md
@@ -24,6 +24,238 @@ When query tracing is enabled, the slow query log won't be generated.

 ## Metrics

+BanyanDB exposes metrics for monitoring and analysis. The expressions in this section use variables such as `$job` and `$instance`: `$job` is the name of the BanyanDB collection job, and `$instance` is the name of the BanyanDB instance.
+
+`$__rate_interval` is a variable that represents the rate interval, the time window over which the rate of a metric is calculated.
+
+### Stats
+
+`Stats` metrics are used to monitor the overall status of BanyanDB. The following metrics are available:
+
+#### Write Rate
+
+The write rate is the number of write operations per second. It is calculated by summing the write rates of measures and streams.
+
+**Expression**: `sum(rate(banyandb_measure_total_written{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval])) + sum(rate(banyandb_stream_tst_total_written{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval]))`
+
+#### Total Memory
+
+The total memory is the total memory available on the system.
+
+**Expression**: `sum(banyandb_system_memory_state{job=~\"$job\", instance=~\"$instance\",kind=\"total\"})`
+
+#### Disk Usage
+
+The disk usage is the percentage of disk space used. If the disk usage is over 80%, it may indicate that the disk is almost full.
+
+**Expression**: `sum(banyandb_system_disk{job=~\"$job\", instance=~\"$instance\",kind=\"used\"})`
+
+#### Query Rate
+
+The query rate is the number of query operations per second, measured on the liaison server.
+
+**Expression**: `sum(rate(banyandb_liaison_grpc_total_started{job=~\"$job\", instance=~\"$instance\", method=\"query\"}[$__rate_interval]))`
+
+#### Total CPU
+
+The total CPU is the total number of CPUs available on the system.
+
+**Expression**: `sum(banyandb_system_cpu_num{job=~\"$job\", instance=~\"$instance\"})`
+
+#### Write and Query Errors Rate
+
+The write and query errors rate is the number of write and query errors per minute. It is calculated by summing the write and query error rates from the liaison and data servers.
+
+**Expression**: `sum(rate(banyandb_liaison_grpc_total_err{job=~\"$job\",instance=~\"$instance\",method=\"query\"}[$__rate_interval])*60) + sum(rate(banyandb_liaison_grpc_total_stream_msg_sent_err{job=~\"$job\",instance=~\"$instance\"}[$__rate_interval])*60) + sum(rate(banyandb_liaison_grpc_total_stream_msg_received_err{job=~\"$job\",instance=~\"$instance\"}[$__rate_interval])*60) + sum(rate(banyandb_queue_sub_total_msg_sent_err{job=~\"$job\",instance=~\"$instance\"}[$__rate_interval])*60)`
+
+#### Etcd Operation Rate
+
+The etcd operation rate is the number of etcd operations per second. It is calculated by summing the rates of all etcd operations.
+
+**Expression**: `sum(rate(banyandb_liaison_grpc_total_registry_started{job=~\"$job\",instance=~\"$instance\"}[$__rate_interval])) + sum(rate(banyandb_liaison_grpc_total_started{job=~\"$job\",instance=~\"$instance\",method!=\"query\"}[$__rate_interval]))`
+
+#### Active Instances
+
+The active instances metric is the number of active instances in the BanyanDB cluster.
+
+**Expression**: `sum(min_over_time(up{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval])) by (job)`
+
+### Resource Usage
+
+`Resource Usage` metrics are used to monitor the resource usage of BanyanDB on the node. The following metrics are available:
+
+#### CPU Usage
+
+The CPU usage is the percentage of CPU used. If the CPU usage is over 80%, it may indicate that the CPU is overloaded.
+
+**Expression**: `max(rate(process_cpu_seconds_total{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval]) / banyandb_system_cpu_num{job=~\"$job\", instance=~\"$instance\"}) by (job)`
+
+#### RSS Memory Usage
+
+The RSS memory usage is the percentage of resident memory used. If the memory usage is over 80%, it may indicate that the memory is almost full.
+
+**Expression**: `max(max_over_time(process_resident_memory_bytes{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval]) / sum(banyandb_system_memory_state{job=~\"$job\", instance=~\"$instance\", kind=\"total\"}) by (job,instance)) by(job)`
+
+#### Disk Usage
+
+The disk usage is the percentage of disk space used. If the disk usage is over 80%, it may indicate that the disk is almost full.
+
+**Expression**: `max(sum(banyandb_system_disk{job=~\"$job\", instance=~\"$instance\", kind=\"used\"}) / sum(banyandb_system_memory_state{job=~\"$job\", instance=~\"$instance\", kind=\"total\"})) by (job)`
+
+#### Network Usage
+
+The network usage is the number of bytes sent and received per second.
+
+**Expression1**: `sum(rate(banyandb_system_net_state{job=~\"$job\",instance=~\"$instance\",kind=\"bytes_recv\"}[$__rate_interval])) by (name)`
+
+**Expression2**: `sum(rate(banyandb_system_net_state{job=~\"$job\",instance=~\"$instance\",kind=\"bytes_sent\"}[$__rate_interval])) by (name)`
+
+### Storage
+
+`Storage` metrics are used to monitor the storage status of BanyanDB. The following metrics are available:
+
+#### Write Rate
+
+The write rate is the number of write operations per second. It is calculated by summing the write rates of measures and streams. It's grouped by the `group` tag.
+
+You can compare the write rate of different instances to find the hot instance.
+
+**Expression**: `sum(rate(banyandb_measure_total_written{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval])) by (group) + sum(rate(banyandb_stream_tst_total_written{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval])) by (group)`
+
+#### Query Latency
+
+The query latency is the average query latency in seconds. It is calculated by dividing the total query latency by the total number of queries.
+
+You can compare the query latency of different instances to find the instance with high query latency. Because BanyanDB queries all instances, a single instance with high query latency affects the overall query latency.
+
+**Expression**: `sum(rate(banyandb_liaison_grpc_total_latency{job=~\"$job\", instance=~\"$instance\",method=\"query\"}[$__rate_interval])) by( group) / sum(rate(banyandb_liaison_grpc_total_started{job=~\"$job\", instance=~\"$instance\",method=\"query\"}[$__rate_interval])) by (group)`
+
+#### Total Data
+
+The total data is the total number of data points stored in BanyanDB. It's grouped by the `group` tag.
+
+You can compare the total data of different instances to find the instance that stores the most data points. If the difference between instances is too large, it may indicate that the data is not evenly distributed.
+
+**Expression1**: `sum(banyandb_measure_total_file_elements{job=~\"$job\",instance=~\"$instance\"})by(group)`
+**Expression2**: `sum(banyandb_stream_tst_total_file_elements{job=~\"$job\",instance=~\"$instance\"})by(group)`
+
+#### Merge File Rate
+
+The merge file rate is the number of merge file operations per minute. It is calculated by summing the rates of merge file operations. It's grouped by the `group` tag.
+
+If the value surges, it may indicate that too many small files are being merged. This may cause the following problems:
+
+- Increase the disk I/O
+- Slow down the query performance
+- Increase the CPU usage
+
+**Expression1**: `sum(rate(banyandb_measure_total_merge_loop_started{job=~\"$job\",instance=~\"$instance\"}[$__rate_interval]))by(group) * 60`
+**Expression2**: `sum(rate(banyandb_stream_tst_total_merge_loop_started{job=~\"$job\",instance=~\"$instance\"}[$__rate_interval]))by(group) * 60`
+
+#### Merge File Latency
+
+The merge file latency is the average merge file latency in seconds. It is calculated by dividing the total merge file latency by the total number of merge file operations. It's grouped by the `group` tag.
+
+If the value surges, it may indicate that the merge file operation is slow, possibly because of high disk I/O or other resource usage. This may cause the following problems:
+
+- Slow down the query performance
+- Increase the CPU usage
+- Increase the memory usage
+
+**Expression1**: `sum(rate(banyandb_measure_total_merge_latency{job=~\"$job\", instance=~\"$instance\",type=\"file\"}[$__rate_interval]))by(group) / sum(rate(banyandb_measure_total_merge_loop_started{job=~\"$job\",instance=~\"$instance\"}[$__rate_interval]))by(group)`
+**Expression2**: `sum(rate(banyandb_stream_tst_total_merge_latency{job=~\"$job\", instance=~\"$instance\",type=\"file\"}[$__rate_interval]))by(group) / sum(rate(banyandb_stream_tst_total_merge_loop_started{job=~\"$job\",instance=~\"$instance\"}[$__rate_interval]))by(group)`
+
+#### Merge File Partitions
+
+The merge file partitions is the average number of partitions merged per merge file operation. It is calculated by dividing the total number of partitions merged by the total number of merge file operations. It's grouped by the `group` tag.
+
+If the value surges, it may indicate that too many partitions are being merged. This may be because the number of partitions is too large, which indicates that the server is under a high write load.
+
+**Expression1**: `sum(rate(banyandb_measure_total_merged_parts{job=~\"$job\", instance=~\"$instance\",type=\"file\"}[$__rate_interval]))by(group) / sum(rate(banyandb_measure_total_merge_loop_started{job=~\"$job\",instance=~\"$instance\"}[$__rate_interval]))by(group)`
+
+**Expression2**: `sum(rate(banyandb_stream_tst_total_merged_parts{job=~\"$job\", instance=~\"$instance\",type=\"file\"}[$__rate_interval]))by(group) / sum(rate(banyandb_stream_tst_total_merge_loop_started{job=~\"$job\",instance=~\"$instance\"}[$__rate_interval]))by(group)`
+
+#### Series Write Rate
+
+The series write rate is the number of series write operations per second. It is calculated by summing the series write rates of measures and streams. It's grouped by the `group` tag.
+
+If the value surges, it may indicate that old series are frequently being updated by new series. This may be caused by high series cardinality and may cause the following problems:
+
+- Increase the series inverted index size
+- Slow down the query performance
+
+**Expression1**: `sum(rate(banyandb_measure_inverted_index_total_updates{job=~\"$job\",instance=~\"$instance\"}[$__rate_interval])) by (group)`
+**Expression2**: `sum(rate(banyandb_stream_storage_inverted_index_total_updates{job=~\"$job\",instance=~\"$instance\"}[$__rate_interval])) by (group)`
+
+#### Series Term Search Rate
+
+The series term search rate is the number of series term search operations per second. It is calculated by summing the series term search rates of measures and streams. It's grouped by the `group` tag.
+
+If the value is too large, it may indicate that read operations fetch too many series. This may be caused by high series cardinality and may cause the following problems:
+
+- Slow down the query performance
+- Increase the CPU usage
+- Increase the memory usage
+
+**Expression1**: `sum(rate(banyandb_stream_storage_inverted_index_total_term_searchers_started{job=~\"$job\",instance=~\"$instance\"}[$__rate_interval])) by (group)`
+**Expression2**: `sum(rate(banyandb_measure_inverted_index_total_term_searchers_started{job=~\"$job\",instance=~\"$instance\"}[$__rate_interval])) by (group)`
+
+#### Total Series
+
+The total series is the total number of series stored in BanyanDB. It's grouped by the `group` tag.
+
+If the value is too large, it may indicate high series cardinality. This may cause the following problems:
+
+- Increase the series inverted index size
+- Slow down the query performance
+
+**Expression1**: `sum(banyandb_measure_inverted_index_total_doc_count{job=~\"$job\",instance=~\"$instance\"}) by (group)`
+**Expression2**: `sum(banyandb_stream_storage_inverted_index_total_doc_count{job=~\"$job\",instance=~\"$instance\"}) by (group)`
+
+### Stream Inverted Index
+
+`Stream Inverted Index` metrics are used to monitor the stream inverted index status of BanyanDB. The following metrics are available:
+
+#### Stream Inverted Index Write Rate
+
+The write rate is the number of write operations per second. It is calculated by summing the write rates of streams. It's grouped by the `group` tag.
+
+If the value is too large, it may indicate that too many data points are being indexed, which may cause the following problems:
+
+- Increase the inverted index size
+- Slow down the query performance
+- Increase the CPU usage
+- Increase the memory usage
+
+**Expression**: `sum(rate(banyandb_stream_tst_inverted_index_total_updates{job=~\"$job\",instance=~\"$instance\"}[$__rate_interval])) by (group)`
+
+#### Term Search Rate
+
+The term search rate is the number of term search operations per second. It is calculated by summing the term search rates of streams. It's grouped by the `group` tag.
+
+If the value is too large, it may indicate that read operations fetch too many data points. This may cause the following problems:
+
+- Slow down the query performance
+- Increase the CPU usage
+- Increase the memory usage
+
+**Expression**: `sum(rate(banyandb_stream_tst_inverted_index_total_term_searchers_started{job=~\"$job\",instance=~\"$instance\"}[$__rate_interval])) by (group)`
+
+#### Total Documents
+
+The total documents is the total number of documents stored in the stream inverted index. It's grouped by the `group` tag.
+
+If the value is too large, it may indicate that too many data points are being indexed, which may cause the following problems:
+
+- Increase the inverted index size
+- Slow down the query performance
+- Increase the CPU usage
+- Increase the memory usage
+
+**Expression**: `sum(banyandb_stream_tst_inverted_index_total_doc_count{job=~\"$job\",instance=~\"$instance\"}) by (group)`
+
+## Metrics Providers
+
 BanyanDB has built-in support for metrics collection. Currently, there are two supported metrics providers: `prometheus` and `native`. These can be enabled through the `observability-modes` flag, allowing you to activate one or both of them.

 ### Prometheus
@@ -32,7 +264,7 @@ Prometheus is auto enabled at run time, if no flag is passed or if `promethus` i

 When the Prometheus metrics provider is enabled, the metrics server listens on port `2121`. This allows Prometheus to scrape metrics data from BanyanDB for monitoring and analysis.

-### Self-observability
+### Native

 If the `observability-modes` flag is set to `native`, the self-observability metrics provider will be enabled. Some of the metrics will be displayed in the dashboard of [banyandb-ui](http://localhost:17913/)
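
The expressions added in the Metrics section above are written with the placeholders `$job`, `$instance`, and `$__rate_interval` and with backslash-escaped quotes, so they cannot be pasted into a query console unchanged. The following is a minimal sketch of turning one of them into a plain query string. The helper name `render_expression` and the sample values (`banyandb`, `.*`, `1m`) are assumptions made for illustration and are not part of this commit.

```python
import re
from typing import Mapping


def render_expression(expr: str, variables: Mapping[str, str]) -> str:
    """Return a plain query string for a dashboard-style expression.

    Hypothetical helper: it only performs text substitution and does not
    validate the resulting query.
    """
    # Undo the backslash escaping of double quotes used in the doc.
    plain = expr.replace('\\"', '"')
    # Replace every $name placeholder; an unknown name raises KeyError.
    return re.sub(r"\$(\w+)", lambda m: variables[m.group(1)], plain)


# The "Write Rate" expression from the Stats section, copied verbatim.
WRITE_RATE = (
    r'sum(rate(banyandb_measure_total_written{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval]))'
    r' + sum(rate(banyandb_stream_tst_total_written{job=~\"$job\", instance=~\"$instance\"}[$__rate_interval]))'
)

# Assumed sample values: a job named "banyandb", all instances, 1 minute window.
query = render_expression(
    WRITE_RATE,
    {"job": "banyandb", "instance": ".*", "__rate_interval": "1m"},
)
print(query)
```

Because `job` and `instance` are matched with `=~`, the values are treated as regular expressions, so `.*` selects every instance of the job.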
