This is an automated email from the ASF dual-hosted git repository.
hanahmily pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/skywalking-banyandb.git
The following commit(s) were added to refs/heads/main by this push:
new 4421b30de fix(lifecycle): resolve sender identity and add lifecycle
migration dashboard (#1178)
4421b30de is described below
commit 4421b30de5644c53d6e5e7473f2bd6297ab1531d
Author: Gao Hongtao <[email protected]>
AuthorDate: Tue Jun 16 08:57:35 2026 +0800
fix(lifecycle): resolve sender identity and add lifecycle migration
dashboard (#1178)
* docs(grafana): add lifecycle migration dashboard; move Tier Migrations
panel from nodes dashboard
- Add docs/operation/grafana-fodc-migration.json: dedicated dashboard for
lifecycle migration health, cycle status, throughput, latency, and
pub↔sub drift across all migration flows
- Remove Tier Migrations panel (62) from grafana-fodc-nodes.json; keep
Streaming Flows panel (61) there; close the layout gap
- Fix file-sync sub pairing: use total_finished{operation="file-sync"}
since chunked_sync receiver has no per-message counter
- Drop $role/$pod coupling; migration dashboard uses $job/$group/$operation
vars derived from lifecycle container metrics
---
CHANGES.md | 2 +-
banyand/backup/lifecycle/service.go | 8 -
banyand/backup/lifecycle/steps.go | 2 +-
...fodc-nodes.json => grafana-fodc-migration.json} | 2062 +++++++-------------
docs/operation/grafana-fodc-nodes.json | 562 +-----
test/cases/lifecycle/lifecycle.go | 16 +-
test/e2e-v2/cases/fodc/metrics/documented_gap.txt | 3 +
7 files changed, 747 insertions(+), 1908 deletions(-)
diff --git a/CHANGES.md b/CHANGES.md
index 77f4d946d..d0d7ee8d6 100644
--- a/CHANGES.md
+++ b/CHANGES.md
@@ -9,7 +9,7 @@ Release Notes.
- Redesign the queue (`queue_pub`/`queue_sub`) metrics around a uniform model:
keep only `total_started`, `total_finished`, `total_latency` (now a histogram)
and `total_err`, plus file-sync-only `sent_bytes` (pub) / `received_bytes`
(sub). Replace the `topic` label with `operation`
(`batch-write`/`file-sync`/`query`/`control`) and `group`, add an `error_type`
label on `total_err`, and add remote-endpoint labels
(`remote_node`/`remote_role`/`remote_tier`) so the liaison↔data (hot/warm/col
[...]
- Stamp the lifecycle's tier-migration publisher's identity onto the wire so
the receiving data node records a non-empty
`remote_node`/`remote_role`/`remote_tier` on its
`banyandb_queue_sub_total_finished` series. The lifecycle's `parseGroup`
resolves the lifecycle's self identity by matching the lifecycle pod's hostname
(POD_NAME via the K8s downward API, falling back to `os.Hostname()` — same
precedence as `nativeNodeContext` at `banyand/backup/lifecycle/service.go`)
against the data-n [...]
- Add `banyandb_lifecycle_last_run_timestamp_seconds` and
`banyandb_lifecycle_last_run_success` gauges to the lifecycle service for
at-a-glance health monitoring. `last_run_timestamp_seconds` records the
wall-clock epoch (in seconds) of the most recent migration cycle;
`last_run_success` is `1` on a nil error and `0` otherwise. Both are stamped by
a `defer` at the end of `action()` so every return path (success, error,
recovered panic) updates the pair atomically — dashboards can pin an [...]
-- Refactor the lifecycle cycle-level metrics
(`banyandb_lifecycle_cycles_total`,
`banyandb_lifecycle_last_run_timestamp_seconds`,
`banyandb_lifecycle_last_run_success`) to carry labels `remote_node`,
`remote_role`, `remote_tier`, `group`. The label form mirrors the per-message
`banyandb_lifecycle_migration_*` family emitted by the queue/pub lifecycle
publisher, but the two families describe DIFFERENT things: the cycle-level
series describe the SENDER (the lifecycle pod's co-located data [...]
+- Refactor the lifecycle cycle-level metrics
(`banyandb_lifecycle_cycles_total`,
`banyandb_lifecycle_last_run_timestamp_seconds`,
`banyandb_lifecycle_last_run_success`) to carry labels `remote_node`,
`remote_role`, `remote_tier`, `group`. The label form mirrors the per-message
`banyandb_lifecycle_migration_*` family emitted by the queue/pub lifecycle
publisher, but the two families describe DIFFERENT things: the cycle-level
series describe the SENDER (the lifecycle pod's co-located data [...]
- Remove `banyandb_lifecycle_self_identity_resolution_total`. The regression-detection role moves to the now-labeled `banyandb_lifecycle_cycles_total{remote_node!=""}` (an empty `remote_node` series means the registry match failed for every group, the bug the old counter caught), plus the existing receiver-side count of empty `remote_node` on lifecycle `banyandb_queue_sub_total_finished` series. The wire-level `cluster.v1.SendRequest` sender-identity fields are unchanged.
- Vectorized measure query path is now enabled by default. The columnar
pipeline replaces per-row protobuf serialization in `NewMIterator`, cutting
allocations and ns/op for scan-heavy measure queries; gRPC wire format
(`*measurev1.InternalDataPoint`) is byte-identical. Single-node coverage is
complete: scan, GroupBy+Agg via `BatchAggregation`, scalar reduce (`Agg`
without `GroupBy`), raw `GroupBy` (without `Agg`), implicit projection coverage
for GroupBy/Agg fields, `TopN`/`BottomN`, `o [...]
- Add validation to ensure Measure's ShardingKey contains all Entity tags to guarantee entity locality.
diff --git a/banyand/backup/lifecycle/service.go b/banyand/backup/lifecycle/service.go
index 19338f1b1..9cffac450 100644
--- a/banyand/backup/lifecycle/service.go
+++ b/banyand/backup/lifecycle/service.go
@@ -260,7 +260,6 @@ func (l *lifecycleService) PreRun(_ context.Context) error {
l.cyclesTotal = lifecycleScope.NewCounter("cycles_total", cycleLabels...)
l.lastRunTimestamp = lifecycleScope.NewGauge("last_run_timestamp_seconds", cycleLabels...)
l.lastRunSuccess = lifecycleScope.NewGauge("last_run_success", cycleLabels...)
-
if l.schedule != "" && l.lifecycleTLS {
var err error
l.tlsReloader, err = pkgtls.NewReloader(l.lifecycleCertFile,
l.lifecycleKeyFile, l.l)
@@ -572,13 +571,6 @@ func (l *lifecycleService) action(ctx context.Context) (err error) {
l.lastRunNode = ""
l.lastRunRole = ""
l.lastRunTier = ""
- // Do NOT reset the emittedLastRun* fields here — they carry the
- // (group, remote_*) tuple of the last series actually Set on
- // Prometheus, which recordLastRun needs to Delete in the next
- // cycle so the previous cycle's series doesn't accumulate as a
- // stale labeled gauge. An empty cycle still has a previous emitted
- // tuple to clean up; the new cycle's Set will then re-stamp with
- // the current (possibly empty) labels.
// Stamp last-run metrics at the end of this cycle regardless of outcome.
// Using defer keeps the success/error bookkeeping in one place even as
// the body grows new early returns; the metrics gauge Set()s observe
diff --git a/banyand/backup/lifecycle/steps.go b/banyand/backup/lifecycle/steps.go
index 0bcf801e3..72d4d5b30 100644
--- a/banyand/backup/lifecycle/steps.go
+++ b/banyand/backup/lifecycle/steps.go
@@ -367,7 +367,7 @@ func parseGroup(
selfHost := selfPodHostname()
senderNode, senderTier, resolvedOK := resolveSelfIdentity(selfHost, nodes)
if resolvedOK {
- senderRole = "lifecycle"
+ senderRole = lifecycleRoleName
client.SetSelfNode(senderNode, senderRole, senderTier)
// Info log so operators can see which identity the agent
// stamped on the wire, and which co-located data pod the
diff --git a/docs/operation/grafana-fodc-nodes.json b/docs/operation/grafana-fodc-migration.json
similarity index 58%
copy from docs/operation/grafana-fodc-nodes.json
copy to docs/operation/grafana-fodc-migration.json
index 404b0fa91..50f9e26a8 100644
--- a/docs/operation/grafana-fodc-nodes.json
+++ b/docs/operation/grafana-fodc-migration.json
@@ -72,9 +72,9 @@
"x": 0,
"y": 0
},
- "id": 1,
+ "id": 100,
"panels": [],
- "title": "Fleet Overview",
+ "title": "Migration Health Overview",
"type": "row"
},
{
@@ -82,7 +82,79 @@
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
- "description": "Reporting node count split by role (liaison / data).",
+ "description": "Time since the most recent lifecycle cycle completed across all source pods/groups. Daily-batch SLA: red if older than 26h.",
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "thresholds"
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ },
+ {
+ "color": "red",
+ "value": 93600
+ }
+ ]
+ },
+ "unit": "s"
+ },
+ "overrides": []
+ },
+ "gridPos": {
+ "h": 4,
+ "w": 4,
+ "x": 0,
+ "y": 1
+ },
+ "id": 101,
+ "options": {
+ "colorMode": "value",
+ "graphMode": "none",
+ "justifyMode": "auto",
+ "orientation": "auto",
+ "percentChangeColorMode": "standard",
+ "reduceOptions": {
+ "calcs": [
+ "lastNotNull"
+ ],
+ "fields": "",
+ "values": false
+ },
+ "showPercentChange": false,
+ "textMode": "auto",
+ "wideLayout": true
+ },
+ "pluginVersion": "11.2.0",
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${DS_PROMETHEUS}"
+ },
+ "editorMode": "code",
+ "exemplar": false,
+ "expr": "time() - max(banyandb_lifecycle_last_run_timestamp_seconds{job=~\"$job\", group=~\"$group\"})",
+ "instant": false,
+ "legendFormat": "__auto",
+ "range": true,
+ "refId": "A"
+ }
+ ],
+ "title": "Last Migration (age)",
+ "type": "stat"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${DS_PROMETHEUS}"
+ },
+ "description": "Lifecycle migration cycles completed in the last 24h (sum of increase over banyandb_lifecycle_cycles_total).",
"fieldConfig": {
"defaults": {
"color": {
@@ -103,12 +175,12 @@
"overrides": []
},
"gridPos": {
- "h": 5,
- "w": 6,
- "x": 6,
+ "h": 4,
+ "w": 4,
+ "x": 4,
"y": 1
},
- "id": 3,
+ "id": 102,
"options": {
"colorMode": "value",
"graphMode": "none",
@@ -134,14 +206,15 @@
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
- "expr": "count(banyandb_system_up_time{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\"}) by (container_name)",
+ "exemplar": false,
+ "expr": "sum(increase(banyandb_lifecycle_cycles_total{job=~\"$job\", group=~\"$group\"}[24h]))",
"instant": false,
- "legendFormat": "{{container_name}}",
+ "legendFormat": "__auto",
"range": true,
"refId": "A"
}
],
- "title": "Nodes by Role",
+ "title": "Cycles (24h)",
"type": "stat"
},
{
@@ -149,7 +222,7 @@
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
- "description": "Total CPU cores across the selected nodes.",
+ "description": "Number of source pods whose most recent migration cycle reported failure (last_run_success==0). Red if any.",
"fieldConfig": {
"defaults": {
"color": {
@@ -162,6 +235,10 @@
{
"color": "green",
"value": null
+ },
+ {
+ "color": "red",
+ "value": 1
}
]
},
@@ -170,12 +247,12 @@
"overrides": []
},
"gridPos": {
- "h": 5,
- "w": 6,
- "x": 12,
+ "h": 4,
+ "w": 4,
+ "x": 8,
"y": 1
},
- "id": 4,
+ "id": 103,
"options": {
"colorMode": "value",
"graphMode": "none",
@@ -201,14 +278,15 @@
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
- "expr": "sum(banyandb_system_cpu_num{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\"})",
+ "exemplar": false,
+ "expr": "count(banyandb_lifecycle_last_run_success{job=~\"$job\", group=~\"$group\"} == 0) or vector(0)",
"instant": false,
"legendFormat": "__auto",
"range": true,
"refId": "A"
}
],
- "title": "Total CPU Cores",
+ "title": "Last-run Failures",
"type": "stat"
},
{
@@ -216,7 +294,7 @@
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
- "description": "Total used physical memory across the selected nodes.",
+ "description": "Total bytes sent by the lifecycle publisher in the last 24h (file-sync sent_bytes).",
"fieldConfig": {
"defaults": {
"color": {
@@ -237,12 +315,12 @@
"overrides": []
},
"gridPos": {
- "h": 5,
- "w": 6,
- "x": 18,
+ "h": 4,
+ "w": 4,
+ "x": 12,
"y": 1
},
- "id": 5,
+ "id": 104,
"options": {
"colorMode": "value",
"graphMode": "none",
@@ -268,14 +346,15 @@
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
- "expr": "sum(banyandb_system_memory_state{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\",kind=\"used\"})",
+ "exemplar": false,
+ "expr": "sum(increase(banyandb_lifecycle_migration_sent_bytes{job=~\"$job\", group=~\"$group\"}[24h]))",
"instant": false,
"legendFormat": "__auto",
"range": true,
"refId": "A"
}
],
- "title": "Total Memory Used",
+ "title": "Data Migrated (24h)",
"type": "stat"
},
{
@@ -283,7 +362,7 @@
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
- "description": "Total used disk across the selected nodes and all storage paths.",
+ "description": "Total lifecycle migration errors in the last 24h. Red if any.",
"fieldConfig": {
"defaults": {
"color": {
@@ -296,20 +375,92 @@
{
"color": "green",
"value": null
+ },
+ {
+ "color": "red",
+ "value": 1
}
]
},
- "unit": "bytes"
+ "unit": "short"
},
"overrides": []
},
"gridPos": {
- "h": 5,
- "w": 6,
- "x": 0,
+ "h": 4,
+ "w": 4,
+ "x": 16,
+ "y": 1
+ },
+ "id": 105,
+ "options": {
+ "colorMode": "value",
+ "graphMode": "none",
+ "justifyMode": "auto",
+ "orientation": "auto",
+ "percentChangeColorMode": "standard",
+ "reduceOptions": {
+ "calcs": [
+ "lastNotNull"
+ ],
+ "fields": "",
+ "values": false
+ },
+ "showPercentChange": false,
+ "textMode": "auto",
+ "wideLayout": true
+ },
+ "pluginVersion": "11.2.0",
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${DS_PROMETHEUS}"
+ },
+ "editorMode": "code",
+ "exemplar": false,
+ "expr": "sum(increase(banyandb_lifecycle_migration_total_err{job=~\"$job\"}[24h])) or vector(0)",
+ "instant": false,
+ "legendFormat": "__auto",
+ "range": true,
+ "refId": "A"
+ }
+ ],
+ "title": "Migration Errors (24h)",
+ "type": "stat"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${DS_PROMETHEUS}"
+ },
+ "description": "Distinct source pods currently emitting lifecycle migration metrics.",
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "thresholds"
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ }
+ ]
+ },
+ "unit": "short"
+ },
+ "overrides": []
+ },
+ "gridPos": {
+ "h": 4,
+ "w": 4,
+ "x": 20,
"y": 1
},
- "id": 9,
+ "id": 106,
"options": {
"colorMode": "value",
"graphMode": "none",
@@ -335,14 +486,15 @@
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
- "expr": "sum(banyandb_system_disk{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\",kind=\"used\"})",
+ "exemplar": false,
+ "expr": "count(count by (pod_name)(banyandb_lifecycle_migration_total_finished{job=~\"$job\"}))",
"instant": false,
"legendFormat": "__auto",
"range": true,
"refId": "A"
}
],
- "title": "Total Disk Used",
+ "title": "Active Source Pods",
"type": "stat"
},
{
@@ -351,11 +503,11 @@
"h": 1,
"w": 24,
"x": 0,
- "y": 6
+ "y": 5
},
- "id": 11,
+ "id": 110,
"panels": [],
- "title": "Per-node Health",
+ "title": "Cycle Status",
"type": "row"
},
{
@@ -363,7 +515,7 @@
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
- "description": "Per-node health snapshot: role, uptime, CPU cores used, RSS, system memory %, and disk % (across all paths).",
+ "description": "Completed migration cycles per source pod and group (banyandb_lifecycle_cycles_total). This counter is attributed to the SOURCE node only \u2014 its remote_* labels carry the lifecycle node's own (sender) identity, not the migration destination, so there is no destination column here. A single cycle may fan out to multiple destination tiers; see the Flows table for the per-flow source\u2192dest breakdown.",
"fieldConfig": {
"defaults": {
"color": {
@@ -387,138 +539,15 @@
]
}
},
- "overrides": [
- {
- "matcher": {
- "id": "byName",
- "options": "Uptime"
- },
- "properties": [
- {
- "id": "unit",
- "value": "s"
- }
- ]
- },
- {
- "matcher": {
- "id": "byName",
- "options": "CPU (cores)"
- },
- "properties": [
- {
- "id": "unit",
- "value": "short"
- },
- {
- "id": "decimals",
- "value": 2
- }
- ]
- },
- {
- "matcher": {
- "id": "byName",
- "options": "RSS"
- },
- "properties": [
- {
- "id": "unit",
- "value": "bytes"
- }
- ]
- },
- {
- "matcher": {
- "id": "byName",
- "options": "Mem %"
- },
- "properties": [
- {
- "id": "unit",
- "value": "percent"
- },
- {
- "id": "thresholds",
- "value": {
- "mode": "absolute",
- "steps": [
- {
- "color": "green",
- "value": null
- },
- {
- "color": "red",
- "value": 80
- }
- ]
- }
- },
- {
- "id": "color",
- "value": {
- "mode": "thresholds"
- }
- },
- {
- "id": "custom.cellOptions",
- "value": {
- "type": "color-background",
- "mode": "gradient"
- }
- }
- ]
- },
- {
- "matcher": {
- "id": "byName",
- "options": "Disk %"
- },
- "properties": [
- {
- "id": "unit",
- "value": "percentunit"
- },
- {
- "id": "thresholds",
- "value": {
- "mode": "absolute",
- "steps": [
- {
- "color": "green",
- "value": null
- },
- {
- "color": "red",
- "value": 0.8
- }
- ]
- }
- },
- {
- "id": "color",
- "value": {
- "mode": "thresholds"
- }
- },
- {
- "id": "custom.cellOptions",
- "value": {
- "type": "color-background",
- "mode": "gradient"
- }
- }
- ]
- }
- ]
+ "overrides": []
},
"gridPos": {
- "h": 9,
- "w": 24,
+ "h": 8,
+ "w": 12,
"x": 0,
- "y": 7
+ "y": 6
},
- "id": 12,
+ "id": 111,
"options": {
"showHeader": true,
"cellHeight": "sm",
@@ -539,74 +568,16 @@
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
- "expr": "banyandb_system_up_time{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\"}",
+ "exemplar": false,
+ "expr": "sum by (node_type, pod_name, group)(banyandb_lifecycle_cycles_total{job=~\"$job\", group=~\"$group\"})",
+ "format": "table",
"instant": true,
- "legendFormat": "",
+ "legendFormat": "__auto",
"range": false,
- "refId": "A",
- "format": "table"
- },
- {
- "datasource": {
- "type": "prometheus",
- "uid": "${DS_PROMETHEUS}"
- },
- "editorMode": "code",
- "expr": "sum(rate(process_cpu_seconds_total{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\"}[$__rate_interval])) by (pod_name)",
- "instant": true,
- "legendFormat": "",
- "range": false,
- "refId": "B",
- "format": "table"
- },
- {
- "datasource": {
- "type": "prometheus",
- "uid": "${DS_PROMETHEUS}"
- },
- "editorMode": "code",
- "expr": "sum(process_resident_memory_bytes{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\"}) by (pod_name)",
- "instant": true,
- "legendFormat": "",
- "range": false,
- "refId": "C",
- "format": "table"
- },
- {
- "datasource": {
- "type": "prometheus",
- "uid": "${DS_PROMETHEUS}"
- },
- "editorMode": "code",
- "expr": "max(banyandb_system_memory_state{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\",kind=\"used_percent\"}) by (pod_name)",
- "instant": true,
- "legendFormat": "",
- "range": false,
- "refId": "D",
- "format": "table"
- },
- {
- "datasource": {
- "type": "prometheus",
- "uid": "${DS_PROMETHEUS}"
- },
- "editorMode": "code",
- "expr": "max by (pod_name) ( banyandb_system_disk{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\",kind=\"used\"} / on (pod_name, path) banyandb_system_disk{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\",kind=\"total\"})",
- "instant": true,
- "legendFormat": "",
- "range": false,
- "refId": "E",
- "format": "table"
+ "refId": "A"
}
],
"transformations": [
- {
- "id": "joinByField",
- "options": {
- "byField": "pod_name",
- "mode": "outer"
- }
- },
{
"id": "organize",
"options": {
@@ -614,52 +585,38 @@
"Time": true,
"job": true,
"instance": true,
+ "pod": true,
+ "container_name": true,
"node_role": true,
+ "remote_role": true,
+ "remote_tier": true,
+ "remote_node": true,
"__name__": true
},
"renameByName": {
- "pod_name": "Node",
- "container_name": "Role",
- "Value #A": "Uptime",
- "Value #B": "CPU (cores)",
- "Value #C": "RSS",
- "Value #D": "Mem %",
- "Value #E": "Disk %"
+ "node_type": "Src tier",
+ "pod_name": "Src pod",
+ "group": "Group",
+ "Value": "Cycles"
},
"indexByName": {
- "pod_name": 0,
- "container_name": 1,
- "Value #A": 2,
- "Value #B": 3,
- "Value #C": 4,
- "Value #D": 5,
- "Value #E": 6
+ "node_type": 0,
+ "pod_name": 1,
+ "group": 2,
+ "Value": 3
}
}
}
],
- "title": "Per-node Health",
+ "title": "Cycle Ledger (per source pod \u00d7 group)",
"type": "table"
},
- {
- "collapsed": false,
- "gridPos": {
- "h": 1,
- "w": 24,
- "x": 0,
- "y": 16
- },
- "id": 60,
- "panels": [],
- "title": "Topology: Pod-to-Pod Flows",
- "type": "row"
- },
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
- "description": "Streaming pod-to-pod flows (excludes tier-migration
edges; see the Migration Flows panel below for those). Each row is one directed
source→target per (group, operation). Pub side is the publisher container; Sub
side is the receiver. Units are per-second (rate over $__range). Pub msg/s and
Sub msg/s should match side-to-side; a populated Pub cell with an empty Sub
cell is signal — an uninstrumented side or missing scrape target. p99 latencies
are histogram-quantile o [...]
+ "description": "Per source pod: age of the last completed cycle and its success flag.",
"fieldConfig": {
"defaults": {
"color": {
@@ -670,221 +627,87 @@
"cellOptions": {
"type": "auto"
},
- "filterable": false,
"inspect": false
},
- "decimals": 2,
- "mappings": [
- {
- "type": "special",
- "options": {
- "match": "nan",
- "result": {
- "text": "-",
- "index": 0
- }
- }
- }
- ],
- "noValue": "-",
+ "mappings": [],
"thresholds": {
"mode": "absolute",
"steps": [
{
- "color": "text",
+ "color": "green",
"value": null
}
]
- },
- "unit": "ops"
+ }
},
"overrides": [
{
"matcher": {
"id": "byName",
- "options": "Pub p99"
- },
- "properties": [
- {
- "id": "unit",
- "value": "s"
- },
- {
- "id": "decimals",
- "value": 3
- }
- ]
- },
- {
- "matcher": {
- "id": "byName",
- "options": "Sub p99"
+ "options": "Last run age"
},
"properties": [
{
"id": "unit",
"value": "s"
- },
- {
- "id": "decimals",
- "value": 3
- }
- ]
- },
- {
- "matcher": {
- "id": "byName",
- "options": "Pub B/s"
- },
- "properties": [
- {
- "id": "unit",
- "value": "Bps"
- }
- ]
- },
- {
- "matcher": {
- "id": "byName",
- "options": "Sub B/s"
- },
- "properties": [
- {
- "id": "unit",
- "value": "Bps"
}
]
},
{
"matcher": {
"id": "byName",
- "options": "Pub err/s"
+ "options": "Success"
},
"properties": [
{
- "id": "unit",
- "value": "ops"
- },
- {
- "id": "custom.cellOptions",
- "value": {
- "type": "color-text"
- }
- },
- {
- "id": "thresholds",
- "value": {
- "mode": "absolute",
- "steps": [
- {
- "color": "text",
- "value": null
- },
- {
- "color": "red",
- "value": 1e-06
+ "id": "mappings",
+ "value": [
+ {
+ "type": "value",
+ "options": {
+ "1": {
+ "text": "OK",
+ "color": "green",
+ "index": 0
+ },
+ "0": {
+ "text": "FAIL",
+ "color": "red",
+ "index": 1
+ }
}
- ]
- }
- }
- ]
- },
- {
- "matcher": {
- "id": "byName",
- "options": "Sub err/s"
- },
- "properties": [
- {
- "id": "unit",
- "value": "ops"
+ }
+ ]
},
{
"id": "custom.cellOptions",
"value": {
- "type": "color-text"
- }
- },
- {
- "id": "thresholds",
- "value": {
- "mode": "absolute",
- "steps": [
- {
- "color": "text",
- "value": null
- },
- {
- "color": "red",
- "value": 1e-06
- }
- ]
+ "type": "color-background",
+ "mode": "basic"
}
}
]
- },
- {
- "matcher": {
- "id": "byName",
- "options": "Source"
- },
- "properties": [
- {
- "id": "custom.filterable",
- "value": true
- }
- ]
- },
- {
- "matcher": {
- "id": "byName",
- "options": "Target"
- },
- "properties": [
- {
- "id": "custom.filterable",
- "value": true
- }
- ]
- },
- {
- "matcher": {
- "id": "byName",
- "options": "Operation"
- },
- "properties": [
- {
- "id": "custom.filterable",
- "value": true
- }
- ]
}
]
},
"gridPos": {
- "h": 13,
- "w": 24,
- "x": 0,
- "y": 17
+ "h": 8,
+ "w": 12,
+ "x": 12,
+ "y": 6
},
- "id": 61,
+ "id": 112,
"options": {
+ "showHeader": true,
"cellHeight": "sm",
"footer": {
- "countRows": false,
- "fields": "",
+ "show": false,
"reducer": [
"sum"
],
- "show": false
- },
- "showHeader": true,
- "sortBy": [
- {
- "desc": true,
- "displayName": "Pub msg/s"
- }
- ]
+ "countRows": false,
+ "fields": ""
+ }
},
"pluginVersion": "11.2.0",
"targets": [
@@ -895,12 +718,12 @@
},
"editorMode": "code",
"exemplar": false,
- "expr": "sum by (source, target, operation) (label_replace(label_replace(rate(banyandb_queue_pub_total_finished{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\"}[$__range]) or rate(banyandb_queue_pub_total_finished{job=~\"$job\", remote_role=~\"$role\", remote_node=~\"($pod)\\\\..*\"}[$__range]), \"source\", \"$1\", \"pod_name\", \"(.*)\"), \"target\", \"$1\", \"remote_node\", \"([^.:]+).*\"))",
- "format": "table",
+ "expr": "time() - banyandb_lifecycle_last_run_timestamp_seconds{job=~\"$job\", group=~\"$group\"}",
"instant": true,
"legendFormat": "__auto",
"range": false,
- "refId": "A"
+ "refId": "A",
+ "format": "table"
},
{
"datasource": {
@@ -909,139 +732,54 @@
},
"editorMode": "code",
"exemplar": false,
- "expr": "sum by (source, target, operation) (label_replace(label_replace(rate(banyandb_queue_sub_total_started{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\", remote_role!~\"lifecycle\"}[$__range]) or rate(banyandb_queue_sub_total_started{job=~\"$job\", remote_role=~\"$role\", remote_node=~\"($pod)\\\\..*\", remote_role!~\"lifecycle\"}[$__range]), \"source\", \"$1\", \"remote_node\", \"([^.:]+).*\"), \"target\", \"$1\", \"pod_name\", \"(.*)\"))",
- "format": "table",
+ "expr": "max by (pod_name)(banyandb_lifecycle_last_run_success{job=~\"$job\", group=~\"$group\"})",
"instant": true,
"legendFormat": "__auto",
"range": false,
- "refId": "B"
- },
- {
- "datasource": {
- "type": "prometheus",
- "uid": "${DS_PROMETHEUS}"
- },
- "editorMode": "code",
- "exemplar": false,
- "expr": "histogram_quantile(0.99, sum by (le, source, target, operation) (label_replace(label_replace(rate(banyandb_queue_pub_total_latency_bucket{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\"}[$__range]) or rate(banyandb_queue_pub_total_latency_bucket{job=~\"$job\", remote_role=~\"$role\", remote_node=~\"($pod)\\\\..*\"}[$__range]), \"source\", \"$1\", \"pod_name\", \"(.*)\"), \"target\", \"$1\", \"remote_node\", \"([^.:]+).*\")))",
- "format": "table",
- "instant": true,
- "legendFormat": "__auto",
- "range": false,
- "refId": "C"
- },
- {
- "datasource": {
- "type": "prometheus",
- "uid": "${DS_PROMETHEUS}"
- },
- "editorMode": "code",
- "exemplar": false,
- "expr": "histogram_quantile(0.99, sum by (le, source, target,
operation)
(label_replace(label_replace(rate(banyandb_queue_sub_total_latency_bucket{job=~\"$job\",
container_name=~\"$role\", pod_name=~\"$pod\",
remote_role!~\"lifecycle\"}[$__range]) or
rate(banyandb_queue_sub_total_latency_bucket{job=~\"$job\",
remote_role=~\"$role\", remote_node=~\"($pod)\\\\..*\",
remote_role!~\"lifecycle\"}[$__range]), \"source\", \"$1\", \"remote_node\",
\"([^.:]+).*\"), \"target\", \"$1\", \ [...]
- "format": "table",
- "instant": true,
- "legendFormat": "__auto",
- "range": false,
- "refId": "D"
- },
- {
- "datasource": {
- "type": "prometheus",
- "uid": "${DS_PROMETHEUS}"
- },
- "editorMode": "code",
- "exemplar": false,
- "expr": "sum by (source, target, operation) (label_replace(label_replace(rate(banyandb_queue_pub_total_err{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\"}[$__range]) or rate(banyandb_queue_pub_total_err{job=~\"$job\", remote_role=~\"$role\", remote_node=~\"($pod)\\\\..*\"}[$__range]), \"source\", \"$1\", \"pod_name\", \"(.*)\"), \"target\", \"$1\", \"remote_node\", \"([^.:]+).*\"))",
- "format": "table",
- "instant": true,
- "legendFormat": "__auto",
- "range": false,
- "refId": "E"
- },
- {
- "datasource": {
- "type": "prometheus",
- "uid": "${DS_PROMETHEUS}"
- },
- "editorMode": "code",
- "exemplar": false,
- "expr": "sum by (source, target, operation) (label_replace(label_replace(rate(banyandb_queue_sub_total_err{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\", remote_role!~\"lifecycle\"}[$__range]) or rate(banyandb_queue_sub_total_err{job=~\"$job\", remote_role=~\"$role\", remote_node=~\"($pod)\\\\..*\", remote_role!~\"lifecycle\"}[$__range]), \"source\", \"$1\", \"remote_node\", \"([^.:]+).*\"), \"target\", \"$1\", \"pod_name\", \"(.*)\"))",
- "format": "table",
- "instant": true,
- "legendFormat": "__auto",
- "range": false,
- "refId": "F"
- },
- {
- "datasource": {
- "type": "prometheus",
- "uid": "${DS_PROMETHEUS}"
- },
- "editorMode": "code",
- "exemplar": false,
- "expr": "sum by (source, target, operation) (label_replace(label_replace(rate(banyandb_queue_pub_sent_bytes{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\"}[$__range]) or rate(banyandb_queue_pub_sent_bytes{job=~\"$job\", remote_role=~\"$role\", remote_node=~\"($pod)\\\\..*\"}[$__range]), \"source\", \"$1\", \"pod_name\", \"(.*)\"), \"target\", \"$1\", \"remote_node\", \"([^.:]+).*\"))",
- "format": "table",
- "instant": true,
- "legendFormat": "__auto",
- "range": false,
- "refId": "G"
- },
- {
- "datasource": {
- "type": "prometheus",
- "uid": "${DS_PROMETHEUS}"
- },
- "editorMode": "code",
- "exemplar": false,
- "expr": "sum by (source, target, operation) (label_replace(label_replace(rate(banyandb_queue_sub_received_bytes{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\", remote_role!~\"lifecycle\"}[$__range]) or rate(banyandb_queue_sub_received_bytes{job=~\"$job\", remote_role=~\"$role\", remote_node=~\"($pod)\\\\..*\", remote_role!~\"lifecycle\"}[$__range]), \"source\", \"$1\", \"remote_node\", \"([^.:]+).*\"), \"target\", \"$1\", \"pod_name\", \"(.*)\"))",
- "format": "table",
- "instant": true,
- "legendFormat": "__auto",
- "range": false,
- "refId": "H"
+ "refId": "B",
+ "format": "table"
}
],
- "title": "Flows — Publisher vs Subscriber View",
"transformations": [
{
- "id": "merge",
- "options": {}
+ "id": "joinByField",
+ "options": {
+ "byField": "pod_name",
+ "mode": "outer"
+ }
},
{
"id": "organize",
"options": {
"excludeByName": {
- "Time": true
- },
- "indexByName": {
- "source": 0,
- "target": 1,
- "operation": 2,
- "Value #A": 3,
- "Value #B": 4,
- "Value #C": 5,
- "Value #D": 6,
- "Value #E": 7,
- "Value #F": 8,
- "Value #G": 9,
- "Value #H": 10
+ "Time": true,
+ "job": true,
+ "instance": true,
+ "pod": true,
+ "container_name": true,
+ "node_role": true,
+ "remote_role": true,
+ "remote_node": true,
+ "remote_tier": true,
+ "group": true,
+ "__name__": true
},
"renameByName": {
- "source": "Source",
- "target": "Target",
- "operation": "Operation",
- "Value #A": "Pub msg/s",
- "Value #B": "Sub msg/s",
- "Value #C": "Pub p99",
- "Value #D": "Sub p99",
- "Value #E": "Pub err/s",
- "Value #F": "Sub err/s",
- "Value #G": "Pub B/s",
- "Value #H": "Sub B/s"
+ "pod_name": "Src pod",
+ "node_type": "Tier",
+ "Value #A": "Last run age",
+ "Value #B": "Success"
+ },
+ "indexByName": {
+ "pod_name": 0,
+ "node_type": 1,
+ "Value #A": 2,
+ "Value #B": 3
}
}
}
],
+ "title": "Last Run per Source Pod",
"type": "table"
},
{
@@ -1049,7 +787,121 @@
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
- "description": "Cross-tier lifecycle migration flows (hot→warm,
warm→cold). Each row is one directed source→target per (group, operation). Pub
side is the lifecycle sidecar inside the SOURCE data pod (publishes via
banyandb_lifecycle_migration_*); Sub side is the data pod RECEIVING (records
via banyandb_queue_sub_total_started{remote_role=\"lifecycle\"}).
Counts/bytes/errors are per-day (rate × 86400) because tier migration is a
daily-batch workload and an instant per-second rate i [...]
+ "description": "Completed migration cycles per source tier over each interval (banyandb_lifecycle_cycles_total). Attributed to the source node only \u2014 cycles_total has no destination dimension.",
+ "fieldConfig": {
+ "defaults": {
+ "color": {
+ "mode": "palette-classic"
+ },
+ "custom": {
+ "axisBorderShow": false,
+ "axisCenteredZero": false,
+ "axisColorMode": "text",
+ "axisLabel": "",
+ "axisPlacement": "auto",
+ "barAlignment": 0,
+ "barWidthFactor": 0.6,
+ "drawStyle": "line",
+ "fillOpacity": 10,
+ "gradientMode": "none",
+ "hideFrom": {
+ "legend": false,
+ "tooltip": false,
+ "viz": false
+ },
+ "insertNulls": false,
+ "lineInterpolation": "linear",
+ "lineWidth": 1,
+ "pointSize": 5,
+ "scaleDistribution": {
+ "type": "linear"
+ },
+ "showPoints": "auto",
+ "spanNulls": false,
+ "stacking": {
+ "group": "A",
+ "mode": "none"
+ },
+ "thresholdsStyle": {
+ "mode": "off"
+ }
+ },
+ "mappings": [],
+ "thresholds": {
+ "mode": "absolute",
+ "steps": [
+ {
+ "color": "green",
+ "value": null
+ }
+ ]
+ },
+ "unit": "short"
+ },
+ "overrides": []
+ },
+ "gridPos": {
+ "h": 7,
+ "w": 24,
+ "x": 0,
+ "y": 14
+ },
+ "id": 113,
+ "options": {
+ "legend": {
+ "calcs": [
+ "lastNotNull",
+ "max",
+ "mean"
+ ],
+ "displayMode": "table",
+ "placement": "bottom",
+ "showLegend": true,
+ "sortBy": "Last *",
+ "sortDesc": true
+ },
+ "tooltip": {
+ "mode": "multi",
+ "sort": "desc"
+ }
+ },
+ "pluginVersion": "11.2.0",
+ "targets": [
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${DS_PROMETHEUS}"
+ },
+ "editorMode": "code",
+ "expr": "sum by (node_type)(increase(banyandb_lifecycle_cycles_total{job=~\"$job\", group=~\"$group\"}[$__interval]))",
+ "instant": false,
+ "legendFormat": "{{node_type}} tier",
+ "range": true,
+ "refId": "A"
+ }
+ ],
+ "title": "Cycles Over Time (by source tier)",
+ "type": "timeseries"
+ },
+ {
+ "collapsed": false,
+ "gridPos": {
+ "h": 1,
+ "w": 24,
+ "x": 0,
+ "y": 21
+ },
+ "id": 120,
+ "panels": [],
+ "title": "Migration Flows",
+ "type": "row"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${DS_PROMETHEUS}"
+ },
+ "description": "Cross-tier lifecycle migration flows (hot\u2192warm, warm\u2192cold). Each row is one directed source\u2192target per (group, operation). Pub side is the lifecycle sidecar inside the SOURCE data pod (publishes via banyandb_lifecycle_migration_*); Sub side is the data pod RECEIVING (records via banyandb_queue_sub_total_message_started for batch-write, banyandb_queue_sub_total_finished for file-sync, and total_batch_started{remote_role=\"lifecycle\"}). Counts/bytes/er [...]
"fieldConfig": {
"defaults": {
"color": {
@@ -1367,714 +1219,48 @@
"properties": [
{
"id": "custom.filterable",
- "value": true
- }
- ]
- },
- {
- "matcher": {
- "id": "byName",
- "options": "Operation"
- },
- "properties": [
- {
- "id": "custom.filterable",
- "value": true
- }
- ]
- }
- ]
- },
- "gridPos": {
- "h": 13,
- "w": 24,
- "x": 0,
- "y": 30
- },
- "id": 62,
- "options": {
- "cellHeight": "sm",
- "footer": {
- "countRows": false,
- "fields": "",
- "reducer": [
- "sum"
- ],
- "show": false
- },
- "showHeader": true,
- "sortBy": [
- {
- "desc": true,
- "displayName": "Pub msg/s"
- }
- ]
- },
- "pluginVersion": "11.2.0",
- "targets": [
- {
- "datasource": {
- "type": "prometheus",
- "uid": "${DS_PROMETHEUS}"
- },
- "editorMode": "code",
- "exemplar": false,
- "format": "table",
- "instant": true,
- "legendFormat": "__auto",
- "range": false,
- "refId": "A",
- "expr": "sum by (source, target, operation) (label_replace(label_replace(rate(banyandb_lifecycle_migration_total_finished{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\"}[$__range]) * 86400 or rate(banyandb_lifecycle_migration_total_finished{job=~\"$job\", remote_role=~\"$role\", remote_node=~\"($pod)\\\\..*\"}[$__range]) * 86400, \"source\", \"$1\", \"pod_name\", \"(.*)\"), \"target\", \"$1\", \"remote_node\", \"([^.:]+).*\"))"
- },
- {
- "datasource": {
- "type": "prometheus",
- "uid": "${DS_PROMETHEUS}"
- },
- "editorMode": "code",
- "exemplar": false,
- "format": "table",
- "instant": true,
- "legendFormat": "__auto",
- "range": false,
- "refId": "B",
- "expr": "sum by (source, target, operation) (label_replace(label_replace(rate(banyandb_queue_sub_total_started{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\", remote_role=\"lifecycle\"}[$__range]) * 86400 or rate(banyandb_queue_sub_total_started{job=~\"$job\", remote_role=\"lifecycle\", remote_node=~\"($pod)\\\\..*\"}[$__range]) * 86400, \"source\", \"$1\", \"remote_node\", \"([^.:]+).*\"), \"target\", \"$1\", \"pod_name\", \"(.*)\"))"
- },
- {
- "datasource": {
- "type": "prometheus",
- "uid": "${DS_PROMETHEUS}"
- },
- "editorMode": "code",
- "exemplar": false,
- "format": "table",
- "instant": true,
- "legendFormat": "__auto",
- "range": false,
- "refId": "C",
- "expr": "histogram_quantile(0.99, sum by (le, source, target, operation) (label_replace(label_replace(rate(banyandb_lifecycle_migration_total_latency_bucket{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\"}[$__range]) or rate(banyandb_lifecycle_migration_total_latency_bucket{job=~\"$job\", remote_role=~\"$role\", remote_node=~\"($pod)\\\\..*\"}[$__range]), \"source\", \"$1\", \"pod_name\", \"(.*)\"), \"target\", \"$1\", \"remote_node\", \"([^.:]+).*\")))"
- },
- {
- "datasource": {
- "type": "prometheus",
- "uid": "${DS_PROMETHEUS}"
- },
- "editorMode": "code",
- "exemplar": false,
- "format": "table",
- "instant": true,
- "legendFormat": "__auto",
- "range": false,
- "refId": "D",
- "expr": "histogram_quantile(0.99, sum by (le, source, target, operation) (label_replace(label_replace(rate(banyandb_queue_sub_total_latency_bucket{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\", remote_role=\"lifecycle\"}[$__range]) or rate(banyandb_queue_sub_total_latency_bucket{job=~\"$job\", remote_role=\"lifecycle\", remote_node=~\"($pod)\\\\..*\"}[$__range]), \"source\", \"$1\", \"remote_node\", \"([^.:]+).*\"), \"target\", \"$1\", \"pod_name\", \"(.*)\")))"
- },
- {
- "datasource": {
- "type": "prometheus",
- "uid": "${DS_PROMETHEUS}"
- },
- "editorMode": "code",
- "exemplar": false,
- "format": "table",
- "instant": true,
- "legendFormat": "__auto",
- "range": false,
- "refId": "E",
- "expr": "sum by (source, target, operation) (label_replace(label_replace(rate(banyandb_lifecycle_migration_total_err{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\"}[$__range]) * 86400 or rate(banyandb_lifecycle_migration_total_err{job=~\"$job\", remote_role=~\"$role\", remote_node=~\"($pod)\\\\..*\"}[$__range]) * 86400, \"source\", \"$1\", \"pod_name\", \"(.*)\"), \"target\", \"$1\", \"remote_node\", \"([^.:]+).*\"))"
- },
- {
- "datasource": {
- "type": "prometheus",
- "uid": "${DS_PROMETHEUS}"
- },
- "editorMode": "code",
- "exemplar": false,
- "format": "table",
- "instant": true,
- "legendFormat": "__auto",
- "range": false,
- "refId": "F",
- "expr": "sum by (source, target, operation) (label_replace(label_replace(rate(banyandb_queue_sub_total_err{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\", remote_role=\"lifecycle\"}[$__range]) * 86400 or rate(banyandb_queue_sub_total_err{job=~\"$job\", remote_role=\"lifecycle\", remote_node=~\"($pod)\\\\..*\"}[$__range]) * 86400, \"source\", \"$1\", \"remote_node\", \"([^.:]+).*\"), \"target\", \"$1\", \"pod_name\", \"(.*)\"))"
- },
- {
- "datasource": {
- "type": "prometheus",
- "uid": "${DS_PROMETHEUS}"
- },
- "editorMode": "code",
- "exemplar": false,
- "format": "table",
- "instant": true,
- "legendFormat": "__auto",
- "range": false,
- "refId": "G",
- "expr": "sum by (source, target, operation) (label_replace(label_replace(rate(banyandb_lifecycle_migration_sent_bytes{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\"}[$__range]) * 86400 or rate(banyandb_lifecycle_migration_sent_bytes{job=~\"$job\", remote_role=~\"$role\", remote_node=~\"($pod)\\\\..*\"}[$__range]) * 86400, \"source\", \"$1\", \"pod_name\", \"(.*)\"), \"target\", \"$1\", \"remote_node\", \"([^.:]+).*\"))"
- },
- {
- "datasource": {
- "type": "prometheus",
- "uid": "${DS_PROMETHEUS}"
- },
- "editorMode": "code",
- "exemplar": false,
- "format": "table",
- "instant": true,
- "legendFormat": "__auto",
- "range": false,
- "refId": "H",
- "expr": "sum by (source, target, operation) (label_replace(label_replace(rate(banyandb_queue_sub_received_bytes{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\", remote_role=\"lifecycle\"}[$__range]) * 86400 or rate(banyandb_queue_sub_received_bytes{job=~\"$job\", remote_role=\"lifecycle\", remote_node=~\"($pod)\\\\..*\"}[$__range]) * 86400, \"source\", \"$1\", \"remote_node\", \"([^.:]+).*\"), \"target\", \"$1\", \"pod_name\", \"(.*)\"))"
- }
- ],
- "title": "Migration Flows — Tier Migrations (lifecycle)",
- "transformations": [
- {
- "id": "merge",
- "options": {}
- },
- {
- "id": "organize",
- "options": {
- "excludeByName": {
- "Time": true
- },
- "indexByName": {
- "source": 0,
- "target": 1,
- "operation": 2,
- "Value #A": 3,
- "Value #B": 4,
- "Value #C": 5,
- "Value #D": 6,
- "Value #E": 7,
- "Value #F": 8,
- "Value #G": 9,
- "Value #H": 10
- },
- "renameByName": {
- "source": "Source",
- "target": "Target",
- "operation": "Operation",
- "Value #A": "Pub msg/day",
- "Value #B": "Sub msg/day",
- "Value #C": "Pub p99",
- "Value #D": "Sub p99",
- "Value #E": "Pub err/day",
- "Value #F": "Sub err/day",
- "Value #G": "Pub B/day",
- "Value #H": "Sub B/day"
- }
- }
- }
- ],
- "type": "table"
- },
- {
- "collapsed": false,
- "gridPos": {
- "h": 1,
- "w": 24,
- "x": 0,
- "y": 43
- },
- "id": 13,
- "panels": [],
- "title": "Resources",
- "type": "row"
- },
- {
- "datasource": {
- "type": "prometheus",
- "uid": "${DS_PROMETHEUS}"
- },
- "description": "CPU cores used per node (rate of process_cpu_seconds_total).",
- "fieldConfig": {
- "defaults": {
- "color": {
- "mode": "palette-classic"
- },
- "custom": {
- "axisBorderShow": false,
- "axisCenteredZero": false,
- "axisColorMode": "text",
- "axisLabel": "",
- "axisPlacement": "auto",
- "barAlignment": 0,
- "barWidthFactor": 0.6,
- "drawStyle": "line",
- "fillOpacity": 10,
- "gradientMode": "none",
- "hideFrom": {
- "legend": false,
- "tooltip": false,
- "viz": false
- },
- "insertNulls": false,
- "lineInterpolation": "linear",
- "lineWidth": 1,
- "pointSize": 5,
- "scaleDistribution": {
- "type": "linear"
- },
- "showPoints": "auto",
- "spanNulls": false,
- "stacking": {
- "group": "A",
- "mode": "none"
- },
- "thresholdsStyle": {
- "mode": "off"
- }
- },
- "mappings": [],
- "thresholds": {
- "mode": "absolute",
- "steps": [
- {
- "color": "green",
- "value": null
- }
- ]
- },
- "unit": "percentunit"
- },
- "overrides": []
- },
- "gridPos": {
- "h": 8,
- "w": 12,
- "x": 0,
- "y": 44
- },
- "id": 14,
- "options": {
- "legend": {
- "calcs": [
- "lastNotNull",
- "max",
- "mean"
- ],
- "displayMode": "table",
- "placement": "bottom",
- "showLegend": true,
- "sortBy": "Last *",
- "sortDesc": true
- },
- "tooltip": {
- "mode": "multi",
- "sort": "desc"
- }
- },
- "pluginVersion": "11.2.0",
- "targets": [
- {
- "datasource": {
- "type": "prometheus",
- "uid": "${DS_PROMETHEUS}"
- },
- "editorMode": "code",
- "expr": "sum(rate(process_cpu_seconds_total{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\"}[$__rate_interval])) by (pod_name)",
- "instant": false,
- "legendFormat": "{{pod_name}}",
- "range": true,
- "refId": "A"
- }
- ],
- "title": "CPU Usage",
- "type": "timeseries"
- },
- {
- "datasource": {
- "type": "prometheus",
- "uid": "${DS_PROMETHEUS}"
- },
- "description": "Resident set size per node.",
- "fieldConfig": {
- "defaults": {
- "color": {
- "mode": "palette-classic"
- },
- "custom": {
- "axisBorderShow": false,
- "axisCenteredZero": false,
- "axisColorMode": "text",
- "axisLabel": "",
- "axisPlacement": "auto",
- "barAlignment": 0,
- "barWidthFactor": 0.6,
- "drawStyle": "line",
- "fillOpacity": 10,
- "gradientMode": "none",
- "hideFrom": {
- "legend": false,
- "tooltip": false,
- "viz": false
- },
- "insertNulls": false,
- "lineInterpolation": "linear",
- "lineWidth": 1,
- "pointSize": 5,
- "scaleDistribution": {
- "type": "linear"
- },
- "showPoints": "auto",
- "spanNulls": false,
- "stacking": {
- "group": "A",
- "mode": "none"
- },
- "thresholdsStyle": {
- "mode": "off"
- }
- },
- "mappings": [],
- "thresholds": {
- "mode": "absolute",
- "steps": [
- {
- "color": "green",
- "value": null
- }
- ]
- },
- "unit": "bytes"
- },
- "overrides": []
- },
- "gridPos": {
- "h": 8,
- "w": 12,
- "x": 12,
- "y": 44
- },
- "id": 15,
- "options": {
- "legend": {
- "calcs": [
- "lastNotNull",
- "max",
- "mean"
- ],
- "displayMode": "table",
- "placement": "bottom",
- "showLegend": true,
- "sortBy": "Last *",
- "sortDesc": true
- },
- "tooltip": {
- "mode": "multi",
- "sort": "desc"
- }
- },
- "pluginVersion": "11.2.0",
- "targets": [
- {
- "datasource": {
- "type": "prometheus",
- "uid": "${DS_PROMETHEUS}"
- },
- "editorMode": "code",
- "expr": "max by (pod_name) (max_over_time(process_resident_memory_bytes{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\"}[$__rate_interval]))",
- "instant": false,
- "legendFormat": "{{pod_name}}",
- "range": true,
- "refId": "A"
- }
- ],
- "title": "RSS Memory",
- "type": "timeseries"
- },
- {
- "datasource": {
- "type": "prometheus",
- "uid": "${DS_PROMETHEUS}"
- },
- "description": "System memory used percent per node (from kind=used_percent).",
- "fieldConfig": {
- "defaults": {
- "color": {
- "mode": "palette-classic"
- },
- "custom": {
- "axisBorderShow": false,
- "axisCenteredZero": false,
- "axisColorMode": "text",
- "axisLabel": "",
- "axisPlacement": "auto",
- "barAlignment": 0,
- "barWidthFactor": 0.6,
- "drawStyle": "line",
- "fillOpacity": 10,
- "gradientMode": "none",
- "hideFrom": {
- "legend": false,
- "tooltip": false,
- "viz": false
- },
- "insertNulls": false,
- "lineInterpolation": "linear",
- "lineWidth": 1,
- "pointSize": 5,
- "scaleDistribution": {
- "type": "linear"
- },
- "showPoints": "auto",
- "spanNulls": false,
- "stacking": {
- "group": "A",
- "mode": "none"
- },
- "thresholdsStyle": {
- "mode": "off"
- }
- },
- "mappings": [],
- "thresholds": {
- "mode": "absolute",
- "steps": [
- {
- "color": "green",
- "value": null
- },
- {
- "color": "red",
- "value": 80
- }
- ]
- },
- "unit": "percent"
- },
- "overrides": []
- },
- "gridPos": {
- "h": 8,
- "w": 12,
- "x": 0,
- "y": 52
- },
- "id": 16,
- "options": {
- "legend": {
- "calcs": [
- "lastNotNull",
- "max",
- "mean"
- ],
- "displayMode": "table",
- "placement": "bottom",
- "showLegend": true,
- "sortBy": "Last *",
- "sortDesc": true
- },
- "tooltip": {
- "mode": "multi",
- "sort": "desc"
- }
- },
- "pluginVersion": "11.2.0",
- "targets": [
- {
- "datasource": {
- "type": "prometheus",
- "uid": "${DS_PROMETHEUS}"
- },
- "editorMode": "code",
- "expr": "max(banyandb_system_memory_state{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\",kind=\"used_percent\"}) by (pod_name)",
- "instant": false,
- "legendFormat": "{{pod_name}}",
- "range": true,
- "refId": "A"
- }
- ],
- "title": "System Memory %",
- "type": "timeseries"
- },
- {
- "datasource": {
- "type": "prometheus",
- "uid": "${DS_PROMETHEUS}"
- },
- "description": "Disk used / total per node (aggregated across all storage paths).",
- "fieldConfig": {
- "defaults": {
- "color": {
- "mode": "palette-classic"
- },
- "custom": {
- "axisBorderShow": false,
- "axisCenteredZero": false,
- "axisColorMode": "text",
- "axisLabel": "",
- "axisPlacement": "auto",
- "barAlignment": 0,
- "barWidthFactor": 0.6,
- "drawStyle": "line",
- "fillOpacity": 10,
- "gradientMode": "none",
- "hideFrom": {
- "legend": false,
- "tooltip": false,
- "viz": false
- },
- "insertNulls": false,
- "lineInterpolation": "linear",
- "lineWidth": 1,
- "pointSize": 5,
- "scaleDistribution": {
- "type": "linear"
- },
- "showPoints": "auto",
- "spanNulls": false,
- "stacking": {
- "group": "A",
- "mode": "none"
- },
- "thresholdsStyle": {
- "mode": "off"
- }
- },
- "mappings": [],
- "thresholds": {
- "mode": "absolute",
- "steps": [
- {
- "color": "green",
- "value": null
- },
- {
- "color": "red",
- "value": 0.8
- }
- ]
- },
- "unit": "percentunit"
- },
- "overrides": []
- },
- "gridPos": {
- "h": 8,
- "w": 12,
- "x": 12,
- "y": 52
- },
- "id": 17,
- "options": {
- "legend": {
- "calcs": [
- "lastNotNull",
- "max",
- "mean"
- ],
- "displayMode": "table",
- "placement": "bottom",
- "showLegend": true,
- "sortBy": "Last *",
- "sortDesc": true
- },
- "tooltip": {
- "mode": "multi",
- "sort": "desc"
- }
- },
- "pluginVersion": "11.2.0",
- "targets": [
- {
- "datasource": {
- "type": "prometheus",
- "uid": "${DS_PROMETHEUS}"
- },
- "editorMode": "code",
- "expr": "sum(banyandb_system_disk{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\",kind=\"used\"}) by (pod_name) / sum(banyandb_system_disk{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\",kind=\"total\"}) by (pod_name)",
- "instant": false,
- "legendFormat": "{{pod_name}}",
- "range": true,
- "refId": "A"
- }
- ],
- "title": "Disk Usage %",
- "type": "timeseries"
- },
- {
- "datasource": {
- "type": "prometheus",
- "uid": "${DS_PROMETHEUS}"
- },
- "description": "Per-node NIC throughput (received / sent).",
- "fieldConfig": {
- "defaults": {
- "color": {
- "mode": "palette-classic"
- },
- "custom": {
- "axisBorderShow": false,
- "axisCenteredZero": false,
- "axisColorMode": "text",
- "axisLabel": "",
- "axisPlacement": "auto",
- "barAlignment": 0,
- "barWidthFactor": 0.6,
- "drawStyle": "line",
- "fillOpacity": 10,
- "gradientMode": "none",
- "hideFrom": {
- "legend": false,
- "tooltip": false,
- "viz": false
- },
- "insertNulls": false,
- "lineInterpolation": "linear",
- "lineWidth": 1,
- "pointSize": 5,
- "scaleDistribution": {
- "type": "linear"
- },
- "showPoints": "auto",
- "spanNulls": false,
- "stacking": {
- "group": "A",
- "mode": "none"
- },
- "thresholdsStyle": {
- "mode": "off"
- }
- },
- "mappings": [],
- "thresholds": {
- "mode": "absolute",
- "steps": [
- {
- "color": "green",
- "value": null
+ "value": true
}
]
},
- "unit": "binBps"
- },
- "overrides": []
+ {
+ "matcher": {
+ "id": "byName",
+ "options": "Operation"
+ },
+ "properties": [
+ {
+ "id": "custom.filterable",
+ "value": true
+ }
+ ]
+ }
+ ]
},
"gridPos": {
- "h": 8,
+ "h": 13,
"w": 24,
"x": 0,
- "y": 60
+ "y": 22
},
- "id": 18,
+ "id": 121,
"options": {
- "legend": {
- "calcs": [
- "lastNotNull",
- "max",
- "mean"
+ "cellHeight": "sm",
+ "footer": {
+ "countRows": false,
+ "fields": "",
+ "reducer": [
+ "sum"
],
- "displayMode": "table",
- "placement": "bottom",
- "showLegend": true,
- "sortBy": "Last *",
- "sortDesc": true
+ "show": false
},
- "tooltip": {
- "mode": "multi",
- "sort": "desc"
- }
+ "showHeader": true,
+ "sortBy": [
+ {
+ "desc": true,
+ "displayName": "Pub msg/day"
+ }
+ ]
},
"pluginVersion": "11.2.0",
"targets": [
@@ -2084,11 +1270,13 @@
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
- "expr": "sum(rate(banyandb_system_net_state{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\",kind=\"bytes_recv\"}[$__rate_interval])) by (pod_name, name)",
- "instant": false,
- "legendFormat": "{{pod_name}} {{name}} recv",
- "range": true,
- "refId": "A"
+ "exemplar": false,
+ "expr": "sum by (source, target, operation) (label_replace(label_replace(rate(banyandb_lifecycle_migration_total_finished{job=~\"$job\", group=~\"$group\", operation=~\"$operation\"}[$__range]) *86400 , \"source\", \"$1\", \"pod_name\", \"(.*)\"), \"target\", \"$1\", \"remote_node\", \"([^.:]+).*\"))",
+ "instant": true,
+ "legendFormat": "__auto",
+ "range": false,
+ "refId": "A",
+ "format": "table"
},
{
"datasource": {
@@ -2096,15 +1284,173 @@
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
- "expr": "sum(rate(banyandb_system_net_state{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\",kind=\"bytes_sent\"}[$__rate_interval])) by (pod_name, name)",
- "instant": false,
- "legendFormat": "{{pod_name}} {{name}} sent",
- "range": true,
- "refId": "B"
+ "exemplar": false,
+ "expr": "sum by (source, target, operation) (label_replace(label_replace(rate(banyandb_queue_sub_total_message_started{job=~\"$job\", group=~\"$group\", operation=~\"$operation\", remote_role=\"lifecycle\"}[$__range]) *86400 or rate(banyandb_queue_sub_total_finished{job=~\"$job\", group=~\"$group\", remote_role=\"lifecycle\", operation=~\"$operation\", operation=\"file-sync\"}[$__range]) *86400, \"source\", \"$1\", \"remote_node\", \"([^.:]+).*\"), \"target\", \"$1\", \"pod_nam [...]
+ "instant": true,
+ "legendFormat": "__auto",
+ "range": false,
+ "refId": "B",
+ "format": "table"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${DS_PROMETHEUS}"
+ },
+ "editorMode": "code",
+ "exemplar": false,
+ "expr": "histogram_quantile(0.99, sum by (le, source, target, operation) (label_replace(label_replace(rate(banyandb_lifecycle_migration_total_latency_bucket{job=~\"$job\", group=~\"$group\", operation=~\"$operation\"}[$__range]), \"source\", \"$1\", \"pod_name\", \"(.*)\"), \"target\", \"$1\", \"remote_node\", \"([^.:]+).*\")))",
+ "instant": true,
+ "legendFormat": "__auto",
+ "range": false,
+ "refId": "C",
+ "format": "table"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${DS_PROMETHEUS}"
+ },
+ "editorMode": "code",
+ "exemplar": false,
+ "expr": "histogram_quantile(0.99, sum by (le, source, target, operation) (label_replace(label_replace(rate(banyandb_queue_sub_total_latency_bucket{job=~\"$job\", group=~\"$group\", operation=~\"$operation\", remote_role=\"lifecycle\"}[$__range]), \"source\", \"$1\", \"remote_node\", \"([^.:]+).*\"), \"target\", \"$1\", \"pod_name\", \"(.*)\")))",
+ "instant": true,
+ "legendFormat": "__auto",
+ "range": false,
+ "refId": "D",
+ "format": "table"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${DS_PROMETHEUS}"
+ },
+ "editorMode": "code",
+ "exemplar": false,
+ "expr": "sum by (source, target, operation) (label_replace(label_replace(rate(banyandb_lifecycle_migration_total_err{job=~\"$job\", group=~\"$group\", operation=~\"$operation\"}[$__range]) *86400 , \"source\", \"$1\", \"pod_name\", \"(.*)\"), \"target\", \"$1\", \"remote_node\", \"([^.:]+).*\"))",
+ "instant": true,
+ "legendFormat": "__auto",
+ "range": false,
+ "refId": "E",
+ "format": "table"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${DS_PROMETHEUS}"
+ },
+ "editorMode": "code",
+ "exemplar": false,
+ "expr": "sum by (source, target, operation) (label_replace(label_replace(rate(banyandb_queue_sub_total_err{job=~\"$job\", group=~\"$group\", operation=~\"$operation\", remote_role=\"lifecycle\"}[$__range]) *86400 , \"source\", \"$1\", \"remote_node\", \"([^.:]+).*\"), \"target\", \"$1\", \"pod_name\", \"(.*)\"))",
+ "instant": true,
+ "legendFormat": "__auto",
+ "range": false,
+ "refId": "F",
+ "format": "table"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${DS_PROMETHEUS}"
+ },
+ "editorMode": "code",
+ "exemplar": false,
+ "expr": "sum by (source, target, operation) (label_replace(label_replace(rate(banyandb_lifecycle_migration_sent_bytes{job=~\"$job\", group=~\"$group\", operation=~\"$operation\"}[$__range]) *86400 , \"source\", \"$1\", \"pod_name\", \"(.*)\"), \"target\", \"$1\", \"remote_node\", \"([^.:]+).*\"))",
+ "instant": true,
+ "legendFormat": "__auto",
+ "range": false,
+ "refId": "G",
+ "format": "table"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${DS_PROMETHEUS}"
+ },
+ "editorMode": "code",
+ "exemplar": false,
+ "expr": "sum by (source, target, operation) (label_replace(label_replace(rate(banyandb_queue_sub_received_bytes{job=~\"$job\", group=~\"$group\", operation=~\"$operation\", remote_role=\"lifecycle\"}[$__range]) *86400 , \"source\", \"$1\", \"remote_node\", \"([^.:]+).*\"), \"target\", \"$1\", \"pod_name\", \"(.*)\"))",
+ "instant": true,
+ "legendFormat": "__auto",
+ "range": false,
+ "refId": "H",
+ "format": "table"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${DS_PROMETHEUS}"
+ },
+ "editorMode": "code",
+ "exemplar": false,
+ "expr": "sum by (source, target, operation) (label_replace(label_replace(rate(banyandb_lifecycle_migration_total_batch_finished{job=~\"$job\", group=~\"$group\", operation=~\"$operation\"}[$__range]) *86400 , \"source\", \"$1\", \"pod_name\", \"(.*)\"), \"target\", \"$1\", \"remote_node\", \"([^.:]+).*\"))",
+ "instant": true,
+ "legendFormat": "__auto",
+ "range": false,
+ "refId": "I",
+ "format": "table"
+ },
+ {
+ "datasource": {
+ "type": "prometheus",
+ "uid": "${DS_PROMETHEUS}"
+ },
+ "editorMode": "code",
+ "exemplar": false,
+ "expr": "sum by (source, target, operation) (label_replace(label_replace(rate(banyandb_queue_sub_total_batch_started{job=~\"$job\", group=~\"$group\", operation=~\"$operation\", remote_role=\"lifecycle\"}[$__range]) *86400 , \"source\", \"$1\", \"remote_node\", \"([^.:]+).*\"), \"target\", \"$1\", \"pod_name\", \"(.*)\"))",
+ "instant": true,
+ "legendFormat": "__auto",
+ "range": false,
+ "refId": "J",
+ "format": "table"
}
],
- "title": "Network Usage",
- "type": "timeseries"
+ "title": "Migration Flows \u2014 Pub vs Sub (per flow)",
+ "transformations": [
+ {
+ "id": "merge",
+ "options": {}
+ },
+ {
+ "id": "organize",
+ "options": {
+ "excludeByName": {
+ "Time": true
+ },
+ "indexByName": {
+ "source": 0,
+ "target": 1,
+ "operation": 2,
+ "Value #A": 3,
+ "Value #B": 4,
+ "Value #C": 5,
+ "Value #D": 6,
+ "Value #E": 7,
+ "Value #F": 8,
+ "Value #G": 9,
+ "Value #H": 10,
+ "Value #I": 11,
+ "Value #J": 12
+ },
+ "renameByName": {
+ "source": "Source",
+ "target": "Target",
+ "operation": "Operation",
+ "Value #A": "Pub msg/day",
+ "Value #B": "Sub msg/day",
+ "Value #C": "Pub p99",
+ "Value #D": "Sub p99",
+ "Value #E": "Pub err/day",
+ "Value #F": "Sub err/day",
+ "Value #G": "Pub B/day",
+ "Value #H": "Sub B/day",
+ "Value #I": "Pub batch/day",
+ "Value #J": "Sub batch/day"
+ }
+ }
+ }
+ ],
+ "type": "table"
},
{
"collapsed": false,
@@ -2112,11 +1458,11 @@
"h": 1,
"w": 24,
"x": 0,
- "y": 68
+ "y": 35
},
- "id": 19,
+ "id": 130,
"panels": [],
- "title": "Disk by Path",
+ "title": "Throughput & Volume",
"type": "row"
},
{
@@ -2124,7 +1470,7 @@
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
- "description": "On-disk usage per node and storage path.",
+ "description": "Publisher sent-bytes rate per flow and operation.",
"fieldConfig": {
"defaults": {
"color": {
@@ -2173,17 +1519,17 @@
}
]
},
- "unit": "bytes"
+ "unit": "Bps"
},
"overrides": []
},
"gridPos": {
"h": 8,
- "w": 12,
+ "w": 8,
"x": 0,
- "y": 69
+ "y": 36
},
- "id": 20,
+ "id": 131,
"options": {
"legend": {
"calcs": [
@@ -2210,14 +1556,14 @@
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
- "expr": "sum(banyandb_system_disk{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\",kind=\"used\"}) by (pod_name, path)",
+ "expr": "sum by (node_type, remote_tier, operation)(rate(banyandb_lifecycle_migration_sent_bytes{job=~\"$job\", group=~\"$group\", operation=~\"$operation\"}[$__rate_interval]))",
"instant": false,
- "legendFormat": "{{pod_name}} {{path}}",
+ "legendFormat": "{{node_type}}\u2192{{remote_tier}} {{operation}}",
"range": true,
"refId": "A"
}
],
- "title": "Disk Used by Path",
+ "title": "Bytes/s by Flow",
"type": "timeseries"
},
{
@@ -2225,7 +1571,7 @@
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
- "description": "Disk capacity per node and storage path.",
+ "description": "Per-message replay rate for batch-write migration.",
"fieldConfig": {
"defaults": {
"color": {
@@ -2274,17 +1620,17 @@
}
]
},
- "unit": "bytes"
+ "unit": "ops"
},
"overrides": []
},
"gridPos": {
"h": 8,
- "w": 12,
- "x": 12,
- "y": 69
+ "w": 8,
+ "x": 8,
+ "y": 36
},
- "id": 21,
+ "id": 132,
"options": {
"legend": {
"calcs": [
@@ -2311,14 +1657,14 @@
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
- "expr": "sum(banyandb_system_disk{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\",kind=\"total\"}) by (pod_name, path)",
+ "expr": "sum by (node_type, remote_tier)(rate(banyandb_lifecycle_migration_total_finished{job=~\"$job\", group=~\"$group\", operation=\"batch-write\"}[$__rate_interval]))",
"instant": false,
- "legendFormat": "{{pod_name}} {{path}}",
+ "legendFormat": "{{node_type}}\u2192{{remote_tier}}",
"range": true,
"refId": "A"
}
],
- "title": "Disk Total by Path",
+ "title": "Docs Replayed/s (batch-write)",
"type": "timeseries"
},
{
@@ -2326,7 +1672,7 @@
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
- "description": "Disk used percent per node and storage path.",
+ "description": "Per-part chunked-sync completion rate for file-sync migration.",
"fieldConfig": {
"defaults": {
"color": {
@@ -2372,24 +1718,20 @@
{
"color": "green",
"value": null
- },
- {
- "color": "red",
- "value": 0.8
}
]
},
- "unit": "percentunit"
+ "unit": "ops"
},
"overrides": []
},
"gridPos": {
"h": 8,
- "w": 24,
- "x": 0,
- "y": 77
+ "w": 8,
+ "x": 16,
+ "y": 36
},
- "id": 22,
+ "id": 133,
"options": {
"legend": {
"calcs": [
@@ -2416,14 +1758,14 @@
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
- "expr": "sum(banyandb_system_disk{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\",kind=\"used\"}) by (pod_name, path) / sum(banyandb_system_disk{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\",kind=\"total\"}) by (pod_name, path)",
+ "expr": "sum by (node_type, remote_tier)(rate(banyandb_lifecycle_migration_total_finished{job=~\"$job\", group=~\"$group\", operation=\"file-sync\"}[$__rate_interval]))",
"instant": false,
- "legendFormat": "{{pod_name}} {{path}}",
+ "legendFormat": "{{node_type}}\u2192{{remote_tier}}",
"range": true,
"refId": "A"
}
],
- "title": "Disk Used % by Path",
+ "title": "Parts Synced/s (file-sync)",
"type": "timeseries"
},
{
@@ -2432,11 +1774,11 @@
"h": 1,
"w": 24,
"x": 0,
- "y": 85
+ "y": 44
},
- "id": 50,
+ "id": 140,
"panels": [],
- "title": "Go Runtime",
+ "title": "Latency",
"type": "row"
},
{
@@ -2444,7 +1786,7 @@
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
- "description": "Goroutine count per node.",
+ "description": "p99 of banyandb_lifecycle_migration_total_latency by operation and flow.",
"fieldConfig": {
"defaults": {
"color": {
@@ -2493,7 +1835,7 @@
}
]
},
- "unit": "short"
+ "unit": "s"
},
"overrides": []
},
@@ -2501,9 +1843,9 @@
"h": 8,
"w": 12,
"x": 0,
- "y": 86
+ "y": 45
},
- "id": 51,
+ "id": 141,
"options": {
"legend": {
"calcs": [
@@ -2530,14 +1872,14 @@
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
- "expr": "sum(go_goroutines{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\"}) by (pod_name)",
+ "expr": "histogram_quantile(0.99, sum by (le, operation, node_type, remote_tier)(rate(banyandb_lifecycle_migration_total_latency_bucket{job=~\"$job\", group=~\"$group\", operation=~\"$operation\"}[$__rate_interval])))",
"instant": false,
- "legendFormat": "{{pod_name}}",
+ "legendFormat": "{{node_type}}\u2192{{remote_tier}} {{operation}}",
"range": true,
"refId": "A"
}
],
- "title": "Goroutines",
+ "title": "p99 Migration Latency",
"type": "timeseries"
},
{
@@ -2545,7 +1887,7 @@
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
- "description": "Average GC pause duration per node.",
+ "description": "p99 of banyandb_lifecycle_migration_total_batch_latency per flow.",
"fieldConfig": {
"defaults": {
"color": {
@@ -2602,9 +1944,9 @@
"h": 8,
"w": 12,
"x": 12,
- "y": 86
+ "y": 45
},
- "id": 52,
+ "id": 142,
"options": {
"legend": {
"calcs": [
@@ -2631,22 +1973,35 @@
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
- "expr": "sum(rate(go_gc_duration_seconds_sum{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\"}[$__rate_interval])) by (pod_name) / sum(rate(go_gc_duration_seconds_count{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\"}[$__rate_interval])) by (pod_name)",
+ "expr": "histogram_quantile(0.99, sum by (le, node_type, remote_tier)(rate(banyandb_lifecycle_migration_total_batch_latency_bucket{job=~\"$job\", group=~\"$group\"}[$__rate_interval])))",
"instant": false,
- "legendFormat": "{{pod_name}}",
+ "legendFormat": "{{node_type}}\u2192{{remote_tier}}",
"range": true,
"refId": "A"
}
],
- "title": "GC Pause (avg)",
+ "title": "p99 Batch Latency (batch-write)",
"type": "timeseries"
},
+ {
+ "collapsed": false,
+ "gridPos": {
+ "h": 1,
+ "w": 24,
+ "x": 0,
+ "y": 53
+ },
+ "id": 150,
+ "panels": [],
+ "title": "Errors & Integrity",
+ "type": "row"
+ },
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
- "description": "In-use heap bytes per node.",
+ "description": "Lifecycle migration error rate by error_type and operation. Empty when there are no errors.",
"fieldConfig": {
"defaults": {
"color": {
@@ -2692,10 +2047,14 @@
{
"color": "green",
"value": null
+ },
+ {
+ "color": "red",
+ "value": 1
}
]
},
- "unit": "bytes"
+ "unit": "ops"
},
"overrides": []
},
@@ -2703,9 +2062,9 @@
"h": 8,
"w": 12,
"x": 0,
- "y": 94
+ "y": 54
},
- "id": 53,
+ "id": 151,
"options": {
"legend": {
"calcs": [
@@ -2732,14 +2091,14 @@
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
- "expr": "sum(go_memstats_heap_inuse_bytes{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\"}) by (pod_name)",
+ "expr": "sum by (error_type, operation)(rate(banyandb_lifecycle_migration_total_err{job=~\"$job\", group=~\"$group\"}[$__rate_interval]))",
"instant": false,
- "legendFormat": "{{pod_name}}",
+ "legendFormat": "{{operation}} {{error_type}}",
"range": true,
"refId": "A"
}
],
- "title": "Heap In-Use",
+ "title": "Errors/s by Type",
"type": "timeseries"
},
{
@@ -2747,7 +2106,7 @@
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
- "description": "Heap next-GC threshold and allocation rate per node.",
+ "description": "Publisher-finished minus subscriber-received per operation; should hover at 0. Non-zero means a side dropped or double-counted.",
"fieldConfig": {
"defaults": {
"color": {
@@ -2796,30 +2155,17 @@
}
]
},
- "unit": "bytes"
+ "unit": "ops"
},
- "overrides": [
- {
- "matcher": {
- "id": "byFrameRefID",
- "options": "B"
- },
- "properties": [
- {
- "id": "unit",
- "value": "Bps"
- }
- ]
- }
- ]
+ "overrides": []
},
"gridPos": {
"h": 8,
"w": 12,
"x": 12,
- "y": 94
+ "y": 54
},
- "id": 54,
+ "id": 152,
"options": {
"legend": {
"calcs": [
@@ -2846,9 +2192,9 @@
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
- "expr": "sum(go_memstats_next_gc_bytes{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\"}) by (pod_name)",
+ "expr": "sum(rate(banyandb_lifecycle_migration_total_finished{job=~\"$job\", group=~\"$group\", operation=\"batch-write\"}[$__rate_interval])) - sum(rate(banyandb_queue_sub_total_message_started{job=~\"$job\", group=~\"$group\", remote_role=\"lifecycle\", operation=\"batch-write\"}[$__rate_interval]))",
"instant": false,
- "legendFormat": "next_gc {{pod_name}}",
+ "legendFormat": "batch-write drift",
"range": true,
"refId": "A"
},
@@ -2858,20 +2204,23 @@
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
- "expr": "sum(rate(go_memstats_alloc_bytes_total{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\"}[$__rate_interval])) by (pod_name)",
+ "expr": "sum(rate(banyandb_lifecycle_migration_total_finished{job=~\"$job\", group=~\"$group\", operation=\"file-sync\"}[$__rate_interval])) - sum(rate(banyandb_queue_sub_total_finished{job=~\"$job\", group=~\"$group\", remote_role=\"lifecycle\", operation=\"file-sync\"}[$__rate_interval]))",
"instant": false,
- "legendFormat": "alloc_rate {{pod_name}}",
+ "legendFormat": "file-sync drift",
"range": true,
"refId": "B"
}
],
- "title": "Heap Next-GC / Alloc Rate",
+ "title": "Pub\u2194Sub Drift",
"type": "timeseries"
}
],
+ "refresh": "5m",
"schemaVersion": 39,
"tags": [
"banyandb",
+ "lifecycle",
+ "migration",
"fodc"
],
"templating": {
@@ -2881,13 +2230,11 @@
"hide": 0,
"includeAll": false,
"label": "Prometheus",
- "multi": false,
"name": "DS_PROMETHEUS",
"options": [],
"query": "prometheus",
"refresh": 1,
"regex": "",
- "skipUrlSync": false,
"type": "datasource"
},
{
@@ -2896,21 +2243,20 @@
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
- "definition": "label_values(banyandb_system_up_time,job)",
+ "definition": "label_values(banyandb_lifecycle_migration_total_finished,job)",
"hide": 0,
"includeAll": false,
- "multi": false,
+ "label": "job",
"name": "job",
"options": [],
"query": {
"qryType": 1,
- "query": "label_values(banyandb_system_up_time,job)",
+ "query": "label_values(banyandb_lifecycle_migration_total_finished,job)",
"refId": "PrometheusVariableQueryEditor-VariableQuery"
},
- "refresh": 1,
+ "refresh": 2,
"regex": "",
- "skipUrlSync": false,
- "sort": 0,
+ "sort": 1,
"type": "query"
},
{
@@ -2919,23 +2265,23 @@
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
- "definition": "label_values(banyandb_system_up_time{job=~\"$job\"},container_name)",
+ "definition": "label_values(banyandb_lifecycle_migration_total_finished{job=~\"$job\"},group)",
"hide": 0,
"includeAll": true,
+ "allValue": ".*",
+ "label": "group",
+ "name": "group",
"multi": true,
- "name": "role",
"options": [],
"query": {
"qryType": 1,
- "query": "label_values(banyandb_system_up_time{job=~\"$job\"},container_name)",
+ "query": "label_values(banyandb_lifecycle_migration_total_finished{job=~\"$job\"},group)",
"refId": "PrometheusVariableQueryEditor-VariableQuery"
},
- "refresh": 1,
+ "refresh": 2,
"regex": "",
- "skipUrlSync": false,
"sort": 1,
- "type": "query",
- "allValue": ".+"
+ "type": "query"
},
{
"current": {},
@@ -2943,34 +2289,34 @@
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
- "definition": "label_values(banyandb_system_up_time{job=~\"$job\",container_name=~\"$role\"},pod_name)",
+ "definition": "label_values(banyandb_lifecycle_migration_total_finished{job=~\"$job\"},operation)",
"hide": 0,
"includeAll": true,
+ "allValue": ".*",
+ "label": "operation",
+ "name": "operation",
"multi": true,
- "name": "pod",
"options": [],
"query": {
"qryType": 1,
- "query": "label_values(banyandb_system_up_time{job=~\"$job\",container_name=~\"$role\"},pod_name)",
+ "query": "label_values(banyandb_lifecycle_migration_total_finished{job=~\"$job\"},operation)",
"refId": "PrometheusVariableQueryEditor-VariableQuery"
},
- "refresh": 1,
+ "refresh": 2,
"regex": "",
- "skipUrlSync": false,
"sort": 1,
- "type": "query",
- "allValue": ".+"
+ "type": "query"
}
]
},
"time": {
- "from": "now-24h",
+ "from": "now-2d",
"to": "now"
},
"timepicker": {},
- "timezone": "browser",
- "title": "BanyanDB Cluster — Nodes (FODC Proxy)",
- "uid": "banyandb-fodc-nodes",
+ "timezone": "",
+ "title": "BanyanDB \u2014 Lifecycle Migration",
+ "uid": "banyandb-fodc-migration",
"version": 1,
"weekStart": ""
-}
+}
\ No newline at end of file
diff --git a/docs/operation/grafana-fodc-nodes.json b/docs/operation/grafana-fodc-nodes.json
index 404b0fa91..d80ecc0d5 100644
--- a/docs/operation/grafana-fodc-nodes.json
+++ b/docs/operation/grafana-fodc-nodes.json
@@ -659,7 +659,7 @@
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
- "description": "Streaming pod-to-pod flows (excludes tier-migration
edges; see the Migration Flows panel below for those). Each row is one directed
source→target per (group, operation). Pub side is the publisher container; Sub
side is the receiver. Units are per-second (rate over $__range). Pub msg/s and
Sub msg/s should match side-to-side; a populated Pub cell with an empty Sub
cell is signal — an uninstrumented side or missing scrape target. p99 latencies
are histogram-quantile o [...]
+ "description": "Streaming pod-to-pod flows (excludes tier-migration
edges; see the Migration Flows panel below for those). Each row is one directed
source\u2192target per (group, operation). Pub side is the publisher container;
Sub side is the receiver. Units are per-second (rate over $__range). Pub msg/s
and Sub msg/s should match side-to-side; a populated Pub cell with an empty Sub
cell is signal \u2014 an uninstrumented side or missing scrape target. p99
latencies are histogram- [...]
"fieldConfig": {
"defaults": {
"color": {
@@ -909,7 +909,7 @@
},
"editorMode": "code",
"exemplar": false,
- "expr": "sum by (source, target, operation) (label_replace(label_replace(rate(banyandb_queue_sub_total_started{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\", remote_role!~\"lifecycle\"}[$__range]) or rate(banyandb_queue_sub_total_started{job=~\"$job\", remote_role=~\"$role\", remote_node=~\"($pod)\\\\..*\", remote_role!~\"lifecycle\"}[$__range]), \"source\", \"$1\", \"remote_node\", \"([^.:]+).*\"), \"target\", \"$1\", \"pod_name\", \"(.*)\"))",
+ "expr": "sum by (source, target, operation) (label_replace(label_replace(rate(banyandb_queue_sub_total_message_started{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\", remote_role!~\"lifecycle\"}[$__range]) or rate(banyandb_queue_sub_total_message_started{job=~\"$job\", remote_role=~\"$role\", remote_node=~\"($pod)\\\\..*\", remote_role!~\"lifecycle\"}[$__range]), \"source\", \"$1\", \"remote_node\", \"([^.:]+).*\"), \"target\", \"$1\", \"pod_name\", \"(.*)\"))",
"format": "table",
"instant": true,
"legendFormat": "__auto",
@@ -999,474 +999,6 @@
"legendFormat": "__auto",
"range": false,
"refId": "H"
- }
- ],
- "title": "Flows — Publisher vs Subscriber View",
- "transformations": [
- {
- "id": "merge",
- "options": {}
- },
- {
- "id": "organize",
- "options": {
- "excludeByName": {
- "Time": true
- },
- "indexByName": {
- "source": 0,
- "target": 1,
- "operation": 2,
- "Value #A": 3,
- "Value #B": 4,
- "Value #C": 5,
- "Value #D": 6,
- "Value #E": 7,
- "Value #F": 8,
- "Value #G": 9,
- "Value #H": 10
- },
- "renameByName": {
- "source": "Source",
- "target": "Target",
- "operation": "Operation",
- "Value #A": "Pub msg/s",
- "Value #B": "Sub msg/s",
- "Value #C": "Pub p99",
- "Value #D": "Sub p99",
- "Value #E": "Pub err/s",
- "Value #F": "Sub err/s",
- "Value #G": "Pub B/s",
- "Value #H": "Sub B/s"
- }
- }
- }
- ],
- "type": "table"
- },
- {
- "datasource": {
- "type": "prometheus",
- "uid": "${DS_PROMETHEUS}"
- },
- "description": "Cross-tier lifecycle migration flows (hot→warm,
warm→cold). Each row is one directed source→target per (group, operation). Pub
side is the lifecycle sidecar inside the SOURCE data pod (publishes via
banyandb_lifecycle_migration_*); Sub side is the data pod RECEIVING (records
via banyandb_queue_sub_total_started{remote_role=\"lifecycle\"}).
Counts/bytes/errors are per-day (rate × 86400) because tier migration is a
daily-batch workload and an instant per-second rate i [...]
- "fieldConfig": {
- "defaults": {
- "color": {
- "mode": "thresholds"
- },
- "custom": {
- "align": "auto",
- "cellOptions": {
- "type": "auto"
- },
- "filterable": false,
- "inspect": false
- },
- "decimals": 2,
- "mappings": [
- {
- "type": "special",
- "options": {
- "match": "nan",
- "result": {
- "text": "-",
- "index": 0
- }
- }
- }
- ],
- "noValue": "-",
- "thresholds": {
- "mode": "absolute",
- "steps": [
- {
- "color": "text",
- "value": null
- }
- ]
- },
- "unit": "short",
- "overrides": [
- {
- "matcher": {
- "id": "byName",
- "options": "Pub msg/day"
- },
- "properties": [
- {
- "id": "unit",
- "value": "short"
- },
- {
- "id": "decimals",
- "value": 2
- }
- ]
- },
- {
- "matcher": {
- "id": "byName",
- "options": "Sub msg/day"
- },
- "properties": [
- {
- "id": "unit",
- "value": "short"
- },
- {
- "id": "decimals",
- "value": 2
- }
- ]
- },
- {
- "matcher": {
- "id": "byName",
- "options": "Pub err/day"
- },
- "properties": [
- {
- "id": "unit",
- "value": "short"
- },
- {
- "id": "decimals",
- "value": 2
- },
- {
- "id": "thresholds",
- "value": {
- "mode": "absolute",
- "steps": [
- {
- "color": "text",
- "value": null
- },
- {
- "color": "red",
- "value": 1e-06
- }
- ]
- }
- },
- {
- "id": "custom.cellOptions",
- "value": {
- "type": "color-text"
- }
- }
- ]
- },
- {
- "matcher": {
- "id": "byName",
- "options": "Sub err/day"
- },
- "properties": [
- {
- "id": "unit",
- "value": "short"
- },
- {
- "id": "decimals",
- "value": 2
- },
- {
- "id": "thresholds",
- "value": {
- "mode": "absolute",
- "steps": [
- {
- "color": "text",
- "value": null
- },
- {
- "color": "red",
- "value": 1e-06
- }
- ]
- }
- },
- {
- "id": "custom.cellOptions",
- "value": {
- "type": "color-text"
- }
- }
- ]
- },
- {
- "matcher": {
- "id": "byName",
- "options": "Pub B/day"
- },
- "properties": [
- {
- "id": "unit",
- "value": "bytes"
- }
- ]
- },
- {
- "matcher": {
- "id": "byName",
- "options": "Sub B/day"
- },
- "properties": [
- {
- "id": "unit",
- "value": "bytes"
- }
- ]
- }
- ]
- },
- "overrides": [
- {
- "matcher": {
- "id": "byName",
- "options": "Pub p99"
- },
- "properties": [
- {
- "id": "unit",
- "value": "s"
- },
- {
- "id": "decimals",
- "value": 3
- }
- ]
- },
- {
- "matcher": {
- "id": "byName",
- "options": "Sub p99"
- },
- "properties": [
- {
- "id": "unit",
- "value": "s"
- },
- {
- "id": "decimals",
- "value": 3
- }
- ]
- },
- {
- "matcher": {
- "id": "byName",
- "options": "Pub B/s"
- },
- "properties": [
- {
- "id": "unit",
- "value": "Bps"
- }
- ]
- },
- {
- "matcher": {
- "id": "byName",
- "options": "Sub B/s"
- },
- "properties": [
- {
- "id": "unit",
- "value": "Bps"
- }
- ]
- },
- {
- "matcher": {
- "id": "byName",
- "options": "Pub err/s"
- },
- "properties": [
- {
- "id": "unit",
- "value": "ops"
- },
- {
- "id": "custom.cellOptions",
- "value": {
- "type": "color-text"
- }
- },
- {
- "id": "thresholds",
- "value": {
- "mode": "absolute",
- "steps": [
- {
- "color": "text",
- "value": null
- },
- {
- "color": "red",
- "value": 1e-06
- }
- ]
- }
- }
- ]
- },
- {
- "matcher": {
- "id": "byName",
- "options": "Sub err/s"
- },
- "properties": [
- {
- "id": "unit",
- "value": "ops"
- },
- {
- "id": "custom.cellOptions",
- "value": {
- "type": "color-text"
- }
- },
- {
- "id": "thresholds",
- "value": {
- "mode": "absolute",
- "steps": [
- {
- "color": "text",
- "value": null
- },
- {
- "color": "red",
- "value": 1e-06
- }
- ]
- }
- }
- ]
- },
- {
- "matcher": {
- "id": "byName",
- "options": "Source"
- },
- "properties": [
- {
- "id": "custom.filterable",
- "value": true
- }
- ]
- },
- {
- "matcher": {
- "id": "byName",
- "options": "Target"
- },
- "properties": [
- {
- "id": "custom.filterable",
- "value": true
- }
- ]
- },
- {
- "matcher": {
- "id": "byName",
- "options": "Operation"
- },
- "properties": [
- {
- "id": "custom.filterable",
- "value": true
- }
- ]
- }
- ]
- },
- "gridPos": {
- "h": 13,
- "w": 24,
- "x": 0,
- "y": 30
- },
- "id": 62,
- "options": {
- "cellHeight": "sm",
- "footer": {
- "countRows": false,
- "fields": "",
- "reducer": [
- "sum"
- ],
- "show": false
- },
- "showHeader": true,
- "sortBy": [
- {
- "desc": true,
- "displayName": "Pub msg/s"
- }
- ]
- },
- "pluginVersion": "11.2.0",
- "targets": [
- {
- "datasource": {
- "type": "prometheus",
- "uid": "${DS_PROMETHEUS}"
- },
- "editorMode": "code",
- "exemplar": false,
- "format": "table",
- "instant": true,
- "legendFormat": "__auto",
- "range": false,
- "refId": "A",
- "expr": "sum by (source, target, operation) (label_replace(label_replace(rate(banyandb_lifecycle_migration_total_finished{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\"}[$__range]) * 86400 or rate(banyandb_lifecycle_migration_total_finished{job=~\"$job\", remote_role=~\"$role\", remote_node=~\"($pod)\\\\..*\"}[$__range]) * 86400, \"source\", \"$1\", \"pod_name\", \"(.*)\"), \"target\", \"$1\", \"remote_node\", \"([^.:]+).*\"))"
- },
- {
- "datasource": {
- "type": "prometheus",
- "uid": "${DS_PROMETHEUS}"
- },
- "editorMode": "code",
- "exemplar": false,
- "format": "table",
- "instant": true,
- "legendFormat": "__auto",
- "range": false,
- "refId": "B",
- "expr": "sum by (source, target, operation) (label_replace(label_replace(rate(banyandb_queue_sub_total_started{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\", remote_role=\"lifecycle\"}[$__range]) * 86400 or rate(banyandb_queue_sub_total_started{job=~\"$job\", remote_role=\"lifecycle\", remote_node=~\"($pod)\\\\..*\"}[$__range]) * 86400, \"source\", \"$1\", \"remote_node\", \"([^.:]+).*\"), \"target\", \"$1\", \"pod_name\", \"(.*)\"))"
- },
- {
- "datasource": {
- "type": "prometheus",
- "uid": "${DS_PROMETHEUS}"
- },
- "editorMode": "code",
- "exemplar": false,
- "format": "table",
- "instant": true,
- "legendFormat": "__auto",
- "range": false,
- "refId": "C",
- "expr": "histogram_quantile(0.99, sum by (le, source, target, operation) (label_replace(label_replace(rate(banyandb_lifecycle_migration_total_latency_bucket{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\"}[$__range]) or rate(banyandb_lifecycle_migration_total_latency_bucket{job=~\"$job\", remote_role=~\"$role\", remote_node=~\"($pod)\\\\..*\"}[$__range]), \"source\", \"$1\", \"pod_name\", \"(.*)\"), \"target\", \"$1\", \"remote_node\", \"([^.:]+).*\")))"
- },
- {
- "datasource": {
- "type": "prometheus",
- "uid": "${DS_PROMETHEUS}"
- },
- "editorMode": "code",
- "exemplar": false,
- "format": "table",
- "instant": true,
- "legendFormat": "__auto",
- "range": false,
- "refId": "D",
- "expr": "histogram_quantile(0.99, sum by (le, source, target, operation) (label_replace(label_replace(rate(banyandb_queue_sub_total_latency_bucket{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\", remote_role=\"lifecycle\"}[$__range]) or rate(banyandb_queue_sub_total_latency_bucket{job=~\"$job\", remote_role=\"lifecycle\", remote_node=~\"($pod)\\\\..*\"}[$__range]), \"source\", \"$1\", \"remote_node\", \"([^.:]+).*\"), \"target\", \"$1\", \"pod_name\", \"(.*)\")))"
},
{
"datasource": {
@@ -1475,12 +1007,12 @@
},
"editorMode": "code",
"exemplar": false,
+ "expr": "sum by (source, target, operation) (label_replace(label_replace(rate(banyandb_queue_pub_total_batch_finished{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\"}[$__range]) or rate(banyandb_queue_pub_total_batch_finished{job=~\"$job\", remote_role=~\"$role\", remote_node=~\"($pod)\\\\..*\"}[$__range]), \"source\", \"$1\", \"pod_name\", \"(.*)\"), \"target\", \"$1\", \"remote_node\", \"([^.:]+).*\"))",
"format": "table",
"instant": true,
"legendFormat": "__auto",
"range": false,
- "refId": "E",
- "expr": "sum by (source, target, operation) (label_replace(label_replace(rate(banyandb_lifecycle_migration_total_err{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\"}[$__range]) * 86400 or rate(banyandb_lifecycle_migration_total_err{job=~\"$job\", remote_role=~\"$role\", remote_node=~\"($pod)\\\\..*\"}[$__range]) * 86400, \"source\", \"$1\", \"pod_name\", \"(.*)\"), \"target\", \"$1\", \"remote_node\", \"([^.:]+).*\"))"
+ "refId": "I"
},
{
"datasource": {
@@ -1489,43 +1021,15 @@
},
"editorMode": "code",
"exemplar": false,
+ "expr": "sum by (source, target, operation) (label_replace(label_replace(rate(banyandb_queue_sub_total_batch_started{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\", remote_role!~\"lifecycle\"}[$__range]) or rate(banyandb_queue_sub_total_batch_started{job=~\"$job\", remote_role=~\"$role\", remote_node=~\"($pod)\\\\..*\", remote_role!~\"lifecycle\"}[$__range]), \"source\", \"$1\", \"remote_node\", \"([^.:]+).*\"), \"target\", \"$1\", \"pod_name\", \"(.*)\"))",
"format": "table",
"instant": true,
"legendFormat": "__auto",
"range": false,
- "refId": "F",
- "expr": "sum by (source, target, operation) (label_replace(label_replace(rate(banyandb_queue_sub_total_err{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\", remote_role=\"lifecycle\"}[$__range]) * 86400 or rate(banyandb_queue_sub_total_err{job=~\"$job\", remote_role=\"lifecycle\", remote_node=~\"($pod)\\\\..*\"}[$__range]) * 86400, \"source\", \"$1\", \"remote_node\", \"([^.:]+).*\"), \"target\", \"$1\", \"pod_name\", \"(.*)\"))"
- },
- {
- "datasource": {
- "type": "prometheus",
- "uid": "${DS_PROMETHEUS}"
- },
- "editorMode": "code",
- "exemplar": false,
- "format": "table",
- "instant": true,
- "legendFormat": "__auto",
- "range": false,
- "refId": "G",
- "expr": "sum by (source, target, operation) (label_replace(label_replace(rate(banyandb_lifecycle_migration_sent_bytes{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\"}[$__range]) * 86400 or rate(banyandb_lifecycle_migration_sent_bytes{job=~\"$job\", remote_role=~\"$role\", remote_node=~\"($pod)\\\\..*\"}[$__range]) * 86400, \"source\", \"$1\", \"pod_name\", \"(.*)\"), \"target\", \"$1\", \"remote_node\", \"([^.:]+).*\"))"
- },
- {
- "datasource": {
- "type": "prometheus",
- "uid": "${DS_PROMETHEUS}"
- },
- "editorMode": "code",
- "exemplar": false,
- "format": "table",
- "instant": true,
- "legendFormat": "__auto",
- "range": false,
- "refId": "H",
- "expr": "sum by (source, target, operation) (label_replace(label_replace(rate(banyandb_queue_sub_received_bytes{job=~\"$job\", container_name=~\"$role\", pod_name=~\"$pod\", remote_role=\"lifecycle\"}[$__range]) * 86400 or rate(banyandb_queue_sub_received_bytes{job=~\"$job\", remote_role=\"lifecycle\", remote_node=~\"($pod)\\\\..*\"}[$__range]) * 86400, \"source\", \"$1\", \"remote_node\", \"([^.:]+).*\"), \"target\", \"$1\", \"pod_name\", \"(.*)\"))"
+ "refId": "J"
}
],
- "title": "Migration Flows — Tier Migrations (lifecycle)",
+ "title": "Flows \u2014 Publisher vs Subscriber View",
"transformations": [
{
"id": "merge",
@@ -1548,20 +1052,24 @@
"Value #E": 7,
"Value #F": 8,
"Value #G": 9,
- "Value #H": 10
+ "Value #H": 10,
+ "Value #I": 11,
+ "Value #J": 12
},
"renameByName": {
"source": "Source",
"target": "Target",
"operation": "Operation",
- "Value #A": "Pub msg/day",
- "Value #B": "Sub msg/day",
+ "Value #A": "Pub msg/s",
+ "Value #B": "Sub msg/s",
"Value #C": "Pub p99",
"Value #D": "Sub p99",
- "Value #E": "Pub err/day",
- "Value #F": "Sub err/day",
- "Value #G": "Pub B/day",
- "Value #H": "Sub B/day"
+ "Value #E": "Pub err/s",
+ "Value #F": "Sub err/s",
+ "Value #G": "Pub B/s",
+ "Value #H": "Sub B/s",
+ "Value #I": "Pub batch/s",
+ "Value #J": "Sub batch/s"
}
}
}
@@ -1574,7 +1082,7 @@
"h": 1,
"w": 24,
"x": 0,
- "y": 43
+ "y": 30
},
"id": 13,
"panels": [],
@@ -1643,7 +1151,7 @@
"h": 8,
"w": 12,
"x": 0,
- "y": 44
+ "y": 31
},
"id": 14,
"options": {
@@ -1744,7 +1252,7 @@
"h": 8,
"w": 12,
"x": 12,
- "y": 44
+ "y": 31
},
"id": 15,
"options": {
@@ -1849,7 +1357,7 @@
"h": 8,
"w": 12,
"x": 0,
- "y": 52
+ "y": 39
},
"id": 16,
"options": {
@@ -1954,7 +1462,7 @@
"h": 8,
"w": 12,
"x": 12,
- "y": 52
+ "y": 39
},
"id": 17,
"options": {
@@ -2055,7 +1563,7 @@
"h": 8,
"w": 24,
"x": 0,
- "y": 60
+ "y": 47
},
"id": 18,
"options": {
@@ -2112,7 +1620,7 @@
"h": 1,
"w": 24,
"x": 0,
- "y": 68
+ "y": 55
},
"id": 19,
"panels": [],
@@ -2181,7 +1689,7 @@
"h": 8,
"w": 12,
"x": 0,
- "y": 69
+ "y": 56
},
"id": 20,
"options": {
@@ -2282,7 +1790,7 @@
"h": 8,
"w": 12,
"x": 12,
- "y": 69
+ "y": 56
},
"id": 21,
"options": {
@@ -2387,7 +1895,7 @@
"h": 8,
"w": 24,
"x": 0,
- "y": 77
+ "y": 64
},
"id": 22,
"options": {
@@ -2432,7 +1940,7 @@
"h": 1,
"w": 24,
"x": 0,
- "y": 85
+ "y": 72
},
"id": 50,
"panels": [],
@@ -2501,7 +2009,7 @@
"h": 8,
"w": 12,
"x": 0,
- "y": 86
+ "y": 73
},
"id": 51,
"options": {
@@ -2602,7 +2110,7 @@
"h": 8,
"w": 12,
"x": 12,
- "y": 86
+ "y": 73
},
"id": 52,
"options": {
@@ -2703,7 +2211,7 @@
"h": 8,
"w": 12,
"x": 0,
- "y": 94
+ "y": 81
},
"id": 53,
"options": {
@@ -2817,7 +2325,7 @@
"h": 8,
"w": 12,
"x": 12,
- "y": 94
+ "y": 81
},
"id": 54,
"options": {
@@ -2969,8 +2477,8 @@
},
"timepicker": {},
"timezone": "browser",
- "title": "BanyanDB Cluster — Nodes (FODC Proxy)",
+ "title": "BanyanDB Cluster \u2014 Nodes (FODC Proxy)",
"uid": "banyandb-fodc-nodes",
"version": 1,
"weekStart": ""
-}
+}
\ No newline at end of file
diff --git a/test/cases/lifecycle/lifecycle.go b/test/cases/lifecycle/lifecycle.go
index 7271f941e..11147bd2e 100644
--- a/test/cases/lifecycle/lifecycle.go
+++ b/test/cases/lifecycle/lifecycle.go
@@ -263,7 +263,7 @@ func verifyMigrationMetrics(reg observability.MetricsRegistry) {
// names now carry labels — Prometheus' exposition format sorts label
// names alphabetically (group, remote_node, remote_role, remote_tier),
// and the regex requires all four to be present so a regression to
- // the unlabeld form fails the regex. The label block is captured as
+ // the unlabeled form fails the regex. The label block is captured as
// a single `[^}]*` then each required label is asserted with a
// lookbehind-style positive check; an explicit per-label regex would
// be more readable but the leading-label-alphabetical-ordering
@@ -320,10 +320,10 @@ func verifyMigrationMetrics(reg observability.MetricsRegistry) {
// either endpoint is valid. We check both, accepting whichever responds.
//
// At-least-one check: every runLifecycleMigration invocation passes
-// --grpc-addr=SharedContext.DataAddr (the hot data node), so deriveSelfIdentity
+// --grpc-addr=SharedContext.DataAddr (the hot data node), so resolveSelfIdentity
// resolves the sender through the registry's GrpcAddress match and the stamped
// tier is the hot node's `type` label. The lifecycle service waits for the
-// co-located node to become visible in the registry before deriving, so the
+// co-located node to become visible in the registry before resolving, so the
// match is deterministic. The assertion requires AT LEAST ONE
// banyandb_queue_sub_total_finished series to carry the populated labels,
// proving the SetSelfNode fix is wired end-to-end.
@@ -655,16 +655,6 @@ func crossSegmentTimestamps() (single, left, right time.Time) {
return crossSrcStart.Add(-12 * time.Hour), crossSrcStart.Add(12 * time.Hour), crossSrcStart.Add(36 * time.Hour)
}
-// runLifecycleMigration runs a single hot->warm lifecycle migration, pointing
-// every root path at the shared source dir and writing its report to reportDir.
-// It returns the command's metrics registry so callers can verify the emitted
-// banyandb_lifecycle_migration_* family.
-//
-// The migration publisher derives its sender identity (sender_node, sender_role,
-// sender_tier) from the data-node registry and the lifecycle's own --node-labels
-// at runtime — no extra CLI flags needed beyond what the test setup already
-// passes via SharedContext.MetadataFlags. See deriveSelfIdentity in
-// banyand/backup/lifecycle/steps.go for the resolution rules.
// runLifecycleMigration runs a single hot->warm lifecycle migration, pointing
// the lifecycle service at the co-located data node. Returns the MetricsRegistry
// the lifecycle service registered its metrics with so the test can scrape them.
diff --git a/test/e2e-v2/cases/fodc/metrics/documented_gap.txt b/test/e2e-v2/cases/fodc/metrics/documented_gap.txt
index 3bf4b494c..f8b81b2ad 100644
--- a/test/e2e-v2/cases/fodc/metrics/documented_gap.txt
+++ b/test/e2e-v2/cases/fodc/metrics/documented_gap.txt
@@ -40,9 +40,12 @@ banyandb_trace_tst_total_merged_parts
#
# Group 3 -- lifecycle metrics. The lifecycle service is not deployed in the
# e2e test cluster, so these metrics are never exported within the verify window.
+banyandb_lifecycle_cycles_total
banyandb_lifecycle_last_run_timestamp_seconds
banyandb_lifecycle_last_run_success
banyandb_lifecycle_migration_sent_bytes
+banyandb_lifecycle_migration_total_batch_finished
+banyandb_lifecycle_migration_total_batch_latency_bucket
banyandb_lifecycle_migration_total_err
banyandb_lifecycle_migration_total_finished
banyandb_lifecycle_migration_total_latency_bucket
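
Reviewer note (illustrative, not part of the commit): the comment in verifyMigrationMetrics describes capturing the label block of an exposition line as a single `[^}]*` and then asserting each required label separately. A minimal sketch of that idea follows, assuming the four label names listed in that comment. The helper name `hasRequiredLabels`, the sample metric line, and the label values are hypothetical and are not taken from the repository. Go's regexp package has no lookaround, so each label is checked with its own small pattern against the captured block.

```go
package main

import (
	"fmt"
	"regexp"
)

// requiredLabels mirrors the four names listed in the test comment.
var requiredLabels = []string{"group", "remote_node", "remote_role", "remote_tier"}

// labelBlock captures the metric name and everything between the braces.
// Assumption: label values contain no '}' characters, which keeps [^}]* safe.
var labelBlock = regexp.MustCompile(`^([a-zA-Z_:][a-zA-Z0-9_:]*)\{([^}]*)\}\s+\S+`)

// hasRequiredLabels (hypothetical helper) reports whether line is a sample of
// metric that carries every name in required with a non-empty value.
func hasRequiredLabels(line, metric string, required []string) bool {
	m := labelBlock.FindStringSubmatch(line)
	if m == nil || m[1] != metric {
		return false // unlabeled form, or a different metric
	}
	for _, name := range required {
		// Anchor on block start or a comma so "role" cannot match inside "remote_role".
		re := regexp.MustCompile(`(^|,)` + regexp.QuoteMeta(name) + `="[^"]+"`)
		if !re.MatchString(m[2]) {
			return false
		}
	}
	return true
}

func main() {
	const metric = "banyandb_lifecycle_migration_total_finished"
	labeled := metric + `{group="g1",remote_node="warm-0",remote_role="data",remote_tier="warm"} 3`
	unlabeled := metric + ` 3`
	fmt.Println(hasRequiredLabels(labeled, metric, requiredLabels))
	fmt.Println(hasRequiredLabels(unlabeled, metric, requiredLabels))
}
```

Checking each label independently avoids depending on the alphabetical label ordering of the exposition format, at the cost of one extra pattern per label.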