(skywalking-banyandb) 01/01: Add FODC overview doc for end users.

wusheng Wed, 10 Dec 2025 23:13:05 -0800

This is an automated email from the ASF dual-hosted git repository.

wusheng pushed a commit to branch fodc
in repository https://gitbox.apache.org/repos/asf/skywalking-banyandb.git


commit c0acf064bd8132eb99b317d8825584377b44eebf
Author: Wu Sheng <[email protected]>
AuthorDate: Thu Dec 11 15:11:35 2025 +0800

    Add FODC overview doc for end users.
---
 docs/menu.yml                   |   4 +
 docs/operation/fodc/overview.md | 274 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 278 insertions(+)

diff --git a/docs/menu.yml b/docs/menu.yml
index 78391ba0..d3e13a48 100644
--- a/docs/menu.yml
+++ b/docs/menu.yml
@@ -149,6 +149,10 @@ catalog:
             path: "/operation/mcp/inspector"
           - name: "Build and Package"
             path: "/operation/mcp/build"
+      - name: "First Occurrence Data Collection (FODC)"
+        catalog:
+          - name: "Overview"
+            path: "/operation/fodc/overview"
       - name: "Property Background Repair"
         path: "/concept/property-repair"
       - name: "Benchmark"
diff --git a/docs/operation/fodc/overview.md b/docs/operation/fodc/overview.md
new file mode 100644
index 00000000..b6965746
--- /dev/null
+++ b/docs/operation/fodc/overview.md
@@ -0,0 +1,274 @@
+```markdown
+# First Occurrence Data Collection (FODC)
+
+First Occurrence Data Collection (FODC) is an observability and diagnostics 
subsystem for BanyanDB.  
+It continuously collects runtime parameters, performance indicators, node 
states, and configuration
+data from DB nodes, and also supports **on‑demand** performance profiling and 
memory snapshots.
+
+FODC has two primary goals:
+
+1. Ensure the stable operation of DB nodes across all lifecycle stages 
(bootstrap, steady state, scaling, failure, etc.).
+2. Provide trustworthy, structured data that supports capacity planning, 
performance analysis, and troubleshooting.
+
+FODC adopts a **Proxy + Agent** deployment model and exposes a unified, 
ecosystem‑friendly data interface to external systems
+(such as Prometheus and other observability platforms).
+
+---
+
+## Overview
+
+FODC provides multiple categories of data:
+
+1. **Metric collection and short‑term caching** (small time window)
+2. **Node topology and status**, including runtime parameters and role states
+3. **Node configuration collection**
+4. **On‑demand performance profiling and memory snapshots**
+
+To accomplish this, FODC is deployed as:
+
+- A **central Proxy** service
+- Multiple **Agents**, typically co‑located with BanyanDB nodes (sidecar 
pattern)
+
+Agents connect to the Proxy via gRPC and register themselves. The Proxy then:
+
+- Aggregates and normalizes data from all agents
+- Exposes unified REST/Prometheus‑style interfaces
+- Issues **on‑demand diagnostic commands** (profiling, snapshots, config 
capture, etc.) to one or more agents
+
+---
+
+## Architecture
+
+### Deployment Model
+
+FODC uses a **one‑to‑one mapping** between an Agent and a BanyanDB node:
+
+- Each **liaison** node has one FODC Agent
+- Each **data node** (hot / warm / cold) has one FODC Agent
+- In manual deployment modes, the same 1:1 relationship must be preserved
+
+Agents are typically deployed as **sidecars** in the same pod or host as the 
corresponding BanyanDB node.
+
+### Proxy–Agent Relationship
+
+The Proxy acts as the **control plane and data aggregator**, while Agents act 
as the **data plane** local to each DB node.
+
+Key characteristics:
+
+- Agents connect to the Proxy using **gRPC** and a configured **Proxy domain 
name**.
+- The connection is **bi‑directional**:
+  - Agents stream node metrics, status, and configuration to the Proxy.
+  - The Proxy sends on‑demand diagnostic commands back to the Agents.
+
+#### ASCII Architecture Diagram
+
+```text
+                 +-----------------------------------------+
+                 |              FODC Proxy                 |
+                 |-----------------------------------------|
+                 |  - Agent registry                       |
+                 |  - Cluster topology view                |
+External         |  - Aggregated metrics (/metrics         |   
+Clients &  <---- |    and /metrics-windows)                | 
+Ecosystem        |  - Config view (/cluster/config)        |
+                 |  - On-demand control APIs               |
+                 +-----------------^-----------------------+
+                                   |
+                          gRPC bi-directional streams
+                                   |
+        -----------------------------------------------------------------
+        |                               |                               |
+        v                               v                               v
++------------------+          +------------------+            
+------------------+
+|   FODC Agent     |          |   FODC Agent     |            |   FODC Agent   
  |
+|  (sidecar with   |          |  (sidecar with   |            |  (sidecar with 
  |
+|  liaison node)   |          | data node - hot) |            | data node - 
warm)|
+|------------------|          |------------------|            
|------------------|
+| - Scrape local   |          | - Scrape local   |            | - Scrape local 
  |
+|   Prometheus     |          |   Prometheus     |            |   Prometheus   
  |
+|   metrics        |          |   metrics        |            |   metrics      
  |
+| - Collect OS &   |          | - KTM / OS obs   |            | - KTM / OS obs 
  |
+|   KTM telemetry  |          |   metrics        |            |   metrics      
  |
+| - Execute on-    |          | - Execute on-    |            | - Execute on-  
  |
+|   demand profile |          |   demand profile |            |   demand 
profile |
+|   & heap dump    |          |   & heap dump    |            |   & heap dump  
  |
++--------^---------+          +--------^---------+            
+--------^---------+
+         |                               |                               |
+         |                               |                               |
+  +------+--------+               +------+--------+               
+------+--------+
+  | BanyanDB      |               | BanyanDB      |               | BanyanDB   
   |
+  | liaison node  |               | data node     |               | data node  
   |
+  | (process)     |               | (hot tier)    |               | (warm 
tier)   |
+  +---------------+               +---------------+               
+---------------+
+```
+
+> Additional data node tiers (e.g. `datanode-cold`) follow the same **BanyanDB 
node ↔ FODC Agent** 1:1 pattern.
+
+---
+
+## Metric Collection and Prometheus Integration
+
+### Data Sources
+
+Each FODC Agent collects metrics from:
+
+1. **The local BanyanDB node**
+    - Via its Prometheus `/metrics` HTTP endpoint
+    - Includes DB performance, internal queues, I/O stats, query latency, etc.
+2. **Kernel & OS‑level telemetry**
+    - Through an integrated **Kernel Telemetry Module (KTM)** powered by eBPF
+    - Examples: OS page cache statistics, system I/O latency, CPU scheduling 
behavior
+
+### In‑Memory Sliding Window Cache
+
+Agents maintain a **sliding window** of recent metric samples in memory:
+
+- A **wake‑up queue** is used to buffer the last **N** collections.
+- The time window is **auto‑tuned** at startup based on:
+    - Sample interval
+    - Number of metrics
+    - Available memory constraints
+- Target memory usage is kept low (around **30 MB** per Agent) while still 
supporting:
+    - Short‑term trend analysis
+    - Correlation during incident triage (e.g. spikes around first occurrence)
+
+This design allows FODC to provide recent time‑series context **without 
depending on an external TSDB**.
+
+### Agent Metric Exposure
+
+Each FODC Agent exposes a **Prometheus‑compatible** endpoint:
+
+- `GET /metrics`
+    - Returns the **latest** scraped metrics and local telemetry
+    - Can be scraped directly by:
+        - The FODC Proxy
+        - External observability systems (if desired and authorized)
+
+### Proxy Metric Exposure
+
+The FODC Proxy provides aggregated / enriched metric endpoints:
+
+- `GET /metrics`
+    - Proxy’s own metrics (health, number of agents, RPC latency, etc.)
+- `GET /metrics-windows`
+    - Returns metrics **within the maintained time window** for all known 
agents
+    - Includes **additional node metadata**, such as:
+        - Node role (`liaison`, `datanode-hot`, `datanode-warm`, 
`datanode-cold`, etc.)
+        - Node IDs and cluster membership
+        - Location / shard information (if available)
+
+This makes FODC a **drop‑in component** for Prometheus‑based ecosystems, while 
preserving richer semantic context about each node.
+
+---
+
+## Cluster Topology, Roles, and Runtime State
+
+The FODC Proxy maintains an up‑to‑date view of cluster topology based on 
**Agent registration**:
+
+1. On startup, each Agent:
+    - Connects to the Proxy via gRPC
+    - Registers its:
+        - Node ID and role
+        - Basic runtime attributes and capabilities
+2. The Proxy aggregates these registrations into a **logical cluster hierarchy 
structure**.
+
+### Topology & Status API
+
+The Proxy exposes a unified cluster discovery endpoint:
+
+- `GET /cluster`
+    - Returns the list of all registered nodes
+    - Includes:
+        - Node identity (ID, name, address)
+        - Role:
+            - `liaison`
+            - `datanode-hot`
+            - `datanode-warm`
+            - `datanode-cold`
+            - (extensible for future roles)
+        - Agent status (online/offline, last heartbeat time)
+        - Key runtime indicators (optional: load, health, etc.)
+
+This simplifies integration with:
+
+- Cluster dashboards
+- Automated operations (e.g. scheduled checks before resharding / scaling)
+- Higher-level diagnostics tooling that needs a consistent cluster graph
+
+---
+
+## Node Configuration Collection
+
+FODC also collects and exposes static and dynamic configuration for each node.
+
+### What Is Collected
+
+Typical configuration categories include:
+
+- **Startup parameters**
+    - Command‑line flags
+    - Environment variables (where permitted)
+- **Affected configurations**
+    - Configurations used by the database node in the runtime.
+
+### Configuration API
+
+The FODC Proxy aggregates configuration from all Agents and exposes it via:
+
+- `GET /cluster/config`
+    - Returns configuration for all nodes in the cluster
+- Potential filters (implementation‑dependent):
+    - `GET /cluster/config?node_id=<id>`
+    - `GET /cluster/config?role=datanode-hot`
+
+This allows:
+
+- Quickly verifying configuration consistency across nodes and tiers
+- Comparing pre‑ and post‑incident configuration states
+- Supporting automated configuration audits and drift detection
+
+---
+
+## On‑Demand Performance Profiling and Memory Snapshots
+
+On‑demand diagnostics are the **first non read‑only capability** exposed by 
FODC.
+They enable deep performance analysis while carefully controlling overhead.
+
+### Design Principles
+
+- **Opt‑in and controlled**  
+  Diagnostic actions are only triggered through explicit API calls to the 
Proxy.
+- **Local execution, remote control**  
+  Agents perform the heavy work (profiling, snapshots) on the local node; the 
Proxy only orchestrates.
+- **Low default footprint**  
+  By default, Agents run in **low CPU / low memory** mode and do not perform 
expensive diagnostics.
+- **Burst resource usage when needed**  
+  Extra CPU/memory budget is mainly consumed **only during active profiling or 
snapshot sessions**.
+
+### Typical On‑Demand Actions
+
+Exact APIs may vary by implementation, but commonly include:
+
+- **CPU profiling**
+    - Short‑term CPU usage profiling (e.g. pprof)
+    - Useful for identifying hot code paths under load
+- **Heap / memory snapshots**
+    - Captures heap allocation state for leak or fragmentation analysis
+- **I/O / lock contention profiling**
+    - Optional profiling of DB internal lock contention or I/O stalls
+- **Configuration snapshot on demand**
+    - Force a re‑capture of configuration and runtime flags at a specific 
point in time
+- **Apply RBAC or other authorization controls on Proxy APIs**
+
+---
+
+## Summary
+
+FODC provides:
+
+- **Unified, structured observability** for BanyanDB clusters (metrics, 
topology, configuration)
+- **Prometheus‑friendly** interfaces for easy ecosystem integration
+- **On‑demand deep diagnostics** (profiling, memory snapshots) orchestrated 
centrally but executed locally
+- A lightweight, extensible **Proxy + Agent** architecture that respects 
resource constraints
+
+This makes FODC a foundational component for reliable operation, performance 
analysis, and automated troubleshooting of BanyanDB deployments.
\ No newline at end of file

(skywalking-banyandb) 01/01: Add FODC overview doc for end users.

Reply via email to