wu-sheng commented on code in PR #911:
URL:
https://github.com/apache/skywalking-banyandb/pull/911#discussion_r2638994161
##########
docs/design/fodc/proxy.md:
##########
@@ -0,0 +1,1278 @@
+# FODC Proxy Development Design
+
+## Overview
+
+The FODC Proxy is the central control plane and data aggregator for the First
Occurrence Data Collection (FODC) infrastructure. It acts as a unified gateway
that aggregates observability data from multiple FODC Agents (each co-located
with a BanyanDB node) and exposes ecosystem-friendly interfaces to external
systems such as Prometheus and other observability platforms.
+
+The Proxy provides:
+
+1. **Agent Management**: Registration, health monitoring, and lifecycle
management of FODC Agents
+2. **Metrics Aggregation**: Collects and aggregates metrics from all agents
with enriched metadata
+3. **Cluster Topology**: Maintains an up-to-date view of cluster topology,
roles, and node states
+4. **Configuration Collection**: Aggregates and exposes node configurations
for consistency verification
+
+### Responsibilities
+
+**FODC Proxy Core Responsibilities**
+- Accept bi-directional gRPC connections from FODC Agents
+- Register and track agent lifecycle (online/offline, heartbeat monitoring)
+- Aggregate metrics from all agents with node metadata enrichment
+- Maintain cluster topology view based on agent registrations
+- Collect and expose node configurations for audit and consistency checks
+- Expose unified REST/Prometheus-style interfaces for external consumption
+- Provide proxy-level metrics (health, agent count, RPC latency, etc.)
+
+## Component Design
+
+### 1. Proxy Components
+
+#### 1.1 Agent Registry Component
+
+**Purpose**: Manages the lifecycle and state of all connected FODC Agents
+
+##### Core Responsibilities
+
+- **Agent Registration**: Accepts agent registration requests via gRPC
+- **Health Monitoring**: Tracks agent heartbeat and connection status
+- **State Management**: Maintains agent state (online/offline, last heartbeat
time)
+- **Topology Building**: Aggregates agent registrations into cluster topology
view
+- **Connection Management**: Handles connection failures, reconnections, and
cleanup
+
+##### Core Types
+
+**`AgentInfo`**
+```go
+type AgentInfo struct {
+ NodeID string // Unique node identifier
+ NodeRole databasev1.Role // Node role (liaison,
datanode-hot, etc.)
+ Address string // Agent gRPC address
+ Labels map[string]string // Node labels/metadata
+ RegisteredAt time.Time // Registration timestamp
+ LastHeartbeat time.Time // Last heartbeat timestamp
+ Status AgentStatus // Current agent status
+}
+
+type AgentStatus string
+
+const (
+ AgentStatusOnline AgentStatus = "online"
+ AgentStatusOffline AgentStatus = "offline"
+ AgentStatusUnknown AgentStatus = "unknown"
+)
+```
+
+**`AgentRegistry`**
+```go
+type AgentRegistry struct {
+ agents map[string]*AgentInfo // Map from node ID to agent info
+ mu sync.RWMutex // Protects agents map
+ logger *logger.Logger
+ heartbeatTimeout time.Duration // Timeout for considering agent
offline
+}
+```
+
+##### Key Functions
+
+**`RegisterAgent(ctx context.Context, info *AgentInfo) error`**
+- Registers a new agent or updates existing agent information
+- Validates node ID and role
+- Updates topology view
+- Returns error if registration fails
+
+**`UnregisterAgent(nodeID string) error`**
+- Removes agent from registry
+- Cleans up associated resources
+- Updates topology view
+- Called in the following scenarios:
+ - When agent's registration stream closes (connection lost)
+ - When agent's all streams are closed and connection is terminated
+ - When agent has been offline for extended period (cleanup after heartbeat
timeout)
+ - During graceful shutdown or manual agent removal
+ - When agent explicitly requests unregistration via stream
+
+**`UpdateHeartbeat(nodeID string) error`**
+- Updates last heartbeat timestamp for agent
+- Marks agent as online if it was offline
+- Returns error if agent not found
+
+**`GetAgent(nodeID string) (*AgentInfo, error)`**
+- Retrieves agent information by node ID
+- Returns error if agent not found
+
+**`ListAgents() []*AgentInfo`**
+- Returns list of all registered agents
+- Thread-safe read operation
+
+**`ListAgentsByRole(role databasev1.Role) []*AgentInfo`**
+- Returns agents filtered by role
+- Useful for role-specific operations
+
+**`CheckAgentHealth() error`**
+- Periodically checks agent health based on heartbeat timeout
+- Marks agents as offline if heartbeat timeout exceeded
+- Optionally unregisters agents that have been offline for extended period (if
`--agent-cleanup-timeout` is configured)
+- Returns aggregated health status
+
+##### Configuration Flags
+
+**`--agent-heartbeat-timeout`**
+- **Type**: `duration`
+- **Default**: `30s`
+- **Description**: Timeout for considering an agent offline if no heartbeat
received
+
+**`--max-agents`**
+- **Type**: `int`
+- **Default**: `1000`
+- **Description**: Maximum number of agents that can be registered
+
+**`--agent-cleanup-timeout`**
+- **Type**: `duration`
+- **Default**: `0` (disabled, agents are not auto-unregistered)
+- **Description**: Timeout for automatically unregistering agents that have
been offline. If set to 0, agents remain registered even when offline. If set
to a positive duration, agents offline longer than this timeout will be
unregistered.
+
+#### 1.2 gRPC Server Component
+
+**Purpose**: Handles bi-directional gRPC communication with FODC Agents
+
+##### Core Responsibilities
+
+- **Agent Connection Handling**: Accepts and manages gRPC connections from
agents
+- **Streaming Support**: Supports bi-directional streaming for metrics
+- **Protocol Implementation**: Implements FODC gRPC service protocol
+- **Connection Lifecycle**: Manages connection establishment, maintenance, and
cleanup
+
+##### Core Types
+
+**`FODCService`** (gRPC Service Implementation)
+```go
+type FODCService struct {
+ registry *AgentRegistry
+ metricsAggregator *MetricsAggregator
+ configCollector *ConfigurationCollector
+ logger *logger.Logger
+}
+
+// gRPC service methods (to be defined in proto)
+// RegisterAgent(stream RegisterRequest) (stream RegisterResponse) error
+// StreamMetrics(stream MetricsMessage) (stream MetricsRequest) error
+// StreamNodesConfigurations(stream ConfigurationMessage) (stream
ConfigurationRequest) error
+```
+
+**`AgentConnection`**
+```go
+type AgentConnection struct {
+ NodeID string
+ Stream grpc.ServerStream
+ Context context.Context
+ Cancel context.CancelFunc
+ LastActivity time.Time
+}
+```
+
+##### Key Functions
+
+**`RegisterAgent(stream FODCService_RegisterAgentServer) error`**
+- Handles bi-directional agent registration stream
+- Receives registration requests from agent
+- Validates registration information
+- Registers agent with AgentRegistry
+- Sends registration responses with assigned session ID
+- Maintains stream for heartbeat and re-registration
+
+**`StreamMetrics(stream FODCService_StreamMetricsServer) error`**
+- Handles bi-directional metrics streaming
+- Receives metrics requests from Proxy (on-demand collection)
+- Sends metrics from agent to Proxy when requested
+- Proxy sends metrics request via MetricsRequest when external client queries
metrics
+- Agent responds with MetricsMessage containing collected metrics
+- Manages stream lifecycle
+
+**`StreamNodesConfigurations(stream
FODCService_StreamNodesConfigurationsServer) error`**
+- Handles bi-directional configuration streaming
+- Receives configuration requests from Proxy (on-demand collection)
+- Sends configuration from agent to Proxy when requested
+- Proxy sends configuration request via ConfigurationRequest when external
client queries configurations
+- Agent responds with ConfigurationMessage containing collected configuration
+- Forwards configuration to ConfigurationCollector for storage
+- Manages stream lifecycle
+
+##### Connection Lifecycle Management
+
+**Stream Closure Handling**
+- When a stream closes (due to network error, agent shutdown, or timeout), the
gRPC server should:
+ 1. Detect stream closure via context cancellation or stream error
+ 2. Check if this is the last active stream for the agent
+ 3. If all streams are closed, call `AgentRegistry.UnregisterAgent()` to
clean up
+ 4. Update topology to reflect agent offline status
+
+**Graceful vs. Ungraceful Disconnection**
+- **Graceful**: Agent sends explicit disconnect message before closing stream
→ immediate unregistration
+- **Ungraceful**: Stream closes unexpectedly → unregistration happens after
detecting all streams closed
+- **Heartbeat Timeout**: Agent marked offline by `CheckAgentHealth()` →
unregistration may occur after extended offline period (configurable)
+
+##### Configuration Flags
+
+**`--grpc-listen-addr`**
+- **Type**: `string`
+- **Default**: `:17900`
+- **Description**: gRPC server address where the Proxy listens for agent
connections
+
+**`--grpc-max-msg-size`**
+- **Type**: `int`
+- **Default**: `4194304` (4MB)
+- **Description**: Maximum message size for gRPC messages
+
+#### 1.3 Metrics Aggregator Component
+
+**Purpose**: Aggregates and enriches metrics from all agents
+
+##### Core Responsibilities
+
+- **On-Demand Metrics Collection**: Requests metrics from agents via gRPC
streams when external clients query metrics
+- **Metrics Request Coordination**: Coordinates metrics requests to multiple
agents concurrently
+- **Metadata Enrichment**: Adds node metadata (role, ID, labels) to metrics
+- **Time Window Management**: Maintains sliding window of metrics for
`/metrics-windows` endpoint
+- **Normalization**: Normalizes metric formats and labels across agents
+- **Storage**: Stores aggregated metrics with time-series context
Review Comment:
By that, one day, when we need FODC proxy in the cluster mode, we could
allow FODC agent to register to multiple proxies.
This kind of light-weight architecture is more helpful in the future.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]