GitHub user addu390 created a discussion: [Discussion] Agent observability: 
tracing & evaluation beyond metrics + event log

## Context

Flink Agents today ships three observability surfaces:

1. **Metrics**: event/action counters, token usage per model, plus a 
custom-metrics API, reported via Flink metric reporters.
2. **Event log**: structured records per event, with log levels.
3. **EventListener**: per-event callback, configured via `event-listeners`.

That covers aggregates (metrics), audit records (event log), and hooks 
(listener). What it doesn't cover is reconstructing a single run as a causal 
tree:

```
InputEvent
 └─ action: classify
      └─ ChatRequest → ChatResponse
 └─ action: tool_use
      └─ ToolRequest → ToolResponse
      └─ ChatRequest → ChatResponse
 └─ OutputEvent
```

That tree shape is what `LangSmith` (and Langfuse, Phoenix, etc.) render for 
debugging, and what `MLflow` tracks for batch eval runs (one run = one 
trace, with metrics, prompts, and outputs tracked across versions of the 
agent). The `OpenTelemetry` GenAI semantic conventions are another option in 
this space. A trace view like this makes debugging and battle-testing 
non-trivial agent workflows tractable.
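
To make the tree shape concrete, here is a minimal sketch that rebuilds the tree above from flat records. It assumes each record carries a parent pointer; the `id`, `parent_id`, and `name` fields are hypothetical and are not the Flink Agents event-log schema.

```python
from collections import defaultdict

# Hypothetical flat records. `id`, `parent_id` and `name` are illustrative
# fields, not the Flink Agents event-log schema.
RECORDS = [
    {"id": 1, "parent_id": None, "name": "InputEvent"},
    {"id": 2, "parent_id": 1, "name": "action: classify"},
    {"id": 3, "parent_id": 2, "name": "ChatRequest → ChatResponse"},
    {"id": 4, "parent_id": 1, "name": "action: tool_use"},
    {"id": 5, "parent_id": 4, "name": "ToolRequest → ToolResponse"},
    {"id": 6, "parent_id": 4, "name": "ChatRequest → ChatResponse"},
    {"id": 7, "parent_id": 1, "name": "OutputEvent"},
]


def build_children(records):
    """Group records by parent id, preserving input order."""
    children = defaultdict(list)
    for rec in records:
        children[rec["parent_id"]].append(rec)
    return children


def render(records):
    """Return the tree as indented lines, one line per record."""
    children = build_children(records)
    lines = []

    def walk(parent_id, depth):
        # Depth-first walk: emit a node, then all of its descendants.
        for rec in children.get(parent_id, []):
            prefix = "" if depth == 0 else "  " * (depth - 1) + "└─ "
            lines.append(prefix + rec["name"])
            walk(rec["id"], depth + 1)

    walk(None, 0)
    return lines


if __name__ == "__main__":
    print("\n".join(render(RECORDS)))
```

The sketch only works if every record names its parent, which is the piece of causal information a trace needs beyond a flat list of records.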

## Scope

Not proposing this as a default for production streaming jobs. Full-fidelity 
per-event tracing at streaming QPS is too heavy; metrics + event log stay the 
right production defaults.

The question is whether the framework should provide first-class support for 
tracing where it actually pays off:

- Local authoring loop
- CI / batch eval and replay
- Staging mini-cluster runs
- Canaried / sampled production

In all four, you want a single run rendered end-to-end. Today you piece it 
together from event-log records.
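
For the sampled-production case, one common approach is to make the sampling decision per run rather than per event, so that a sampled run keeps its whole tree. A minimal sketch, assuming each run has a stable string identifier; `should_trace`, `run_id`, and `sample_rate` are hypothetical names, not Flink Agents APIs or config keys.

```python
import hashlib


def should_trace(run_id: str, sample_rate: float) -> bool:
    """Decide once per run whether to trace it.

    Hashing the run id makes the decision deterministic, so every event
    belonging to the same run gets the same answer.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be in [0, 1]")
    digest = hashlib.sha256(run_id.encode("utf-8")).digest()
    # Keep the top 53 bits so the bucket is an exact float in [0, 1).
    bucket = (int.from_bytes(digest[:8], "big") >> 11) / 2**53
    return bucket < sample_rate


if __name__ == "__main__":
    traced = sum(should_trace(f"run-{i}", 0.1) for i in range(10_000))
    print(f"traced {traced} of 10000 runs")
```

Because the bucket depends only on the run id, raising `sample_rate` keeps every previously sampled run in the sample and adds new ones.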

## Open question

Is this worth the framework solving? Curious if others hit this in their own 
dev loop.

GitHub link: https://github.com/apache/flink-agents/discussions/710
