GitHub user addu390 created a discussion: [Discussion] Agent observability:
tracing & evaluation beyond metrics + event log
## Context
Flink Agents today ships three observability surfaces:
1. **Metrics**: event/action counters, token usage per model, plus a
custom-metrics API, exposed via the Flink metric reporter.
2. **Event log**: structured records per event, with log levels.
3. **EventListener**: per-event callback, configured via `event-listeners`.
That covers aggregates (metrics), audit records (event log), and hooks
(listener). What it doesn't cover is reconstructing a single run as a causal
tree:
```
InputEvent
 └─ action: classify
     └─ ChatRequest → ChatResponse
 └─ action: tool_use
     └─ ToolRequest → ToolResponse
     └─ ChatRequest → ChatResponse
 └─ OutputEvent
```
That tree shape is what `LangSmith` (and Langfuse, Phoenix, etc.) render for
debugging, and where `MLflow` slots in for batch eval runs (one run = one
trace, with metrics, prompts, and outputs tracked across versions of the
agent). There are also the `OpenTelemetry` GenAI semantic conventions. A
trace view like this makes debugging and battle-testing non-trivial agent
workflows tractable.
## Scope
Not proposing this as a default for production streaming jobs. Full-fidelity
per-event tracing at streaming QPS is too heavy; metrics + event log stay the
right production defaults.
The question is whether the framework should provide first-class support for
tracing where it actually pays off:
- Local authoring loop
- CI / batch eval and replay
- Staging mini-cluster runs
- Canaried / sampled production
In all four, you want a single run rendered end-to-end. Today you piece it
together from event-log records.
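To make "piecing it together" concrete, here is a minimal sketch of rebuilding one run's tree from flat records. It assumes each record carries a run id and a parent pointer; the field names (`run_id`, `event_id`, `parent_id`, `name`) and the `render_run` helper are hypothetical, made up for illustration, and are not the actual Flink Agents event-log schema. Whether the event log exposes that causal link at all is part of the question.

```python
from collections import defaultdict

# Hypothetical flat records. The field names are assumptions for
# illustration only, not the Flink Agents event-log format.
records = [
    {"run_id": "r1", "event_id": 1, "parent_id": None, "name": "InputEvent"},
    {"run_id": "r1", "event_id": 2, "parent_id": 1, "name": "action: classify"},
    {"run_id": "r1", "event_id": 3, "parent_id": 2, "name": "ChatRequest → ChatResponse"},
    {"run_id": "r1", "event_id": 4, "parent_id": 1, "name": "action: tool_use"},
    {"run_id": "r1", "event_id": 5, "parent_id": 4, "name": "ToolRequest → ToolResponse"},
    {"run_id": "r1", "event_id": 6, "parent_id": 4, "name": "ChatRequest → ChatResponse"},
    {"run_id": "r1", "event_id": 7, "parent_id": 1, "name": "OutputEvent"},
]


def render_run(records, run_id):
    """Return the lines of an indented tree for a single run."""
    # Group this run's records by their parent so children can be looked up.
    children = defaultdict(list)
    for rec in records:
        if rec["run_id"] == run_id:
            children[rec["parent_id"]].append(rec)

    lines = []

    def walk(parent_id, depth):
        # Visit children in event_id order, used here as a stand-in for time order.
        for rec in sorted(children.get(parent_id, []), key=lambda r: r["event_id"]):
            prefix = "" if depth == 0 else "    " * (depth - 1) + "└─ "
            lines.append(prefix + rec["name"])
            walk(rec["event_id"], depth + 1)

    walk(None, 0)  # roots are the records without a parent
    return lines


if __name__ == "__main__":
    print("\n".join(render_run(records, "r1")))
```

The sketch only works because the parent pointer is given. Without it, the tree has to be inferred from timestamps and keys, which is the manual step described above.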
## Open question
Is this worth the framework solving? Curious if others hit this in their own
dev loop.
GitHub link: https://github.com/apache/flink-agents/discussions/710