GitHub user xintongsong added a comment to the discussion: [Discussion] Agent 
observability: tracing & evaluation beyond metrics + event log

Hi @addu390,

+1 on this. Being able to reconstruct and visualize the trace of a single run 
is really helpful for observing and understanding agent behavior, and it's 
exactly the gap that neither metrics nor the event log fills today.

I'd suggest splitting this into two fairly independent problems:
- **Recording:** keep enough information in the event log to reconstruct the 
causal tree.
- **Reconstruction & visualization:** rebuild and render a single run from that 
information.

Once you split it this way, the overhead concern mostly lands on the first 
part, and I think that part is actually pretty light. We already record every 
event in the event log. To reconstruct the causal tree, we basically just need 
one extra field per event: which action emitted the event. With that, the whole 
run can be rebuilt from the event log.
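
To make the "one extra field" idea concrete, here is a minimal sketch. It is not the current Flink Agents event log schema: the record shape, the `emitted_by` field name, and the event and action names are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LoggedEvent:
    """Hypothetical event log record; not the real Flink Agents schema."""
    event_id: str
    event_type: str
    timestamp: int
    # The single extra field: the action that emitted this event.
    # None marks the external input that started the run.
    emitted_by: Optional[str]


def forward_edges(events):
    """Map each action name to the ids of the events it emitted, in time order."""
    edges = defaultdict(list)
    for ev in sorted(events, key=lambda e: e.timestamp):
        if ev.emitted_by is not None:
            edges[ev.emitted_by].append(ev.event_id)
    return dict(edges)


# Illustrative log for one run (made-up event types and action names).
log = [
    LoggedEvent("e1", "input", 1, None),
    LoggedEvent("e2", "chat_request", 2, "plan"),
    LoggedEvent("e3", "tool_request", 3, "chat"),
    LoggedEvent("e4", "output", 4, "chat"),
]

print(forward_edges(log))
```

This only covers the action-to-event direction, which is the part that has to be recorded.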

The reverse edges (which actions an event triggered) don't even need to be 
recorded explicitly. They can be derived from the action trigger rules plus 
timestamps. So the added cost on the recording side is small, and I think it 
can be treated separately from the concern that full-fidelity tracing is too 
heavy at streaming QPS.
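
A rough sketch of that derivation, under some loud assumptions: the record shape, the `emitted_by` field, and the `TRIGGER_RULES` mapping are hypothetical, and "the most recent earlier event that the emitting action listens to" is just one possible heuristic for using timestamps. It would need refinement if invocations of the same action can overlap.

```python
from typing import NamedTuple, Optional


class LoggedEvent(NamedTuple):
    """Hypothetical event log record; not the real Flink Agents schema."""
    event_id: str
    event_type: str
    timestamp: int
    emitted_by: Optional[str]  # assumed extra field; None for the run input


# Assumed trigger rules: action name -> event types it listens to.
TRIGGER_RULES = {
    "plan": {"input", "chat_response"},
    "chat": {"chat_request"},
}


def derive_edges(events, trigger_rules):
    """Return (trigger_event_id, action, emitted_event_id) triples.

    Heuristic: the event that triggered the emitting action is taken to be
    the most recent earlier event whose type the action listens to.
    """
    ordered = sorted(events, key=lambda e: e.timestamp)
    edges = []
    for ev in ordered:
        if ev.emitted_by is None:
            continue  # root of the run, nothing triggered it
        listened = trigger_rules.get(ev.emitted_by, set())
        candidates = [
            c for c in ordered
            if c.event_type in listened and c.timestamp < ev.timestamp
        ]
        if candidates:
            trigger = candidates[-1]  # ordered ascending, so last is latest
            edges.append((trigger.event_id, ev.emitted_by, ev.event_id))
    return edges


# Illustrative run: plan -> chat -> plan -> chat -> plan -> output.
log = [
    LoggedEvent("e1", "input", 1, None),
    LoggedEvent("e2", "chat_request", 2, "plan"),
    LoggedEvent("e3", "chat_response", 3, "chat"),
    LoggedEvent("e4", "chat_request", 4, "plan"),
    LoggedEvent("e5", "chat_response", 5, "chat"),
    LoggedEvent("e6", "output", 6, "plan"),
]

for edge in derive_edges(log, TRIGGER_RULES):
    print(edge)
```

Note how the second `chat_request` (`e4`) is attached to `e3` rather than to the original input: the trigger rules narrow the candidates, and the timestamps pick among them.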

The second part can be fully on-demand rather than always-on. We would only run 
it when needed, e.g. for local debugging, CI/eval, or staging. The rendering form 
doesn't have to be a tree either; PlantUML or something else would work too, 
and we can discuss that separately. As long as the recording side captures the 
necessary info, there's a lot of flexibility in what we do with it later.
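
As one example of how cheap the rendering side can be once the edges exist, here is a small sketch that turns causal edges into PlantUML text. The `(trigger_event, action, emitted_event)` edge shape and the sample ids are assumptions for illustration, not an existing Flink Agents format.

```python
def to_plantuml(edges):
    """Render (trigger_event, action, emitted_event) triples as PlantUML text.

    Each event becomes a component node and each action becomes an arrow label.
    """
    lines = ["@startuml"]
    for trigger, action, emitted in edges:
        lines.append(f"[{trigger}] --> [{emitted}] : {action}")
    lines.append("@enduml")
    return "\n".join(lines)


# Hypothetical edges for a single run.
sample_edges = [
    ("e1", "plan", "e2"),
    ("e2", "chat", "e3"),
]

print(to_plantuml(sample_edges))
```

Since this runs offline over an already-recorded log, it adds nothing to the streaming path.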

GitHub link: 
https://github.com/apache/flink-agents/discussions/710#discussioncomment-17088983

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]
